Lead impactful DevOps projects for AI/ML systems at a top-tier organization. Collaborate on cutting-edge technologies to drive innovation and scalability. Enhance your expertise in advanced AI infrastructure and operations.
Senior DevOps Engineer
in Information Technology, Permanent
Job Description
Overview
- Lead the development and operation of advanced DevOps pipelines supporting AI/ML lifecycle processes.
- Collaborate with multidisciplinary teams to deliver scalable, secure, and efficient AI platforms.
- Develop and maintain CI/CD pipelines for AI/ML services and cloud-based infrastructure.
- Automate infrastructure provisioning and orchestration using Terraform and Kubernetes.
- Ensure reliability, scalability, and observability of AI platforms and workloads.
- Implement security, compliance, and governance requirements for AI systems and processes.
- Support production workloads for Generative AI systems and LLM-based services.
- Document standards, best practices, and processes for DevOps and AI infrastructure.
Key Responsibilities & Duties
- Design and operate scalable DevOps pipelines for model lifecycle automation and deployment.
- Develop cloud-native infrastructure on AWS using Kubernetes and containerized workloads.
- Implement model versioning, artifact management, and experiment tracking systems.
- Ensure system health monitoring, model performance tracking, and drift detection.
- Collaborate with AI/ML engineers to standardize deployment patterns and practices.
- Participate in incident response, troubleshooting, and continuous improvement initiatives.
- Optimize costs for compute-intensive workloads while ensuring scalability and efficiency.
- Document reference architectures and best practices for AI infrastructure and operations.
Job Requirements
- A Bachelor of Science degree in a relevant field is required.
- 10+ years of experience in DevOps, SRE, or Platform Engineering roles.
- Proficiency in AWS cloud services, Kubernetes, and CI/CD pipeline development.
- Hands-on experience with Terraform and scripting/programming languages like Python.
- Experience with MLOps platforms, model registries, and experiment tracking systems.
- Exposure to Generative AI workloads and LLM-based services in production environments.
- Strong communication skills and ability to work effectively across teams.
- AWS certifications are preferred but not mandatory.