Lead impactful projects at a cutting-edge technology company. Enhance your expertise in GPU-accelerated Kubernetes platforms and ML infrastructure. Collaborate with innovative teams to drive technological advancements.
Platform Engineer
Information Technology, Permanent
Job Description
Overview
- Lead the operation and optimization of GPU-accelerated Kubernetes platforms for high-performance machine learning workloads.
- Deploy and manage Kubernetes clusters on bare-metal infrastructure with hybrid cloud capabilities for scalability.
- Design and maintain CI/CD pipelines to ensure efficient software deployment across diverse environments.
- Develop observability systems for real-time monitoring and performance optimization of infrastructure.
- Collaborate with development teams to enhance productivity through streamlined tooling and processes.
- Implement infrastructure-as-code practices using modern tools to ensure consistency and scalability.
- Manage core infrastructure components, including networking, storage, and system configurations.
- Ensure compliance with security standards and harden systems against vulnerabilities.
Key Responsibilities & Duties
- Deploy and manage Kubernetes clusters on bare-metal infrastructure supporting NVIDIA GPUs.
- Optimize GPU clusters for machine learning training workloads, ensuring reliability and performance.
- Design and operate CI/CD pipelines for automated build and deployment processes.
- Develop observability stacks for real-time system health monitoring and alerting.
- Collaborate with development teams to enhance tooling for deployment efficiency.
- Implement infrastructure-as-code practices using Terraform, Helm, and Ansible.
- Manage networking, storage, and system configurations for high-performance clusters.
- Ensure systems meet defense-grade security and compliance standards.
Job Requirements
- Bachelor's degree in Computer Science or a related field.
- 3-5 years of experience in platform engineering or DevOps roles.
- Proficiency in Python and Bash for automation and tooling.
- Deep knowledge of Kubernetes administration and GPU environments.
- Experience with CI/CD pipelines and observability tools.
- Strong expertise in Linux systems and infrastructure-as-code practices.
- Ability to manage complex systems and ensure security compliance.
- Preferred: experience with ML orchestration tools and build toolchains.