Lead impactful projects at a top-tier organization with innovative technologies. Collaborate with skilled professionals in a dynamic and supportive environment. Advance your career with exceptional growth opportunities.
Site Reliability Engineer
in Information Technology, Permanent
Job Description
Overview
- Design, build, and maintain systems that ensure application reliability, scalability, and performance.
- Apply software engineering principles to operations, automating processes and enhancing infrastructure.
- Collaborate with development teams to ensure reliability and scalability of features and services.
- Participate in on-call rotations to address production issues and respond to incidents.
- Develop and maintain tools supporting efficient and reliable operations.
- Ensure compliance with security standards and regularly update policies.
- Contribute to capacity planning and implement auto-scaling strategies to meet future demand.
- Enhance CI/CD pipelines to ensure seamless deployments.
- Promote DevOps best practices to improve the software development lifecycle.
Key Responsibilities & Duties
- Design and maintain scalable infrastructure using Terraform for infrastructure as code.
- Deploy and optimize Kubernetes clusters and containerized applications using Docker.
- Develop monitoring solutions that provide visibility into system performance and application health.
- Define and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Respond to incidents, perform root cause analysis, and implement preventive solutions.
- Automate processes and develop scripts using Python, Bash, or Go.
- Enhance and maintain CI/CD pipelines using GitLab CI.
- Implement security best practices and ensure compliance with industry standards.
- Collaborate with teams to resolve infrastructure-related issues and drive reliability improvements.
Job Requirements
- Bachelor of Science (BS) degree in a relevant field.
- Minimum of 3 years of experience in SRE, DevOps, or related roles.
- Proficiency with cloud platforms like AWS, GCP, or Azure.
- Strong experience with Kubernetes and Docker for containerized applications.
- Expertise in Terraform for infrastructure as code, plus experience with configuration management tools.
- Extensive experience with monitoring tools like Datadog, Prometheus, or Grafana.
- Proven ability to define and maintain SLOs and SLAs.
- Strong scripting skills in Python, Bash, or Go for automation.
- Experience with CI/CD practices and tools like GitLab CI.