Hiring: Senior SRE Engineer
Job Description
In the Story of Snappfood, we believe in creating value that goes beyond the ordinary. We aim to foster innovation and are eager to have you on our team to help us tackle our business challenges with creativity, intelligence, and agility.
We are waiting for you to continue this story.
Responsibilities:
● Collaborate with software development teams to ensure the reliability, performance, and high availability of production systems and services.
● Identify and proactively address potential issues that could impact system reliability, including capacity planning and incident response.
● Develop and implement automated solutions to enhance system reliability and performance.
● Participate in on-call rotations and respond to emergencies promptly.
● Assist in developing and maintaining service level indicators (SLIs) and service level objectives (SLOs).
● Apply best practices and principles of site reliability engineering throughout the software development lifecycle.
● Monitor and analyze system metrics to proactively identify and resolve potential issues, ensuring high availability and minimal downtime.
● Design and maintain comprehensive monitoring and alerting systems for quick detection and response to incidents.
● Conduct thorough post-incident reviews and root cause analysis, implementing preventive measures to minimize future occurrences.
● Automate operational tasks and processes to improve efficiency and reduce manual effort.
● Implement and maintain disaster recovery and business continuity plans to ensure the integrity and availability of critical systems.
● Provide support and guidance to development teams in designing and deploying applications in production environments.
● Stay up to date with industry trends and best practices, and actively contribute to infrastructure and operations improvements.
Requirements:
● Solid understanding of networking principles, protocols, and troubleshooting techniques.
● Knowledge of distributed systems, microservices architecture, and cloud-native technologies.
● Proficiency with operating systems, networking, and computer systems architecture.
● Experience with technologies such as Nginx, HAProxy, Chef, Ansible, Terraform, GitLab CI/CD, Docker, Kubernetes, or similar.
● Familiarity with programming languages.
● Experience with monitoring and observability tools like Prometheus, Grafana, and New Relic.
● Strong incident management skills, including the ability to triage and resolve issues affecting system reliability and performance.
● Familiarity with error budgeting concepts and the ability to prioritize and allocate error budget for optimal system reliability and availability.
● Knowledge of database administration and performance tuning.
● Proficiency in troubleshooting and resolving performance bottlenecks and complex system issues.
● Minimum of 5 years of experience in a similar role, preferably in a large-scale production environment.
● Bachelor's degree in computer science, software engineering, or a related field. A master's degree is a plus.
Benefits:
● Vouchers for vacation, gym, and therapy sessions.
● Supplementary insurance.
● Access to an educational platform with advanced courses.
● Snappfood discount codes.
● Loans.
Required Skills
- SRE
- Troubleshooting
- Grafana
- Databases
Minimum Work Experience
- More than six years
Gender
- No preference
Military Service Status
- No preference