
Hiring: SRE Engineer

Snappfood
Tehran, Tehran

Job Description

● Collaborate with software development teams to ensure the reliability, performance, and high availability of production systems and services.
● Identify and proactively address potential issues that could impact system reliability, including capacity planning and incident response.
● Develop and implement automated solutions to enhance system reliability and performance.
● Participate in on-call rotations and respond to emergencies promptly.
● Assist in developing and maintaining service level indicators (SLIs) and service level objectives (SLOs).
● Apply best practices and principles of site reliability engineering throughout the software development lifecycle.
● Monitor and analyze system metrics to proactively identify and resolve potential issues, ensuring high availability and minimal downtime.
● Design and maintain comprehensive monitoring and alerting systems for quick detection and response to incidents.
● Conduct thorough post-incident reviews and root cause analysis, implementing preventive measures to minimize future occurrences.
● Automate operational tasks and processes to improve efficiency and reduce manual effort.
● Implement and maintain disaster recovery and business continuity plans to ensure the integrity and availability of critical systems.
● Provide support and guidance to development teams in designing and deploying applications in production environments.
● Stay up to date with industry trends and best practices, and actively contribute to infrastructure and operations improvements.

Requirements:

● Solid understanding of networking principles, protocols, and troubleshooting techniques. 
● Knowledge of distributed systems, microservices architecture, and cloud-native technologies. 
● Proficiency with operating systems, networking, and computer systems architecture. 
● Experience with technologies such as Nginx, HAProxy, Chef, Ansible, Terraform, GitLab CI/CD, Docker, Kubernetes, or similar. 
● Familiarity with programming languages. 
● Experience with monitoring and observability tools like Prometheus, Grafana, and New Relic. 
● Strong incident management skills, including the ability to triage and resolve issues affecting system reliability and performance. 
● Familiarity with error budgeting concepts and the ability to prioritize and allocate error budget for optimal system reliability and availability. 
● Knowledge of database administration and performance tuning. 
● Proficient in troubleshooting and resolving performance bottlenecks and complex system issues. 
● Minimum of 3 years of experience in a similar role, preferably in a large-scale production environment.
● Bachelor's degree in computer science, software engineering, or a related field. A master's degree is a plus.

Required Skills

  • SRE
  • CI/CD
  • Docker

Minimum Work Experience

  • Three to six years

Gender

  • No preference

Military Service Status

  • No preference

Employment Type:

Full-time

Job Category:

IT / DevOps / Server

Posting Date:

1402/03/10 (expired)