This job posting has expired.
Responsibilities:
● Collaborate with software development teams to ensure the reliability, performance, and high availability of production systems and services.
● Identify and proactively address potential issues that could impact system reliability, including capacity planning and incident response.
● Develop and implement automated solutions to enhance system reliability and performance.
● Participate in on-call rotations and respond to emergencies promptly.
● Assist in developing and maintaining service level indicators (SLIs) and service level objectives (SLOs).
● Apply best practices and principles of site reliability engineering throughout the software development lifecycle.
● Monitor and analyze system metrics to proactively identify and resolve potential issues, ensuring high availability and minimal downtime.
● Design and maintain comprehensive monitoring and alerting systems for quick detection and response to incidents.
● Conduct thorough post-incident reviews and root cause analysis, implementing preventive measures to minimize future occurrences.
● Automate operational tasks and processes to improve efficiency and reduce manual effort.
● Implement and maintain disaster recovery and business continuity plans to ensure the integrity and availability of critical systems.
● Provide support and guidance to development teams in designing and deploying applications in production environments.
● Stay up to date with industry trends and best practices, and actively contribute to infrastructure and operations improvements.
Requirements:
● Solid understanding of networking principles, protocols, and troubleshooting techniques.
● Knowledge of distributed systems, microservices architecture, and cloud-native technologies.
● Proficiency with operating systems, networking, and computer systems architecture.
● Experience with technologies such as Nginx, HAProxy, Chef, Ansible, Terraform, GitLab CI/CD, Docker, Kubernetes, or similar.
● Familiarity with programming languages.
● Experience with monitoring and observability tools like Prometheus, Grafana, and New Relic.
● Strong incident management skills, including the ability to triage and resolve issues affecting system reliability and performance.
● Familiarity with error budgeting concepts and the ability to prioritize and allocate error budget for optimal system reliability and availability.
● Knowledge of database administration and performance tuning.
● Proficiency in troubleshooting and resolving performance bottlenecks and complex system issues.
● Minimum of 3 years of experience in a similar role, preferably in a large-scale production environment.
● Bachelor's degree in computer science, software engineering, or a related field. A master's degree is a plus.
Snappfood is the largest online food ordering service in Iran. Alongside food, it also offers services such as ordering bread, protein products, pastries, and fruit.
The warm support and trust of more than 5 million users has driven us to keep looking for new things to create and for ways to deliver better, higher-quality service.
Along the way, we are keen to work with people whose intelligence and speed of action help us get through business challenges and problems.