Hiring: Site Reliability Engineer (Infra Team)
Job Description
**Responsibilities:**
- Collaborate with software development teams to ensure the reliability, performance, and high availability of production systems and services.
- Identify and proactively address potential issues that could impact system reliability, including capacity planning and incident response.
- Participate in on-call rotations and respond to emergencies promptly.
- Assist in developing and maintaining service level indicators (SLIs) and service level objectives (SLOs).
- Apply best practices and principles of site reliability engineering throughout the software development lifecycle.
- Monitor and analyze system metrics to proactively identify and resolve potential issues, ensuring high availability and minimal downtime.
- Design and maintain comprehensive monitoring and alerting systems for quick detection and response to incidents.
- Conduct thorough post-incident reviews and root cause analysis, implementing preventive measures to minimize future occurrences.
- Automate operational tasks and processes to improve efficiency and reduce manual effort.
- Implement and maintain disaster recovery and business continuity plans to ensure the integrity and availability of critical systems.
- Identify recurring issues and work with IT and business partners to remediate them through the problem management process.
**Requirements:**
- Solid understanding of networking principles, protocols, and troubleshooting techniques.
- Knowledge of distributed systems, microservices architecture, and cloud-native technologies.
- Proficiency with operating systems, networking, and computer systems architecture.
- Experience with technologies such as Nginx, HAProxy, GitLab CI/CD, Docker, Kubernetes, or similar.
- Familiarity with programming languages (e.g., Bash, Python, .NET, Node.js).
- Experience with monitoring and observability tools like Prometheus, Grafana, and Zabbix.
- Strong incident management skills, including the ability to triage and resolve issues affecting system reliability and performance.
- Familiarity with error budgeting concepts and the ability to prioritize and allocate error budget for optimal system reliability and availability.
- Knowledge of database administration and performance tuning.
- Proficient in troubleshooting and resolving performance bottlenecks and complex system issues.
- Strong background in Linux/Unix and Windows server administration.
- Strong communication skills and the ability to work effectively across multiple technical teams.
- Good self-learning and research skills (the ability to find an answer to a question or a solution to a problem).
- Good teamwork skills.
- Strong documentation and reporting skills.
- Minimum of 3 years of experience in a similar role, preferably in a large-scale production environment.
**Employment Type:** Full-Time
**Salary:** Negotiable, starting from 25 million IRR, depending on the technical interview
**Age Requirement:** Under 30 years - Preferably male
**Work Hours:** On-site and shift-based
Required Skills
- reliability
- irr
- Python
Minimum Work Experience
- Less than three years
Salary
- From 24,000,000 Toman
Gender
- No preference
Military Service Status
- No preference