Hiring: SRE Lead

Snapp Group
Tehran, Tehran

Job Description

We are looking for an experienced Site Reliability Engineer (SRE) Lead to join our team at Snapp! Express. As the SRE Lead, you will be responsible for leading and managing our SRE team to ensure the availability, reliability, and performance of the systems that power our logistics platform. You will work closely with cross-functional teams to optimize our infrastructure, automate processes, and improve incident management practices. Your technical expertise, leadership skills, and strategic mindset will be essential in driving operational excellence and ensuring the seamless functioning of our systems.  

Duties

•     Lead and manage a team of SREs, providing direction, guidance, and mentorship to ensure the success of the team and the reliability of our systems.
•     Oversee the design, implementation, and maintenance of monitoring, alerting, and reporting solutions using industry-leading tools such as Grafana, Prometheus, and other relevant technologies.
•     Collaborate with development, operations, and other cross-functional teams to identify and implement system improvements, including performance optimizations, capacity planning, automation of repetitive tasks, and incident management processes.
•     Drive incident management efforts, including root cause analysis, incident response, and post-incident reviews, to identify and address systemic issues and prevent future incidents.
•     Develop and maintain documentation, including runbooks, playbooks, and other technical documentation, to ensure effective knowledge sharing and incident resolution.
•     Stay up to date with industry trends and best practices in SRE, DevOps, automation, and incident management, and proactively apply this knowledge to improve Snapp! Express systems.
•     Provide technical leadership and expertise in the evaluation and implementation of new technologies, tools, and processes to optimize system performance and reliability.
•     Foster a culture of continuous improvement, innovation, and accountability within the SRE team and across the organization.
•     Collaborate with HR and management in recruiting, hiring, and onboarding SRE talent.
•     Conduct performance evaluations, provide feedback, and identify professional development opportunities for team members.


Requirements 

•     Bachelor's degree in computer science, engineering, or a related field.
•     Minimum of 3 years of experience in Site Reliability Engineering or similar roles, with a track record of leading and managing teams in a fast-paced, production environment.
•     Deep expertise in monitoring, alerting, and reporting tools, such as Grafana, Prometheus, and other relevant technologies.
•     Proficiency in scripting and automation using languages such as Python or Bash.
•     Strong understanding of Linux-based operating systems, including performance tuning, troubleshooting, and security.
•     Excellent problem-solving skills and ability to analyze and resolve complex technical issues in a timely manner.
•     Excellent communication and leadership skills, with the ability to effectively lead and inspire a team.
•     Strong organizational and project management skills, with the ability to prioritize and manage multiple tasks and projects simultaneously.
•     Proven experience in incident management, root cause analysis, and post-incident reviews.
•     Ability to work in a collaborative, cross-functional environment and drive change across teams.




Required Skills

  • SRE
  • Bash
  • Grafana
  • DevOps

Minimum Work Experience

  • Three to six years

Gender

  • No preference

Military Service Status

  • No preference

Employment Type:

Full-time

Job Category:

IT / DevOps / Server

Posting Date:

1402/02/19, Solar Hijri calendar (expired)