About The Team
The NOC Team Lead will oversee and guide the Operation and Performance Monitoring (OPM) team, which is responsible for managing, maintaining, and evolving the organization’s monitoring tools and practices. This includes ensuring application and infrastructure uptime and performance while paving the way for a future self-service monitoring stack that empowers other development teams.
The role requires a balanced combination of technical expertise in monitoring technologies and leadership skills to manage the team and align with organizational goals.
Technical Responsibilities
● Monitoring Tools and Infrastructure:
○ Oversee the configuration, deployment, and maintenance of monitoring stacks (e.g., Prometheus, Grafana, ELK, Zabbix).
○ Design and implement reliable alerting systems for infrastructure and application layers.
○ Optimize and scale monitoring systems to accommodate growing data and infrastructure needs.
● Incident Management and Analysis:
○ Lead efforts to detect, diagnose, and communicate performance and uptime issues across applications and infrastructure.
○ Contribute detailed root cause analysis (RCA) for critical incidents and implement preventive measures.
● Monitoring Strategy Development:
○ Define best practices for monitoring across diverse platforms, ensuring high coverage and effectiveness.
○ Architect and lead the development of a self-service monitoring stack, enabling development teams to integrate and maintain their monitoring metrics autonomously.
● Collaboration with Development Teams:
○ Work closely with application teams to identify and monitor critical system metrics.
○ Develop MaaS (Monitoring as a Service) and related templates for self-service monitoring integration.
Managerial Responsibilities
● Team Leadership:
○ Manage and mentor team members, ensuring skill development and alignment with the team’s objectives.
○ Foster a culture of collaboration, continuous improvement, and technical excellence.
● Project and Task Management:
○ Plan, prioritize, and oversee the team’s work, ensuring timely and high-quality deliverables.
○ Manage resource allocation and identify skill gaps within the team.
● Strategic Alignment:
○ Collaborate with other technical and managerial leads to align monitoring objectives with organizational goals.
○ Advocate for monitoring as a critical component of system reliability and operational efficiency.
● Stakeholder Communication:
○ Act as the point of contact between the OPM team and other technical teams or leadership.
○ Provide regular updates on system performance, incidents, and the progress of monitoring initiatives.
Qualifications
Technical Skills:
● Proven experience with monitoring tools and technologies (e.g., Prometheus, Grafana, Elastic Stack).
● Strong understanding of infrastructure and application performance metrics.
● Expertise in scripting and automation (e.g., Python, Bash) for monitoring system setup and maintenance.
● Familiarity with cloud environments (e.g., AWS, Azure, GCP) and their monitoring solutions.
● Knowledge of CI/CD pipelines and integration of monitoring systems.
● Experience designing and implementing APIs for self-service systems.
Leadership Skills:
● Prior experience leading technical teams or projects.
● Ability to coach and mentor team members.
● Strong organizational skills to manage multiple tasks and projects effectively.
● Excellent communication and interpersonal skills for cross-team collaboration.
Preferred Experience
● Experience in building or managing self-service systems.
● Background in site reliability engineering (SRE).
● Knowledge of machine learning or advanced analytics for predictive monitoring.
What Success Looks Like
● Short-term: Streamlined and efficient monitoring processes for existing applications and infrastructure.
● Long-term: Deployment of a robust self-service monitoring stack, with high adoption among development teams and reduced dependency on the OPM team.
At Digikala, a company operating in e-commerce, we are working to realize the dream of "a smile for all of Iran." To that end, by drawing on the latest technologies and continuously developing technology-based services, we pursue our values of customer focus, passion for excellence, teamwork, and results orientation.
The Digikala Group gives us the opportunity to work alongside people with diverse specializations in one organization. In addition, given Digikala's rapid growth, we have the chance to grow and develop by facing challenges and taking part in a variety of development and training programs.