At Snappfood, we believe in creating value that goes beyond the ordinary. We embrace innovation and continuously challenge ourselves to build reliable and scalable technology that serves millions of users every day.
We are looking for an experienced Observability Engineer to join our Production Reliability & Operations team and help us improve the reliability, visibility, and operational excellence of our production platforms. If you enjoy solving complex operational problems, building monitoring solutions, and enabling engineering teams with better observability, we would love to have you continue this story with us.
Role Summary:
As an Observability Engineer, you will be responsible for designing, implementing, and continuously improving monitoring, alerting, and observability practices across our production systems. You will work closely with engineering teams to ensure that services are measurable, actionable, and operationally mature.
You will play a key role in improving incident detection, reducing Mean Time to Detect (MTTD), and enabling faster and more effective incident response.
Responsibilities:
Monitoring & Observability
Design, implement, and maintain monitoring solutions for applications, infrastructure, and business-critical services.
Build and maintain dashboards, service health indicators, and operational reports.
Define and promote observability standards, including metrics, logs, traces, and service instrumentation.
Ensure critical systems have adequate monitoring coverage and operational visibility.
Continuously improve telemetry quality and monitoring effectiveness.
Alert Engineering
Design and maintain actionable alerts and escalation policies.
Reduce alert fatigue by improving signal-to-noise ratio and eliminating duplicate or low-value alerts.
Define alert standards and thresholds based on service reliability objectives.
Develop proactive monitoring mechanisms to identify issues before they impact customers.
Incident Detection & Response
Continuously monitor production environments and respond to operational incidents.
Participate in incident response activities and support major incident investigations.
Analyze monitoring data during incidents to assist troubleshooting and root cause identification.
Collaborate with engineering teams to implement preventive actions and improve service resilience.
Reliability Improvement
Identify monitoring gaps and recommend improvements to system reliability and operational readiness.
Partner with engineering teams to improve instrumentation, observability, and service maturity.
Support the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and reliability reporting.
Documentation & Reporting
Maintain monitoring documentation, runbooks, dashboards, and operational procedures.
Produce reports on service health, incidents, alert trends, and monitoring coverage.
Ensure incident records and operational documentation remain accurate and up to date.
Operational Support
Participate in a 24/7 shift rotation to ensure continuous operational visibility and timely incident response.
Participate in on-call rotations and emergency response activities when required.
Requirements:
3+ years of experience in Observability Engineering, Site Reliability Engineering (SRE), Production Operations, NOC, Systems Engineering, or related fields.
Experience operating and supporting production systems in a 24/7 environment.
Hands-on experience with monitoring, troubleshooting, and incident response processes.
Strong experience with monitoring and observability platforms such as Prometheus, Grafana, and Zabbix.
Experience with centralized logging solutions such as ELK, Loki, and Splunk.
Familiarity with distributed tracing and observability concepts, including OpenTelemetry and Tempo.
Experience configuring dashboards, alerts, service health reports, and monitoring automation.
Solid understanding of Linux/Unix systems and troubleshooting methodologies.
Good understanding of networking fundamentals and distributed systems concepts.
Familiarity with cloud-native environments and container platforms is a plus.
Preferred Qualifications:
Experience with Kubernetes and containerized environments.
Understanding of SLI/SLO concepts and reliability engineering practices.
Experience with automation and scripting using Python, Bash, or Go.
Experience working in high-traffic, mission-critical production environments.
Snappfood is the largest online food ordering service in Iran. By offering a range of services such as ordering bread, protein products, pastries, and fruit, we strive to build a complete, convenient, and fast experience for our users.
The trust and support of several million users encourage us to keep pursuing innovation and delivering higher-quality services.
On this path, we are eager to work with people who bring an agile mind, a willingness to learn, and strong motivation, and who can overcome challenges alongside us and play a part in building a better future.