Snappfood (اسنپ فود)

Founded in 1388 SH (2009/10) · Computer, IT, and Internet · More than 1,000 employees · snappfood.ir

Hiring: Observability Engineer

  • Job category

    IT / DevOps / Server
  • Location

    Tehran, Tehran
  • Employment type

    Full-time
  • Minimum work experience

    Three to six years
  • Salary

    Negotiable

Job Description

About Snappfood

At Snappfood, we believe in creating value that goes beyond the ordinary. We embrace innovation and continuously challenge ourselves to build reliable and scalable technology that serves millions of users every day.

We are looking for an experienced Observability Engineer to join our Production Reliability & Operations team and help us improve the reliability, visibility, and operational excellence of our production platforms. If you enjoy solving complex operational problems, building monitoring solutions, and enabling engineering teams with better observability, we would love to have you continue this story with us.


Role Summary:

As an Observability Engineer, you will be responsible for designing, implementing, and continuously improving monitoring, alerting, and observability practices across our production systems. You will work closely with engineering teams to ensure that services are measurable, actionable, and operationally mature.

You will play a key role in improving incident detection, reducing Mean Time to Detect (MTTD), and enabling faster and more effective incident response.
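
As a rough illustration of the metric mentioned above, MTTD can be read as the average gap between when an incident actually began and when monitoring first flagged it. The sketch below assumes incident records are available as (started, detected) timestamp pairs; the function name and the sample data are hypothetical and do not come from Snappfood's tooling.

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents):
    """Average delay between incident start and first detection.

    `incidents` is a list of (started, detected) datetime pairs
    (a hypothetical record shape chosen for this sketch).
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    delays = [detected - started for started, detected in incidents]
    return sum(delays, timedelta()) / len(delays)

# Made-up sample data: detection took 4 minutes and 8 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    (datetime(2024, 1, 2, 22, 30), datetime(2024, 1, 2, 22, 38)),
]
print(mean_time_to_detect(incidents))  # 0:06:00
```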


Responsibilities:

Monitoring & Observability

  • Design, implement, and maintain monitoring solutions for applications, infrastructure, and business-critical services.
  • Build and maintain dashboards, service health indicators, and operational reports.
  • Define and promote observability standards, including metrics, logs, traces, and service instrumentation.
  • Ensure critical systems have adequate monitoring coverage and operational visibility.
  • Continuously improve telemetry quality and monitoring effectiveness.

Alert Engineering

  • Design and maintain actionable alerts and escalation policies.
  • Reduce alert fatigue by improving signal-to-noise ratio and eliminating duplicate or low-value alerts.
  • Define alert standards and thresholds based on service reliability objectives.
  • Develop proactive monitoring mechanisms to identify issues before they impact customers.

Incident Detection & Response

  • Continuously monitor production environments and respond to operational incidents.
  • Participate in incident response activities and support major incident investigations.
  • Analyze monitoring data during incidents to assist troubleshooting and root cause identification.
  • Collaborate with engineering teams to implement preventive actions and improve service resilience.

Reliability Improvement

  • Identify monitoring gaps and recommend improvements to system reliability and operational readiness.
  • Partner with engineering teams to improve instrumentation, observability, and service maturity.
  • Support the implementation of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and reliability reporting.
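
For readers less familiar with the SLI/SLO terms above, here is a minimal sketch of a request-based availability SLI and the error budget left under an SLO target. The function names, the 99.9% target, and the request counts are assumptions made for illustration only, not figures from this posting.

```python
def availability_sli(total_requests, failed_requests):
    """SLI: fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests

def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent for the window.

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total_requests. A negative result means
    the budget has been overspent.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget; any failure exhausts it.
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures

# Hypothetical window: 1,000,000 requests, 250 failures, 99.9% SLO.
sli = availability_sli(1_000_000, 250)
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"SLI: {sli:.4%}")                 # SLI: 99.9750%
print(f"Budget left: {remaining:.1%}")   # Budget left: 75.0%
```

Reliability reporting of the kind described in this section typically tracks how quickly such a budget is being consumed over time.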

Documentation & Reporting

  • Maintain monitoring documentation, runbooks, dashboards, and operational procedures.
  • Produce reports on service health, incidents, alert trends, and monitoring coverage.
  • Ensure incident records and operational documentation remain accurate and up to date.

Operational Support

  • Participate in a 24/7 shift rotation to ensure continuous operational visibility and timely incident response.
  • Participate in on-call rotations and emergency response activities when required.

Requirements:

  • 3+ years of experience in Observability Engineering, Site Reliability Engineering (SRE), Production Operations, NOC, Systems Engineering, or related fields.
  • Experience operating and supporting production systems in a 24/7 environment.
  • Hands-on experience with monitoring, troubleshooting, and incident response processes.
  • Strong experience with monitoring and observability platforms such as Prometheus, Grafana, and Zabbix.
  • Experience with centralized logging solutions such as ELK, Loki, and Splunk.
  • Familiarity with distributed tracing and observability concepts, including OpenTelemetry and Tempo.
  • Experience configuring dashboards, alerts, service health reports, and monitoring automation.
  • Solid understanding of Linux/Unix systems and troubleshooting methodologies.
  • Good understanding of networking fundamentals and distributed systems concepts.
  • Familiarity with cloud-native environments and container platforms is a plus.

Preferred Qualifications:

  • Experience with Kubernetes and containerized environments.
  • Understanding of SLI/SLO concepts and reliability engineering practices.
  • Experience with automation and scripting using Python, Bash, or Go.
  • Experience working in high-traffic, mission-critical production environments.

About the Company

Snappfood is the largest online food ordering service in Iran. By offering a variety of services such as ordering bread, protein products, pastries, and fruit, we strive to build a complete, convenient, and fast experience for our users.
The trust and support of several million users encourage us to keep pursuing innovation and higher-quality services.
On this path, we are eager to work with people who, with an agile mind, a learner's attitude, and strong motivation, can work through challenges alongside us and play a part in building a better future.
  • Required skills

    Linux, Monitoring, Zabbix
  • Gender

    No preference
  • Military service status

    Not a factor
  • Minimum education

    Bachelor's degree
