EOS RPO

Lead Observability Engineer

Posted Apr 7, 2026
Project ID: R-531615
Location: Hyderabad, Telangana
Hours/week: 40

In this role, you will:

  • Own the architecture, reliability, and scalability of enterprise logging platforms.

  • Lead design and implementation of high‑volume, resilient log ingestion pipelines across hybrid and cloud environments.

  • Define and enforce logging standards, schemas, and governance aligned with enterprise observability strategy.

  • Design and integrate AI/ML models for anomaly detection, log classification, predictive alerting, and signal enrichment.

  • Build and operationalize agentic AI systems capable of:

    • Autonomous log analysis and root‑cause hypothesis generation

    • Context‑aware remediation recommendations

    • Intelligent correlation across logs, metrics, and traces

  • Partner with platform and SRE teams to embed AI‑driven insights into incident response workflows.

  • Develop self‑service onboarding, configuration, and compliance automation for logging consumers.

  • Enable OpenTelemetry‑aligned ingestion patterns and standardized integrations.

  • Drive automation to reduce manual toil and improve MTTR across application and infrastructure observability.

  • Ensure platform availability, performance, and data quality through proactive monitoring and SLI/SLO ownership.

  • Lead production issue resolution, root‑cause analysis (RCA), and continuous improvement initiatives.

  • Partner with security and compliance teams to support auditability, retention, and access controls.

Required Qualifications:

  • 10+ years of experience in software engineering or platform engineering, including at least 3 years in a lead role.

  • Deep hands‑on expertise with Splunk (search, data models, dashboards, alerts, ES, APIs, ingestion patterns).

  • Strong experience designing distributed, high‑throughput data platforms.

  • Proven experience applying machine learning to operational data (logs, metrics, events).

  • Hands‑on experience with agentic AI frameworks or autonomous agents (LLM‑based or rule‑driven).

  • Strong understanding of prompt engineering, tool‑using agents, feedback loops, and guardrails.

  • Proficiency in one or more of the following languages: Python, Java, Go, or Scala.

  • Experience with cloud platforms, containerization, and Kubernetes/OpenShift.

  • Familiarity with OpenTelemetry, observability standards, and telemetry correlation.

Desired Qualifications:

  • Experience working on large Splunk infrastructure, including clustered environments, multi-site deployments, and cloud/SaaS deployments.

  • Exposure to containerization and orchestration tools (Docker, Kubernetes).

  • Familiarity with DevOps practices and CI/CD pipelines.

  • Certifications in Splunk, Cribl, or cloud technologies (AWS, Azure).

  • Experience applying AI/ML techniques to operational or telemetry data.

Job Expectations:

  • Develop complex dashboards, reports, and alerts tailored to business and operational needs.

  • Develop migration strategies, including data ingestion, configuration, and app compatibility assessments.

  • Design data pipelines to optimize Splunk ingestion, reduce licensing costs, and improve system performance.

  • Proactively identify and resolve bottlenecks in ingestion, indexing, and search processes.
