EOS RPO
Lead Observability Engineer
In this role, you will:
Own the architecture, reliability, and scalability of enterprise logging platforms.
Lead design and implementation of high‑volume, resilient log ingestion pipelines across hybrid and cloud environments.
Define and enforce logging standards, schemas, and governance aligned with enterprise observability strategy.
Design and integrate AI/ML models for anomaly detection, log classification, predictive alerting, and signal enrichment.
Build and operationalize agentic AI systems capable of autonomous log analysis and root‑cause hypothesis generation, context‑aware remediation recommendations, and intelligent correlation across logs, metrics, and traces.
Partner with platform and SRE teams to embed AI‑driven insights into incident response workflows.
Develop self‑service onboarding, configuration, and compliance automation for logging consumers.
Enable OpenTelemetry‑aligned ingestion patterns and standardized integrations.
Drive automation to reduce manual toil and improve MTTR across application and infrastructure observability.
Ensure platform availability, performance, and data quality through proactive monitoring and SLI/SLO ownership.
Lead production issue resolution, root‑cause analysis (RCA), and continuous improvement initiatives.
Partner with security and compliance teams to support auditability, retention, and access controls.
Required Qualifications:
10+ years of experience in software engineering or platform engineering, including at least 3 years in a lead role.
Deep hands‑on expertise with Splunk (search, data models, dashboards, alerts, Enterprise Security (ES), APIs, ingestion patterns).
Strong experience designing distributed, high‑throughput data platforms.
Proven experience applying machine learning to operational data (logs, metrics, events).
Hands‑on experience with agentic AI frameworks or autonomous agents (LLM‑based or rule‑driven).
Strong understanding of prompt engineering, tool‑using agents, feedback loops, and guardrails.
Proficiency in one or more languages: Python, Java, Go, or Scala.
Experience with cloud platforms, containerization, and Kubernetes/OpenShift.
Familiarity with OpenTelemetry, observability standards, and telemetry correlation.
Desired Qualifications:
Experience with large Splunk infrastructure, including clustered environments, multi-site deployments, and cloud/SaaS deployments.
Exposure to containerization and orchestration tools (Docker, Kubernetes).
Familiarity with DevOps practices and CI/CD pipelines.
Certifications in Splunk, Cribl, or cloud technologies (AWS, Azure).
Experience applying AI/ML techniques to operational or telemetry data.
Job Expectations:
Develop complex dashboards, reports, and alerts tailored to business and operational needs.
Develop migration strategies, including data ingestion, configuration, and app compatibility assessments.
Design data pipelines to optimize Splunk ingestion, reduce licensing costs, and improve system performance.
Proactively identify and resolve bottlenecks in ingestion, indexing, and search processes.