EOS RPO
Senior Systems Operations Engineer
Required Qualifications:
4+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education.
Desired Qualifications:
Strong experience in large-scale distributed systems; 5+ years hands-on SRE/DevOps/Platform Engineering.
Cloud: One or more of AWS, Azure, or GCP (certifications a plus).
IaC & Automation: Terraform, Ansible/Chef; solid Git practices (GitOps).
Observability: Prometheus, Grafana, OpenTelemetry, ThousandEyes, AppDynamics, Aternity.
CI/CD: Azure DevOps, GitHub Actions, Jenkins, or GitLab CI; artifact management and environment promotions.
Programming: One of Python/Go/Java (scripting + API integrations).
Reliability Practices: SLIs/SLOs, error budgets, capacity planning, canary/blue-green deployments, chaos/DR testing.
Processes: Incident/Problem/Change management, blameless postmortems, runbook design, on-call best practices. Strong documentation and communication skills.
Job Expectations:
Define and implement SLIs/SLOs and error budgets for critical services; drive SLO adoption across teams.
Build and tune observability (metrics/logs/traces) with golden signals (latency, traffic, errors, saturation).
Partner with Performance Engineering to run load/stress/soak tests and remove performance bottlenecks.
Platform & Automation: Eliminate toil; generate AI-based observability assessments and maturity scorecards for all applications.
Create self-service reliability tooling (runbooks, bots, reliability checks, golden paths).
Incident, Problem & Change: Lead high-severity incidents (Major/SEV1), facilitate blameless postmortems, and track corrective actions.
Culture & Enablement: Coach product and ops teams on SRE principles; define maturity models and track adoption.
Build documentation (runbooks, dashboards, readiness checklists, and reliability reviews) and keep it current.