EOS RPO
Senior System Operations Engineer - Application Support
In this role, you will:
Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiency and reducing human intervention time on related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Work with vendors and other technical personnel for problem resolution
Lead the team in meeting technical deliverables while leveraging a solid understanding of technical process controls or standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve the highest levels of systems and infrastructure availability
Required Qualifications:
4+ years of Systems Engineering or Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Desired Qualifications:
4+ years in Production Support / SRE / DevOps / Platform Operations for business-critical applications.
Proven track record supporting 24x7 platforms with strict SLAs and high availability requirements.
Experience working in ITIL-aligned environments (Incident, Problem, Change).
Strong troubleshooting skills across Linux/Unix: system processes, CPU/memory, threads, disk, and network basics.
Working knowledge of application architectures: microservices, distributed systems, batch + online workloads.
Proficiency in log analysis and observability tools (e.g., Splunk/ELK, Grafana, Prometheus, AppDynamics, Dynatrace, or any equivalent).
Solid understanding of HTTP, TLS, DNS, load balancing, reverse proxy, and typical failure patterns (timeouts, 503/504, connection pool saturation).
Hands-on experience with databases (Oracle, Postgres, SQL Server, etc.): query basics, locks, slow queries, connection pooling, indexing concepts.
Familiarity with messaging/streaming systems (Kafka/RabbitMQ) and troubleshooting lag/offset/consumer issues (good-to-have).
Ability to write scripts for automation in Python / Shell / PowerShell.
Comfortable with runbooks, automation tools, CI/CD basics, and reducing manual toil.
Understanding of SLO/SLI, monitoring, alert tuning, and reliability best practices.
Strong incident handling skills: triage, mitigation, communication, and structured follow-through.
Knowledge of RCA techniques (5 Whys, fishbone, timeline-based analysis) and converting findings into preventive actions.
Experience with change management and release support; able to assess risk and enforce operational readiness.
Excellent written and verbal communication for stakeholder updates (technical + business-friendly).
Ability to collaborate across Dev, QA, DBAs, Network, Cloud/Infra teams.
Calm under pressure, structured thinker, strong ownership. Bias for root-cause and prevention over repeated firefighting. High attention to detail and commitment to operational excellence.