EOS RPO
Senior Software Engineer - Platform Engineering / Application Support (Ansible, Apigee, Puppet)
Required Qualifications:
4+ years of Software Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Desired Qualifications:
4+ years of software engineering experience
4+ years of application production support experience
Education: BS/BA degree or higher
An industry-standard technology certification
Strong verbal, written, and interpersonal communication skills
3+ years of experience with Cloud technologies
Knowledge and understanding of Site Reliability Engineering (SRE) concepts
3+ years of Agile experience
Advanced scripting skills, specifically around automation, log rotation, data collection, error collection, and alerting
Scripting and automation experience
Experience with complex business logic and dependencies
3+ years of CI/CD automation and configuration experience (DevOps / pipeline automation)
3+ years of experience with ITSM processes (e.g. Incident Management, Change Management, Asset Management and Configuration Management)
Hands-on experience writing and maintaining technical documentation such as fixlogs, runbooks, knowledge base articles, and architectural diagrams
Hands-on experience with system administration across multiple platforms
Hands-on experience with one or more software development languages and technologies: Java, JavaScript, Ruby, Python, JSON, Angular, Node.js, .NET/C#
Hands-on experience with one or more CI/CD automation tools: Jenkins, Gradle, Maven, Git, SonarQube, Artifactory, Ansible, Puppet, Apigee
Hands-on experience with one or more process management and scheduling tools: Autosys, JAWS
Hands-on experience with one or more Monitoring/Observability/APM/Analytics tools: Splunk, Elastic, Kibana, Grafana, Prometheus, AppDynamics, Dynatrace, New Relic, Datadog, Kafka, CloudWatch, Jaeger, Zipkin, BigPanda, TrueSight
Hands-on experience with one or more Server OS: Windows, Linux, Unix, Mainframe
Hands-on experience with one or more Cloud and virtualization technologies: Azure, GCP, AWS, PCF, PKS, Kubernetes, OpenShift, VMware
Hands-on experience with one or more Data storage, management and messaging technologies: Kafka, IBM MQ, Apache Airflow, Logstash, Spark, Oracle, SQL, MongoDB, Cassandra, Hadoop, Cloudera, AWS EMR, S3
Hands-on experience with one or more Testing Frameworks: Selenium, JMeter, BlazeMeter, Performance Center, Perfecto, Cucumber, Gherkin, ALM, Gremlin, Chaos Monkey, Chaos Toolkit, Simian Army, Toxiproxy
Working knowledge of TCP/IP networking, including experience analyzing packet captures to assist in troubleshooting
Working knowledge of Internet technologies: routing, NAT, firewalls, load-balancing, proxies, web servers
JOB EXPECTATIONS:
Ability to work additional hours as needed
Ability to work on call as assigned
Flexibility to work in a 16/7 environment, including weekends and holidays
Operational Ownership / Application Support:
Maintain system operational knowledge (functional and technical)
Understand and monitor system operation, ensure optimal availability, functional health, and performance (driven by SLO/SLA)
Triage alerts, respond to incidents, perform root cause analysis (troubleshooting)
Handle users' questions and requests related to business systems (not desktop support)
Implement change requests (manual deployment steps, overall deployment coordination)
Business continuity planning (BCP) and implementation
Ensure continuous improvements of operational processes and methods
Reliability Engineering:
Analyze each system's monitoring and observability needs (technical, functional, business), and create or adjust logging, monitoring, alerting, and analytics solutions to cover those needs
Use understanding of software engineering (system code) and infrastructure to improve the depth and quality of root cause analysis (troubleshooting)
Partner with Architecture, Infrastructure and Development teams to influence decisions that impact reliability and supportability
Identify routine or risky manual operations, and create automation solutions (scripting, tooling) or influence fixing the sources of manual work (as appropriate)
Drive deeper post-incident reviews for major incidents, to learn and improve
Engage in weakness research and analysis, and architectural reviews, to use deep knowledge of production operation to suggest improvements
Use deep knowledge of production operation to create detailed, high-quality stories and tasks on the development owners' backlog, with a focus on reliability and supportability
Ensure continuous improvements of systems' reliability and supportability