Service Reliability Engineer

Role: Service Reliability Engineer
Emp Type: Permanent Job
Interview: Phone/Skype
Work Location: Cincinnati / Minneapolis / NYC area. Open to travel.
 
 

 Responsibilities 
 • Ensure the reliability of the IT environment (application and infrastructure), perform fault layer isolation during major incidents in production, and troubleshoot root causes across the entire stack (application to infrastructure)
 • Assess any change to the production environment for its impact on the reliability of that environment
 • Create, update, and review runbooks, troubleshooting checklists, incident response reports, postmortem reports, and root cause analyses
 • Identify opportunities to automate routine operational activities, such as health checks, eyes-on-glass monitoring, deployment scripts, and configuration management scripts
 • Use predictive modelling for capacity planning of business systems by correlating business workloads with system usage
 • Train and mentor aspiring Reliability Engineers
  
 Basic Qualifications: 
 • BS Degree in engineering, science, mathematics, information systems or computer science 
 • 7-10 years of experience in IT production support for applications and infrastructure, preferably as a Reliability Engineer or a Performance Engineer of IT systems
 
 Must Have Skills
 • Familiarity with business systems (COTS and bespoke), integration patterns and underlying infrastructure architecture, cloud platforms, IaaS, PaaS, SaaS, SOA, microservices, APIs, containerization, big data technologies, network architectures, etc.
 • Good troubleshooting skills across application and infrastructure, including familiarity with different types of system logs, error messages, and error codes, and how to interpret them
 • Ability to do fault layer isolation across the application and infrastructure stack to identify hot spots and issues in capacity, performance, high availability, etc.
 • Understanding of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for managing the reliability of systems
 • Familiarity with Java, Python, and PowerShell
 • Strong communication skills (written, verbal, and listening) with the ability to communicate effectively with stakeholders of varying technical expertise 
 • Hands-on knowledge of tools:
 o Systems monitoring, alerting, and analytics (Nagios, SolarWinds, Dynatrace, New Relic, AppDynamics, Splunk ITSI, ELK stack, Kafka, Sumo Logic, etc.)
 o Configuration management tools like Chef, Puppet, and Ansible
 o Orchestration tools like Automic Service Orchestration, Ayehu eyeShare, CA Workload Administrator, etc.
 o Virtualization and containers (Xen, KVM, Docker, etc.) and storage (NFS, SANs, RAID, LVM)
 o DevOps tools (Bitbucket, Jenkins, Jules, automated deployment tools) with CI/CD capabilities
 Good to have skills 
 • TOGAF certification
 
 
