Role: Service Reliability Engineer
Employment Type: Permanent
Interview: Phone/Skype
Work Location: Cincinnati / Minneapolis / NYC area. Open to Travel
Responsibilities
• Ensure the reliability of the IT environment (application and infrastructure); perform fault layer isolation during a Major Incident in production and troubleshoot the root cause across the entire stack (application to infrastructure)
• Assess any change to the production environment for its impact on the reliability of the environment
• Create, update, and review runbooks, troubleshooting checklists, incident response reports, postmortem reports, and root cause analyses
• Identify opportunities for automation in routine operational activities, namely health checks, eyes-on-glass monitoring, deployment scripts, and configuration management scripts
• Use predictive modelling for capacity planning of business systems by correlating business workloads with system usage
• Train and mentor aspiring Reliability Engineers
Basic Qualifications:
• BS Degree in engineering, science, mathematics, information systems or computer science
• 7-10 years of experience managing IT production support for applications and infrastructure, preferably as a Reliability Engineer or Performance Engineer of IT systems
Must Have Skills
• Familiarity with business systems (COTS and bespoke), integration patterns and underlying infrastructure architecture, cloud platforms, IaaS, PaaS, SaaS, SOA, microservices, APIs, containerization, Big Data technologies, network architectures, etc.
• Good troubleshooting skills across application and infrastructure; familiarity with different types of system logs and error messages/codes and their interpretation
• Ability to do fault layer isolation across the application and infrastructure stack to identify hot spots/issues in terms of capacity, performance, high availability, etc.
• Understanding of Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets for managing the reliability of systems
• Familiarity with Java, Python, and PowerShell
• Strong communication skills (written, verbal, and listening) with the ability to communicate effectively with stakeholders of varying technical expertise
• Hands-on knowledge of tools:
o Systems monitoring, alerting, and analytics (Nagios, SolarWinds, Dynatrace, New Relic, AppDynamics, Splunk ITSI, ELK stack, Kafka, Sumo Logic, etc.)
o Configuration management tools such as Chef, Puppet, and Ansible
o Orchestration tools such as Automic Service Orchestration, Ayehu eyeShare, CA Workload Administrator, etc.
o Virtualization and containers (Xen, KVM, Docker, etc.) and storage (NFS, SANs, RAID, LVM)
o DevOps tools (Bitbucket, Jenkins, Jules, automated deployment tools) with CI/CD capabilities
Good to Have Skills
• TOGAF certification