We are looking for someone who works well independently and also embraces a great team dynamic! The Site Reliability Engineer will be responsible for maintaining a 24x7x365 production environment running Linux (CentOS 6 and 7). The scope of work includes server/hardware installations, OS upgrades, patch management, configuration management, release management, automation, and security administration. Time will be split between ops-related work and development work; the goal of the development work is to reduce the amount of ops work through automation and new features. The ideal candidate would be as comfortable writing the application code as creating the Docker image and running it in Kubernetes. People who like to work hard and play hard will love working for Dstillery!
Essential Functions and Responsibilities
- Build and maintain a highly automatic, self-healing, scalable infrastructure.
- Support and maintain on-premise production servers (Lenovo, IBM, and HP blades and rackmount servers).
- Server hardware installation; OS installation using installation tools (Cobbler, Kickstart, Salt, Puppet, etc.).
- Responsible for server builds, configuration, backups and patches.
- Evaluate and implement new technologies that further the adoption of DevOps practices.
- Provide guidance during the planning stage to development teams in relation to system design/architecture.
- Work closely with development and operations teams during the entire service lifecycle, from inception to production, with the goal of producing reliable, scalable services.
- Maintain and add features to the CI/CD pipeline.
- Develop checks and maintain system health monitoring and administration tools (Icinga, Grafana, ELK, InfluxDB).
- Fix production issues by analyzing code and making hotfixes through the standard code deployment process.
- Provide documentation of configuration and troubleshooting procedures to assist other team members.
- Automate away repeated tasks.
- Participate in a rotating 24x7 on-call schedule, with responsibility to see problems through to resolution.
- Available to occasionally work evenings or weekends for large-scale or high-priority projects.
- Periodic travel to colocation facility in NJ and occasional travel to LA.
- Evangelize DevOps principles and practices.
Required Skills and Experience
- 5+ years Linux experience in a 24x7 production environment, with a solid understanding of performance tuning and end-to-end troubleshooting.
- Dev Tools including Git, Jenkins and Docker
- 3+ years Software Development experience (Java preferred)
- Previously built a Private Cloud or worked in a hybrid on-prem/public cloud environment leveraging Kubernetes
- Programming Experience (Java, Groovy, Python or Go)
- Experience installing and managing Tomcat and Jetty applications, including JVM internals and tuning.
- Demonstrated ability to troubleshoot problems through to resolution, with an understanding of issues at the network, OS, and application levels.
- Ability to configure, customize, and support typical UNIX daemons, including NFS, BIND and Apache.
- Comfortable in a colocation environment, with regard to power and cable management, cooling, and adherence to installation standards.
One or more of the following is a PLUS:
- Load Balancing Technologies (LVS, Piranha, Consul)
- Routing and Switching (BGP, Cisco IOS)
- Hadoop and related technologies (HDFS, YARN, Hive, Tez, Spark, Flume, Zookeeper, Ambari)
- NoSQL databases (Cassandra, Scylla) and relational databases (MySQL, PostgreSQL).
- Open source virtualization technologies (oVirt)
- Messaging Brokers (Kafka)