Job Title: Site Reliability Engineer
Location: San Francisco, CA
Employment Type: Permanent
Interview: Phone/Skype
About The Role
The client's Site Reliability team is building a highly resilient, performant, and secure internal Platform as a Service that hosts the core backend services controlling their fleet of vehicles, runs driving simulations at scale, and executes machine learning training jobs for their Autonomous Vehicle engineering team. The team currently uses AWS, Docker, Kubernetes, Vault, and Spinnaker. The client is looking for a Site Reliability Engineer (SRE) to help them continue to scale and support their growing infrastructure. The SRE will work with the DevOps, Engineering Productivity, and other engineering teams, providing experience and knowledge to support deployments and architecture.
Day-to-day Responsibilities Include
• Build systems that scale to manage hundreds of petabytes of data across multiple physical locations
• Support our CI & Simulation infrastructure running on AWS GPU instances
• Manage our on-premises and cloud Kubernetes clusters to support growing workloads
• Design and implement best practices for security, monitoring, and logging systems
• Set the technical direction for our infrastructure team
Required Skills
• 5+ years of experience
• Experience managing container-based workloads, using Kubernetes or other orchestration software
• Experience writing production software using Go, Python, C++, or similar languages
• Ability to manage competing priorities, focus on shipping, and work well under pressure
• Experience with AWS, or other cloud infrastructure providers
Preferred Skills
• Experience managing physical compute, storage, and networking hardware
• Experience with Hadoop, Spark, or other data processing tools