Rockset’s vision is to make the world more data-driven. Building powerful data applications today requires combining so many complex, interdependent data management systems that the result often resembles a Rube Goldberg machine. With Rockset, we imagine a world where developers and data scientists can go from complex data sets to fast, interactive applications and analysis effortlessly, within minutes. Help us make this world a reality!
We are a fast-growing company. We value curiosity, diversity, and open-mindedness. You will solve interesting problems, surrounded by exceptional people, while making customers happy. We work hard, but also take our personal lives and experiences seriously. We are based in San Mateo, CA.
ABOUT THE OPPORTUNITY
As a site reliability engineer, you will be responsible for the automation, stability, security, configuration, monitoring, alerting, and capacity planning of Rockset's network, systems, and infrastructure. You will also build tools that help the rest of the engineering team be more productive, including the ones that Rockset engineers use to deploy and manage their services. The team currently consists of one person, so you will have a foundational impact on the systems we create. The on-call pager is shared by most of the engineering team, not just SRE.
Our infrastructure is hosted entirely in Amazon Web Services. We use a variety of homegrown, open-source, and commercial tools, including Kubernetes, Docker, Kafka, ZooKeeper, Prometheus, Grafana, Salt, Terraform, Phacility, and Buildkite. We try to deploy new code to our production environment twice a week, but as an SRE you can expect to make production changes on a daily basis.
You should expect to collaborate with all other engineering teams to develop solutions that meet reliability, security, and business requirements. Lastly, you will diagnose, triage, and build solutions for complex technical issues at scale.
You'd be a great fit if you are:
- passionate about distributed systems, database technologies, and highly scalable services
- poised under fire and willing to share an on-call rotation with the rest of the team
- a self-starter who thrives in a fast-paced environment
- willing to learn new skills and technologies
- attentive to details and comfortable with ambiguity
It would be even more awesome if you also have:
- Bachelor's or Master's degree in Computer Science or a related field, or relevant work experience
- Experience as an SRE for 3+ years
- Experience building and operating public-facing 24x7 web applications at scale
- Experience working with cloud infrastructure and patterns (AWS preferred)
- Strong programming skills in a scripting language (Python, Ruby, Bash)
- Experience with Kubernetes, Mesos, Swarm, or similar container orchestration tools
- Experience with Terraform, Salt, Chef, Packer, or similar configuration management tools
- Experience with Grafana, Prometheus, Datadog, or similar monitoring tools
BENEFITS
- Competitive salary & equity at a fast-growing startup
- Fully funded comprehensive medical, dental, and vision coverage
- Lunch provided every day
- Flexible schedule
- Flexible paid time off (we encourage at least 3 weeks a year)
- Paid parental leave
- Work-from-Tahoe week, when the entire company temporarily relocates to Tahoe for one work week
- Fun office environment with TGIF happy hours, board game nights, picnics, and BBQs
- Whatever equipment you need to get the job done
OUR COMMITMENT TO DIVERSITY
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.