Job Overview
We are looking for strong Data Engineers to join our team at Clara Analytics, a fast-paced AI startup in InsurTech. The hire will develop software to ingest data received from our customers by implementing data pipelines, process the data and improve its quality, and then contribute to building Machine Learning models that generate outcome predictions from our customers' data. Our customers use these predictions to meet their business needs.
An ideal candidate is very familiar with Apache open-source projects, AWS, and the Hadoop ecosystem, and is comfortable working with application engineers, data scientists, product managers, and the customer delivery team to meet our business objectives.
The right candidate will also contribute to the overall data architecture at Clara Analytics and be prepared to augment, or even redesign, parts of the data platform to keep pace with rapidly growing customer requirements in InsurTech.
Responsibilities
● Implement and maintain Data Pipeline architecture using technologies like Airflow
● Process large, complex customer data sets that often need augmentation from other authoritative data sources for quality and validation
● Develop software to ensure data is represented in a canonical form that can be used to train Machine Learning models
● Accurately identify the various entities used in our data platform, such as Providers, Hospitals, Clinics, Organizations, and Attorneys
● Provide search and load interfaces so data can be used within and across the organization, and implement software that makes data more easily accessible
● Champion and ensure the highest level of security for customer data categorized as PII (Personally Identifiable Information) and PHI (Protected Health Information)
● Work with the DevOps team to provide data redundancy across AWS Availability Zones
● Be curious and passionate about building products that increase customer value. Data engineers are expected to participate in complex design discussions that make our products stronger and more resilient while keeping a sharp customer focus.
Qualifications
● Strong familiarity with the Apache open-source stack and the Hadoop ecosystem, including Apache Spark, Hive, and Elasticsearch
● Familiarity with AWS cloud services, including EC2, S3, structured and unstructured data tools, and security roles
● Strong analytical skills and advanced SQL knowledge, including indexing and query-optimization techniques
● Experience implementing software for data processing, metadata management, data transformation, and dependency/workflow DAGs (Directed Acyclic Graphs)
● Strong analytical skills working with semi-structured and unstructured data
● Experience developing software that helps data scientists build predictions on clean, high-quality data sets
● Experience working with cross-functional teams in a fast-paced environment
● A graduate degree in Computer Science, Computational Statistics, Information Systems, or another quantitative field, together with 3+ years of data engineering experience, is desirable. A Bachelor's degree in one of these fields with 5+ years of relevant experience is required.
● Experience with the following software/tools is highly desired:
○ Apache Spark, Kafka, Hive, etc.
○ SQL and NoSQL databases such as MySQL, Postgres, DynamoDB, and Cassandra
○ Workflow management tools like Airflow
○ AWS cloud services: EC2, S3, EMR, RDS, Redshift
● Familiarity with Spark programming paradigms (batch and stream processing)
● Strong programming skills in at least one of the following languages: Java, Scala, Python, C++
● Familiarity with one or more scripting languages