Data Engineer: LLM Training Data
Remote
About the role:
As a Data Engineer: LLM Training Data at Arcee, you will design and implement the data pipelines and processes that ensure the integrity and quality of the data used to train Large Language Models, including the generation of synthetic datasets for LLM training. You will collaborate closely with our Researchers, Machine Learning Engineers, and other stakeholders to gather requirements and ensure the availability of high-quality datasets.
What you’ll do:
- Identify and acquire diverse, high-quality datasets from a variety of sources.
- Develop and maintain robust data pipelines to ingest, process, and transform raw data into formats suitable for LLM training.
- Clean and preprocess data, including handling missing values, normalization, deduplication, and other data quality tasks.
- Implement data validation and monitoring processes to ensure the accuracy and consistency of datasets.
- Implement a data pipeline to augment datasets with public data as well as synthetically generated data.
- Collaborate with Researchers and Machine Learning Engineers to understand data requirements and deliver datasets that meet their needs.
- Optimize data storage and retrieval processes for efficiency and scalability.
- Stay up-to-date with industry best practices and emerging technologies in data engineering and AI.
What we’re seeking:
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
- Proven experience in data engineering, with a focus on data sourcing, preparation, and cleaning.
- Strong programming skills in languages such as Python and SQL, and familiarity with data engineering frameworks (e.g., Apache Spark, Apache Kafka).
- Experience with cloud platforms (AWS in particular) and data storage solutions (e.g., S3, BigQuery, Redshift).
- Knowledge of ETL (Extract, Transform, Load) processes and tools.
- Understanding of data quality best practices and techniques for ensuring data integrity.
- Experience working with large-scale datasets and distributed data processing systems.
- Familiarity with Machine Learning concepts and the specific data requirements for training Large Language Models.
- Knowledge of MLOps practices and tools for managing data pipelines and workflows.
- Strong problem-solving skills and the ability to work effectively in a collaborative startup environment.
- Excellent communication skills, with the ability to explain complex data concepts to both technical and non-technical stakeholders.
- Prior experience in a startup environment or a fast-paced, dynamic work setting.
About Arcee.AI
Arcee.AI emerged from the brainstorming of three co-founders – Mark McQuade, Jacob Solawetz, and Brian Benedict – who envisioned a platform that would allow companies to use SLMs to fuel innovation while still retaining full control over their data and models. It was a vision based on deep knowledge of both the technical and business aspects of AI and machine learning, which they had gained in leading roles at companies including Hugging Face, Roboflow, and Tecton.
Upon Arcee.AI’s emergence from stealth in September 2023, the market immediately confirmed the need for their easy-to-use platform for creating performant and efficient custom LLMs, or what they call Small Language Models (SLMs). They announced their Seed Round in January 2024, quickly followed by their Series A in July. They say what they’re most proud of is seeing their expanding customer base empowered by Arcee.AI-built SLMs, which are driving business value and innovation for enterprises across the globe every day.
Equal Opportunity
We are an Equal Opportunity Employer, offering equal opportunity to all regardless of race, religion, gender identity, sexual orientation, age, citizenship, marital status, disability, and more. We would like to remind candidates that the listed qualifications for each role are not hard requirements, and we encourage them to apply if they feel they would be a good fit.
Compensation
We offer competitive salaries, equity, and benefits. We base our salaries on location, role, and level as well as consideration of the candidate’s experience and overall qualifications.