Job Summary
The Data Engineer will be responsible for building and maintaining scalable and efficient data pipelines, ensuring that high-quality, accessible data is available for machine learning models, analytics, and other business applications. The ideal candidate will have strong experience in data architecture, ETL (Extract, Transform, Load) processes, and working with large datasets.
Key Responsibilities
Data Pipeline Development
Design, build, and maintain robust data pipelines to support AI and business intelligence use cases.
Implement ETL processes for transforming raw data into structured, clean, and usable formats for analysis.
Automate data integration processes to ensure consistency and efficiency.
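To make the ETL responsibility concrete, here is a minimal, hypothetical sketch of a transform step: it takes raw records, drops rows missing required fields, and casts values into a clean, analysis-ready shape. The function and field names (`transform_orders`, `order_id`, `amount`, `order_date`) are illustrative assumptions, not part of any specific stack.

```python
from datetime import date


def transform_orders(raw_rows):
    """Hypothetical ETL transform: turn raw order records into a
    structured, analysis-ready format. Rows missing required fields
    are dropped; amounts are cast to float and dates parsed."""
    clean = []
    for row in raw_rows:
        order_id = (row.get("order_id") or "").strip()
        amount = row.get("amount")
        if not order_id or amount in (None, ""):
            continue  # drop unusable rows rather than propagate bad data
        clean.append({
            "order_id": order_id,
            "amount": float(amount),
            "order_date": date.fromisoformat(row["order_date"]),
        })
    return clean
```

In a real pipeline this logic would typically run inside an orchestrator task (e.g., an Airflow operator) rather than as a bare function.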
Data Quality & Governance
Ensure data quality by implementing data validation checks and building processes for cleaning, filtering, and transforming data.
Develop and enforce data governance standards, ensuring compliance with regulatory and organizational policies.
Work with data scientists and analysts to ensure that data is suitable for modeling and analysis.
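A validation gate of the kind described above can be sketched in a few lines: split incoming rows into valid and invalid sets based on required fields and per-column check functions. This is a simplified illustration under assumed column names, not a reference to any particular data-quality framework.

```python
def validate_rows(rows, required, checks):
    """Hypothetical data-quality gate: partition rows into valid and
    invalid sets. `required` lists fields that must be present and
    non-empty; `checks` maps column names to predicate functions."""
    valid, invalid = [], []
    for row in rows:
        ok = all(row.get(f) not in (None, "") for f in required)
        ok = ok and all(fn(row[col]) for col, fn in checks.items() if col in row)
        (valid if ok else invalid).append(row)
    return valid, invalid
```

Quarantining invalid rows (rather than silently dropping them) lets analysts inspect failures and keeps downstream models trained only on vetted data.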
Database Management & Optimization
Manage data storage and ensure data is stored in a cost-effective and scalable manner (e.g., relational databases, data lakes, cloud storage).
Optimize database performance, including query efficiency, indexing, and storage management.
Implement data partitioning and sharding strategies for large datasets.
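The sharding strategy mentioned above usually comes down to a stable routing function. This hypothetical sketch hashes a record key to pick a shard; a cryptographic hash is used because Python's built-in `hash()` is randomized per process and would route inconsistently across restarts.

```python
import hashlib


def shard_for(key, num_shards=8):
    """Hypothetical sharding router: map a record key to a shard
    index using a stable hash, so the same key always lands on the
    same shard across processes and restarts."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

The same idea underlies date- or key-based partitioning in warehouses and data lakes: a deterministic function of the record decides where it is stored, so queries can prune partitions instead of scanning everything.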
Collaboration with Cross-Functional Teams
Work closely with data scientists, machine learning engineers, and analysts to understand data requirements and provide access to the necessary data.
Collaborate with the applications team to integrate data-driven solutions into business applications.
Cloud & Infrastructure Management
Utilize cloud platforms (e.g., AWS, Google Cloud, Azure) to build scalable, flexible data infrastructure.
Implement data pipelines that handle both batch and real-time data ingestion.
Ensure the security and privacy of data by following best practices in cloud storage and infrastructure.
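One common way to serve both batch and near-real-time ingestion with shared loading code is micro-batching. The sketch below is a generic, assumed illustration: it groups a (possibly unbounded) record stream into fixed-size batches and flushes the final partial batch so no records are lost.

```python
def micro_batches(stream, batch_size):
    """Hypothetical ingestion helper: group a record stream into
    fixed-size batches. Yields the trailing partial batch, if any,
    so the stream drains completely."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush whatever remains
        yield batch
```

With a small `batch_size` this behaves like streaming ingestion; with a large one it behaves like a nightly batch load, which is why the pattern appears in tools across the batch/real-time spectrum.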
Performance Monitoring & Troubleshooting
Set up monitoring for data pipelines to track performance and troubleshoot issues proactively.
Continuously improve data processes to ensure optimal performance, scalability, and fault tolerance.
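Pipeline monitoring often starts with per-step timing and failure logging. This is a minimal sketch using only the standard library; the decorator name and log format are assumptions, and a production setup would export these measurements to a metrics or alerting system.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline")


def monitored(step_name):
    """Hypothetical monitoring hook: log each pipeline step's
    duration, and log-and-re-raise failures so upstream alerting
    can fire."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
            except Exception:
                logger.exception("step %s failed", step_name)
                raise
            logger.info("step %s finished in %.2fs",
                        step_name, time.monotonic() - start)
            return result
        return wrapper
    return decorator
```

Wrapping each pipeline step this way gives a uniform place to track durations and surface errors proactively, which is the "monitor and troubleshoot" loop the responsibility describes.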
Key Qualifications
Bachelor’s or Master’s degree in Computer Science, Data Engineering, Information Technology, or a related field.
3+ years of experience in data engineering, data architecture, or a related role.
Expertise in SQL, Python, and other data processing languages (e.g., Scala, Java).
Hands-on experience with ETL tools (e.g., Apache NiFi, Apache Airflow, Talend).
Familiarity with cloud platforms (e.g., AWS, Google Cloud, Azure) and their data services (e.g., BigQuery, Redshift, S3).
Experience with data storage technologies such as relational databases, NoSQL, data lakes, or data warehouses.
Strong problem-solving and debugging skills.
Ability to work in an agile, fast-paced environment.
Knowledge of data governance and compliance regulations (e.g., GDPR, HIPAA) is a plus.