Essential Tools for Data Engineers: Build Your Toolkit
What Tools Do Data Engineers Really Need?
A lot of tools exist. Most are noise.
This is what actually matters:
- SQL: Talk to databases
- Python: Write code and pipelines
- Git: Version control
- Docker: Package and deploy
- Airflow: Schedule and monitor
- Databases: Store and retrieve data
- Cloud platform: Where it all runs
Master these. You’re valuable.
The Core Stack
1. SQL - Non-Negotiable
Every data engineer writes SQL daily.
-- Extract
SELECT * FROM orders WHERE date > '2025-01-01';
-- Transform
SELECT customer_id, SUM(amount) as total
FROM orders
GROUP BY customer_id;
-- Load
INSERT INTO warehouse.daily_summary (...)
SELECT ...
FROM orders
WHERE ...
Without SQL, you can't do data engineering.
2. Python - The Glue
SQL handles data in databases. Python does everything else.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/mydb')

# Extract with SQL
df = pd.read_sql('SELECT * FROM orders', engine)
# Transform with Python
df['total'] = df['qty'] * df['price']
df = df.drop_duplicates()
# Load back
df.to_sql('processed', engine, if_exists='replace', index=False)
Python + SQL = unstoppable combination.
3. Git - Collaboration and History
Your code needs version control.
git add pipeline.py
git commit -m "Fix SQL join in daily report"
git push
Non-negotiable for professional work.
4. Docker - Reproducibility
Package everything. Run everywhere.
FROM python:3.11
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY pipeline.py .
CMD ["python", "pipeline.py"]
Your local Docker environment = production environment. Predictable.
5. Airflow - Orchestration
Schedule and monitor your pipelines.
extract >> transform >> load
Airflow handles:
- When to run
- Retries on failure
- Monitoring
- Alerting
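In code, that extract >> transform >> load chain lives inside a DAG definition. A minimal sketch, assuming Airflow 2.x, with placeholder task functions:

# Minimal DAG sketch, assuming Airflow 2.x; task bodies are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...
def transform(): ...
def load(): ...

with DAG(
    dag_id='daily_pipeline',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',  # when to run
    catchup=False,
    default_args={
        'retries': 2,  # retries on failure
        'retry_delay': timedelta(minutes=5),
    },
) as dag:
    t_extract = PythonOperator(task_id='extract', python_callable=extract)
    t_transform = PythonOperator(task_id='transform', python_callable=transform)
    t_load = PythonOperator(task_id='load', python_callable=load)
    t_extract >> t_transform >> t_load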
The Real Workflow
Database (SQL) → Extract (Python) → Transform (Python) → Load (Python)
↓
Version control (Git)
↓
Package (Docker)
↓
Schedule (Airflow)
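Structured as functions, that flow is a short script. A sketch, reusing the placeholder connection string and table names from the examples in this article (not real credentials):

import pandas as pd
from sqlalchemy import create_engine

def extract(engine):
    # Database (SQL) -> DataFrame
    return pd.read_sql('SELECT * FROM orders', engine)

def transform(df):
    # Enrich and clean in Python
    df['total'] = df['qty'] * df['price']
    return df.drop_duplicates()

def load(df, engine):
    # Write the result back for downstream use
    df.to_sql('daily_summary', engine, if_exists='replace', index=False)

if __name__ == '__main__':
    engine = create_engine('postgresql://user:pass@localhost/mydb')
    load(transform(extract(engine)), engine)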
That’s data engineering.
Skills Progression
Month 1: SQL
- Write basic queries
- Understand databases
- Extract data
Month 2: Python
- Learn syntax
- Use Pandas
- Transform data
Month 3: Git + Version Control
- Commit code
- Collaborate with team
- Review changes
Month 4: Docker
- Create images
- Run containers
- Deploy pipelines
Month 5: Airflow
- Create DAGs
- Schedule workflows
- Monitor pipelines
Month 6+: Advanced
- Optimization
- Testing (a sketch follows this list)
- Complex workflows
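Testing can be as simple as a unit test for one transform. A minimal sketch using pytest; transform() here is a hypothetical function from your pipeline:

# Run with: pytest test_pipeline.py
import pandas as pd

def transform(df):
    # Hypothetical pipeline transform: compute totals, then dedupe
    df['total'] = df['qty'] * df['price']
    return df.drop_duplicates()

def test_transform_computes_total_and_dedupes():
    raw = pd.DataFrame({'qty': [2, 2, 3], 'price': [5.0, 5.0, 1.0]})
    result = transform(raw)
    assert len(result) == 2  # duplicate row removed
    assert result['total'].tolist() == [10.0, 3.0]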
Real Data Engineering Day
9 AM: Check Airflow dashboard. All pipelines ran successfully.
10 AM: New requirement: Add field to daily report.
# Modify Python script
df['new_field'] = df['col1'] * df['col2']
10:30 AM: Test locally.
docker build -t pipeline:latest .
docker run pipeline:latest
11 AM: Verify results in database.
SELECT * FROM warehouse.daily_report LIMIT 10;
11:30 AM: Commit and push to Git.
git add -A
git commit -m "Add new_field to daily report"
git push
12 PM: Deploy new Docker image.
docker tag pipeline:latest your-registry/pipeline:latest  # "your-registry" is a placeholder
docker push your-registry/pipeline:latest
1 PM: Lunch. Pipeline runs automatically.
2 PM: Check logs. Everything working. Back to designing the next feature.
That’s real data engineering.
Tools You DON’T Need (Yet)
Spark: Learn Pandas first. Add Spark later if you have 100GB+ datasets.
Kubernetes: Learn Docker first. Add Kubernetes if you need to manage 100+ containers.
Data Warehouse: Use PostgreSQL first. Switch to BigQuery/Snowflake if needed.
Advanced ML tools: Learn Python and SQL first. Everything else is optional.
Data catalogs, metadata tools, etc.: Build solid pipelines first. Add them later.
Start simple. Add complexity only when you hit real problems.
Minimal Viable Toolkit
Day 1:
- PostgreSQL (free, reliable)
- Python (free, powerful)
- VS Code (free, good editor)
That’s $0. You can start today.
Write a script that:
- Extracts from PostgreSQL
- Transforms data
- Loads to another table
Done. You’re doing data engineering.
Week 2: Add Git
Month 2: Add Docker
Month 3: Add Airflow
Each tool builds on the previous ones.
Real Example: Minimal Setup
Create a simple data pipeline with zero budget.
Step 1: Install tools
# Check Python
python --version
# PostgreSQL: download from postgresql.org
# Check Git
git --version
# Editor: download VS Code
Step 2: Create project
mkdir my_pipeline
cd my_pipeline
git init
Step 3: Write pipeline
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@localhost/mydb')
# Extract
df = pd.read_sql('SELECT * FROM raw_data', engine)
# Transform
df['total'] = df['qty'] * df['price']
df = df.drop_duplicates()
# Load
df.to_sql('processed_data', engine, if_exists='replace', index=False)
print(f"Processed {len(df)} rows")
Step 4: Test
python pipeline.py
Step 5: Version control
git add pipeline.py
git commit -m "Initial pipeline"
git push  # requires a remote (e.g., on GitHub) to be configured first
Done. You have a working data pipeline with version control.
Add Docker next. Then Airflow. Progressive improvement.
Tool Comparison
| Tool | Purpose | Priority |
|---|---|---|
| SQL | Query data | Critical |
| Python | Write code | Critical |
| PostgreSQL | Store data | Critical |
| Git | Version control | Critical |
| Docker | Deployment | High |
| Airflow | Scheduling | High |
| dbt | Data transformation | Medium |
| Spark | Big data | Medium |
| BigQuery | Cloud warehouse | Medium |
| Kafka | Streaming | Low |
Master the critical ones. Everything else is bonus.
Where to Start
Learn SQL (2 weeks)
- Write 10 queries against sample data
- Understand joins, aggregations, subqueries
Learn Python (2 weeks)
- Write 5 small scripts
- Learn Pandas, file I/O, basic functions
Combine SQL + Python (1 week)
- Write a script that extracts with SQL, transforms with Python, loads back
- Run it 5 times. Verify results
Add Git (1 week)
- Put your script in Git
- Make changes, commit, push
Add Docker (1 week)
- Containerize your script
- Run locally from Docker
Add Airflow (1 week)
- Schedule your script
- Run automatically daily
7 weeks. You have the basics to build and run simple pipelines and to keep learning on the job.
Professional Toolkit Example
Company size: 50 people
- PostgreSQL (main database)
- Python (scripts and pipelines)
- Git (GitHub for code)
- Docker (package applications)
- Airflow (schedule pipelines)
- S3 (file storage, optional)
That’s 90% of what they need. Everything else is optimization.
Company size: 500 people
- Multiple databases (PostgreSQL, Redshift, DynamoDB)
- Python + Scala (Spark for big data)
- Git (GitHub or GitLab)
- Docker (containerization)
- Kubernetes (container orchestration, optional)
- Airflow (orchestration)
- Data warehouse (Redshift or Snowflake)
- Data lake (S3 or HDFS)
- Kafka (streaming, optional)
More tools. Same principles.
Bottom Line
Master the core tools:
- SQL
- Python
- Git
- Docker
- Airflow
- Databases
Everything else is nice to have.
You don’t need the fanciest tools. You need solid fundamentals.
Write clean code. Version control it. Test it. Deploy it reliably. Monitor it.
Those practices matter more than any specific tool.
Start with basics. Add tools as you hit real limitations.
Most data engineers use the same core toolkit their whole career.
Need help implementing this in your company?
For delivery-focused engagements (Data Engineering, Data Architecture, Data Product Owner), visit ISData Consulting.