Python: The Language Data Engineers Use Daily

What is Python?

Python is a programming language. Easy to read. Powerful. Flexible.

Data engineers use it for everything: extract data, transform data, build pipelines, automate tasks, analyze data.

Why Python for Data Engineering?

Readable: Code looks like English.

customers = get_data_from_database('customers')
cleaned = remove_duplicates(customers)
save_to_warehouse(cleaned)

Compare that to Java or Scala: the same three steps need classes, type declarations, and build configuration before a single row moves.

Libraries: Thousands of libraries for data work.

  • Pandas: Manipulate data
  • NumPy: Math and arrays
  • SQLAlchemy: Talk to databases
  • Requests: Get data from APIs
  • PySpark: Big data processing

Community: Huge Python community. Answers to every question exist.

Industry standard: Most data teams use Python for pipelines, scripting, and tooling.

What You Do With Python in Data Engineering

Extract data:

import requests
response = requests.get('https://api.example.com/customers')
data = response.json()

Transform data:

import pandas as pd

df = pd.read_csv('orders.csv')
df['total'] = df['quantity'] * df['price']
df['month'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m')

Load data:

from sqlalchemy import create_engine

engine = create_engine('postgresql://user:pass@localhost/mydb')
df.to_sql('orders_processed', engine, if_exists='append', index=False)

Automate tasks:

import schedule
import time

def run_pipeline():
    extract()
    transform()
    load()

# schedule expects 24-hour HH:MM format
schedule.every().day.at("02:00").do(run_pipeline)

while True:
    schedule.run_pending()
    time.sleep(60)

Build pipelines:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

# extract_data, transform_data, load_data are your own functions
with DAG('daily_etl', start_date=datetime(2024, 1, 1), schedule='@daily') as dag:
    extract = PythonOperator(task_id='extract', python_callable=extract_data)
    transform = PythonOperator(task_id='transform', python_callable=transform_data)
    load = PythonOperator(task_id='load', python_callable=load_data)
    extract >> transform >> load  # Define workflow

Core Python Libraries for Data Engineers

Pandas: Manipulate data like spreadsheets.

import pandas as pd

df = pd.read_csv('data.csv')
df['total'] = df['qty'] * df['price']
summary = df.groupby('category')['total'].sum()

SQLAlchemy: Talk to any database.

from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:pass@localhost/mydb')
# SQLAlchemy 2.0 removed engine.execute(); run queries on a connection
with engine.connect() as conn:
    result = conn.execute(text('SELECT * FROM customers'))

Requests: Get data from APIs.

import requests

# timeout stops hung requests; raise_for_status surfaces HTTP errors
response = requests.get('https://api.github.com/users/github', timeout=10)
response.raise_for_status()
data = response.json()

PySpark: Process large data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myapp').getOrCreate()
# header=True uses the first row as column names; inferSchema detects types
df = spark.read.csv('large_file.csv', header=True, inferSchema=True)
df.filter(df.age > 25).show()

Logging: Track what your code does.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info('Starting pipeline')
logger.error('Pipeline failed')

Real Example: Extract, Transform, Load in Python

import pandas as pd
from sqlalchemy import create_engine
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def extract():
    """Extract data from database"""
    logger.info('Extracting data...')
    engine = create_engine('postgresql://user:pass@localhost/mydb')
    df = pd.read_sql('SELECT * FROM raw_orders', engine)
    return df

def transform(df):
    """Clean and transform data"""
    logger.info('Transforming data...')

    # Remove duplicates
    df = df.drop_duplicates()

    # Remove nulls
    df = df.dropna()

    # Calculate totals
    df['total'] = df['quantity'] * df['price']

    # Add month
    df['month'] = pd.to_datetime(df['date']).dt.strftime('%Y-%m')

    return df

def load(df):
    """Load data to warehouse"""
    logger.info('Loading data...')
    engine = create_engine('postgresql://user:pass@localhost/warehouse')
    df.to_sql('orders_processed', engine, if_exists='append', index=False)
    logger.info(f'Loaded {len(df)} rows')

# Run pipeline
try:
    df = extract()
    df = transform(df)
    load(df)
    logger.info('Pipeline completed successfully')
except Exception as e:
    logger.error(f'Pipeline failed: {e}')

Python Best Practices

Use virtual environments:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Keep requirements.txt:

pandas==2.1.0
sqlalchemy==2.0.23
requests==2.31.0
python-dotenv==1.0.0

Use logging, not print():

# Bad
print("Data loaded")

# Good
logger.info("Data loaded")

Handle errors:

try:
    connect_to_database()
except ConnectionError:
    logger.error("Database connection failed")
    raise

Use functions:

# Bad
data = pd.read_csv('file.csv')
data['total'] = data['qty'] * data['price']
data.to_csv('output.csv')

# Good
def process_orders(input_file, output_file):
    data = pd.read_csv(input_file)
    data['total'] = data['qty'] * data['price']
    data.to_csv(output_file, index=False)

process_orders('input.csv', 'output.csv')

Use environment variables for secrets:

import os

db_password = os.getenv('DB_PASSWORD')
api_key = os.getenv('API_KEY')
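
The requirements.txt above already pins python-dotenv, so here is a minimal sketch of loading those values from a local .env file (the .env file and its keys are assumptions; keep it out of version control):

import os
from dotenv import load_dotenv

# Reads KEY=value pairs from a .env file into the process environment
load_dotenv()

db_password = os.getenv('DB_PASSWORD')
api_key = os.getenv('API_KEY')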

Common Python Patterns in Data Engineering

Reading from multiple sources:

df1 = pd.read_csv('file1.csv')
df2 = pd.read_sql('SELECT * FROM table2', engine)
# API responses are JSON; wrap them in a DataFrame before combining
df3 = pd.DataFrame(requests.get('https://api.example.com/data').json())

combined = pd.concat([df1, df2, df3])

Batch processing:

for chunk in pd.read_csv('large_file.csv', chunksize=1000):
    process_chunk(chunk)

Error handling:

def get_data_safely(query):
    try:
        return pd.read_sql(query, engine)
    except Exception as e:
        logger.error(f'Query failed: {e}')
        return None

Parallel processing:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=4) as executor:
    # map returns results lazily; materialize them inside the pool's scope
    results = list(executor.map(process_item, items))

Python vs Other Languages for Data Engineering

Python: Readable, lots of libraries, easy to learn. Best for most data work.

Scala: Faster for big data. More complex. Used with Spark.

SQL: Essential, but limited to querying and transforming data inside a database; it can't call APIs or orchestrate jobs. Always paired with Python.

R: Good for statistics. Less used in production. More academic.

Java: Verbose. Found mostly in legacy systems and JVM tooling. Not recommended as a primary pipeline language.

For data engineering: Python is the right choice.

Getting Started with Python

Install Python 3.11:

# Download from python.org
python --version

Create project:

mkdir my_project
cd my_project
python -m venv venv
source venv/bin/activate

Install libraries:

pip install pandas sqlalchemy requests

Write simple script:

import pandas as pd

df = pd.read_csv('data.csv')
print(f"Loaded {len(df)} rows")
print(df.head())

Run it:

python script.py

That’s it. You’re doing Python data engineering.

The Python Ecosystem

Development: Write code

  • VS Code, PyCharm

Testing: Verify code works

  • pytest, unittest

Documentation: Explain code

  • docstrings, markdown

Packaging: Share code

  • pip, setuptools

CI/CD: Automate deployment

  • GitHub Actions, Jenkins

These tools make Python professional and reliable.
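
As an illustration, here is a minimal pytest check for the total calculation used throughout this article; add_total is a hypothetical helper mirroring the transform step:

import pandas as pd

def add_total(df):
    """Compute order totals, as in the transform step."""
    df['total'] = df['quantity'] * df['price']
    return df

def test_add_total():
    df = pd.DataFrame({'quantity': [2, 3], 'price': [10.0, 5.0]})
    assert add_total(df)['total'].tolist() == [20.0, 15.0]

Run it with pytest from the project root; it discovers any function named test_*.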

Real Example: Daily Pipeline in Python

#!/usr/bin/env python3
"""Daily ETL pipeline"""

import logging
import pandas as pd
from datetime import datetime
from sqlalchemy import create_engine

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

def main():
    """Main pipeline"""
    logger.info('Pipeline started')

    # Extract
    engine = create_engine('postgresql://user:pass@localhost/mydb')
    df = pd.read_sql(
        "SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 day'",
        engine
    )
    logger.info(f'Extracted {len(df)} rows')

    # Transform
    df['total'] = df['quantity'] * df['price']
    df['date'] = pd.to_datetime(df['created_at']).dt.date

    # Load
    warehouse = create_engine('postgresql://user:pass@localhost/warehouse')
    df.to_sql('daily_orders', warehouse, if_exists='append', index=False)
    logger.info(f'Loaded {len(df)} rows to warehouse')

    logger.info('Pipeline completed')

if __name__ == '__main__':
    main()

Schedule this to run daily with Airflow or cron. Done.
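
For cron, one crontab entry is enough. Paths here are illustrative; point them at your own script and virtualenv:

0 2 * * * /opt/my_project/venv/bin/python /opt/my_project/pipeline.py

The five fields read: minute 0, hour 2, every day of the month, every month, any day of the week.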

Bottom Line

Python is the language data engineers use every day.

Master Python basics. Learn libraries (pandas, SQLAlchemy, requests). Write clean code.

You’ll be valuable to any data team.

Need help implementing this in your company?

For delivery-focused missions (Data Engineering, Data Architecture, Data Product Owner), visit ISData Consulting.