Databases (SGBD): The Foundation of Data Storage
What is a Database?
A database (SGBD - Système de Gestion de Base de Données in French) is organized data storage.
Think of it as a sophisticated filing system that lets you find, modify, and organize millions of records instantly.
Without databases, companies couldn’t operate. No way to store customer data, orders, inventory, nothing.
The Problem It Solves
Imagine a company without a database:
File storage:
- Customer data in 1000 CSV files
- Search for a customer? Open 1000 files. Find them manually.
- Update their info? Edit the file. Hope you don’t mess up.
- Ensure consistency? Good luck.
Database:
- All customer data in one place
- Search for a customer? Instant
- Update their info? One command
- Consistency guaranteed
- Millions of records? No problem
Databases make companies functional.
Types of Databases
Relational (SQL): Data in tables with relationships.
Customers table:
id | name | email | signup_date
Orders table:
id | customer_id | amount | date
Most common. Used everywhere. PostgreSQL, MySQL, SQL Server.
No-SQL (Document): Data in flexible documents.
{
"id": 1,
"name": "John",
"email": "john@example.com",
"orders": [...]
}
More flexible. Less structured. MongoDB, Cassandra.
Time-series: Data indexed by time.
timestamp | metric | value
2025-12-04 10:00 | cpu_usage | 45%
2025-12-04 10:01 | cpu_usage | 52%
For metrics, logs, sensors. InfluxDB, Prometheus.
Graph: Data with relationships.
Person -> knows -> Person
Person -> created -> Post
For networks, recommendations. Neo4j.
Data Warehouses: Optimized for analysis.
Snowflake, BigQuery, Redshift. Massive datasets. Complex queries. Fast analysis.
For data engineers: Master relational databases first. SQL is essential. Others come later.
How Relational Databases Work
Tables: Collections of related data.
Customers:
┌────┬──────────┬──────────────────┬─────────────┐
│ id │ name │ email │ signup_date │
├────┼──────────┼──────────────────┼─────────────┤
│ 1 │ Alice │ alice@example.com│ 2025-01-15 │
│ 2 │ Bob │ bob@example.com │ 2025-02-20 │
│ 3 │ Carol │ carol@example.com│ 2025-03-10 │
└────┴──────────┴──────────────────┴─────────────┘
Primary Key: Unique identifier for each row.
CREATE TABLE customers (
id INTEGER PRIMARY KEY, -- Unique
name TEXT
);
Foreign Key: Link to another table.
CREATE TABLE orders (
id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(id), -- Links to customers
amount DECIMAL
);
Index: Fast lookup.
CREATE INDEX idx_email ON customers(email);
-- Now searching by email is fast
Real Example: E-Commerce Database
Tables:
Customers: id, name, email, address
Orders: id, customer_id, order_date, total
OrderItems: id, order_id, product_id, quantity, price
Products: id, name, category, stock
Relationships:
Customer has many Orders
Order has many OrderItems
OrderItem references Product
Queries:
-- Find all orders for a customer
SELECT * FROM orders WHERE customer_id = 1;
-- Find what products Alice ordered
SELECT p.name, oi.quantity
FROM customers c
JOIN orders o ON c.id = o.customer_id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE c.name = 'Alice';
-- Find customers who spent more than $1000
SELECT c.name, SUM(o.total)
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name
HAVING SUM(o.total) > 1000;
Key Database Concepts
ACID: Guarantees data safety.
- Atomicity: Transaction completes or doesn’t. No partial updates.
- Consistency: Data is always valid.
- Isolation: Concurrent transactions don’t interfere.
- Durability: Once saved, data stays saved.
Transaction: Multiple operations as one unit.
BEGIN;
UPDATE customers SET balance = balance - 100 WHERE id = 1;
UPDATE customers SET balance = balance + 100 WHERE id = 2;
COMMIT;
-- Both updates succeed or both fail. No partial transfer.
Normalization: Organize data to avoid duplicates.
Bad: Store customer name in every order row (repeated data)
Good: Store customer_id in orders, name in customers (no duplication)
Denormalization: Sometimes duplicate for speed.
Data warehouse: Store customer name in orders table for faster queries
(slower to update, faster to query)
What Data Engineers Do With Databases
Extract: Pull data for processing.
SELECT * FROM orders WHERE created_at > NOW() - INTERVAL '1 day';
Validate: Check data quality.
SELECT COUNT(*) FROM customers WHERE email IS NULL;
Transform: Modify and aggregate.
SELECT
DATE_TRUNC('month', order_date) as month,
COUNT(*) as order_count,
SUM(amount) as total
FROM orders
GROUP BY month;
Load: Move cleaned data to warehouse.
INSERT INTO warehouse.customer_summary
SELECT customer_id, COUNT(*), SUM(amount)
FROM orders
GROUP BY customer_id;
Monitor: Track health.
SELECT COUNT(*) as row_count FROM orders;
SELECT MAX(updated_at) as last_update FROM customers;
Database Performance
Slow query:
SELECT * FROM orders WHERE order_date LIKE '%2025%';
-- Slow: Full table scan
Fast query:
CREATE INDEX idx_order_date ON orders(order_date);
SELECT * FROM orders WHERE order_date >= '2025-01-01';
-- Fast: Uses index
Slow join:
SELECT * FROM orders, customers WHERE orders.customer_id = customers.id;
-- Slow: No index on foreign key
Fast join:
-- Foreign key creates index automatically
SELECT * FROM orders
JOIN customers ON orders.customer_id = customers.id;
-- Fast: Uses foreign key index
Common Database Issues
Connection timeout: Too many connections.
-- Check connections
SELECT COUNT(*) FROM pg_stat_activity;
Full table scan: No index, slow query.
-- Add index
CREATE INDEX idx_status ON orders(status);
Data inconsistency: Data out of sync.
-- Use transactions to prevent this
BEGIN;
-- Multiple operations
COMMIT;
Backup needed: Disaster recovery.
pg_dump mydb > backup.sql
SQL vs No-SQL Databases
SQL (Relational):
- ✓ Structured data
- ✓ Complex queries
- ✓ ACID guarantees
- ✗ Less flexible schema
- ✗ Slower for unstructured data
No-SQL (Document):
- ✓ Flexible schema
- ✓ Fast for unstructured data
- ✓ Horizontal scaling
- ✗ Less structured
- ✗ Complex queries harder
For data engineering: Start with SQL. It’s more predictable and reliable.
Getting Started With Databases
Install PostgreSQL:
# Download from postgresql.org
# Or use Docker
docker run -e POSTGRES_PASSWORD=password postgres:15
Connect:
psql -U postgres
Create table:
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name TEXT,
email TEXT UNIQUE,
created_at TIMESTAMP DEFAULT NOW()
);
Insert data:
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com');
Query:
SELECT * FROM users;
Update:
UPDATE users SET name = 'Alice Smith' WHERE id = 1;
Delete:
DELETE FROM users WHERE id = 1;
Real Database Scenario
You’re building a pipeline that processes customer orders daily.
Step 1: Connect to production database
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@prod-db:5432/orders')
Step 2: Extract new orders
df = pd.read_sql(
'SELECT * FROM orders WHERE created_at > NOW() - INTERVAL 1 DAY',
engine
)
Step 3: Clean and transform
df = df.drop_duplicates()
df['total'] = df['qty'] * df['price']
Step 4: Load to warehouse
warehouse = create_engine('postgresql://user:pass@warehouse:5432/data')
df.to_sql('daily_orders', warehouse, if_exists='append')
Done. Happens daily. Data flows from production to warehouse.
Bottom Line
Databases are foundational to data engineering.
You must understand:
- How data is organized
- How to query it
- How to load and transform it
- Performance considerations
- ACID guarantees
Master SQL and relational databases. Everything else builds on that foundation.
Need help implementing this in your company?
For delivery-focused missions (Data Engineering, Architecture Data, Data Product Owner), visit ISData Consulting.