Ubuntu: The Linux Distribution Data Engineers Choose
What is Ubuntu?
Ubuntu is a free, open-source Linux distribution with a reputation for reliability.
Data engineers use it on their servers and often on their own machines.
In short: Ubuntu is Linux made easy.
Why Ubuntu Matters for Data Engineers
Server standard: Most production data infrastructure runs on Linux, and Ubuntu is one of the most widely used server distributions.
Free: Zero cost. Download and use forever.
Reliable: Rock-solid stability. Proven in production for 20+ years.
Community: Massive community. Help and documentation everywhere.
Container-friendly: Docker runs natively on Linux, and Ubuntu is a well-supported host for it.
Easy: Easier than other Linux distributions. Good for beginners.
Ubuntu vs Other Operating Systems
Windows:
- Familiar to most people
- Not standard for servers
- Harder to run data tools
macOS:
- Good for development
- Common in tech companies
- Expensive hardware
Ubuntu/Linux:
- Standard for servers and data infrastructure
- Free
- Powerful command line
- Professional
Professional data engineers use Linux. Ubuntu is the easiest entry point.
Installing Ubuntu
Option 1: Virtual Machine (Easiest for Windows/Mac users)
# Download VirtualBox (free)
# Download Ubuntu ISO (free)
# Create VM
# Boot from ISO
# Follow installation wizard
# Done
Option 2: Windows Subsystem for Linux (WSL)
# On Windows 10 (version 2004 or later) or Windows 11 (PowerShell as admin)
wsl --install -d Ubuntu
# Ubuntu runs inside Windows
# Command-line tools behave the same as on a native Ubuntu install
Option 3: Dual Boot
- Shrink an existing partition to free up disk space (back up your data first)
- Install Ubuntu
- Choose at startup which OS to boot
Option 4: Dedicate a server
- Buy a server
- Install Ubuntu on it
- Run 24/7
For learning, WSL or a virtual machine is the easiest route.
First Steps in Ubuntu
Open terminal:
Ctrl + Alt + T
Terminal appears. You’re in the shell.
Know where you are:
pwd
# /home/issa
See what’s here:
ls
# Desktop Documents Downloads Music Pictures Public Templates Videos
Create a folder:
mkdir my_project
cd my_project
pwd
# /home/issa/my_project
That’s navigation. Foundation of Linux.
Essential Ubuntu Commands
File operations:
# List files
ls
ls -la # Detailed list
ls -la *.txt # Only .txt files
# Create file
touch myfile.txt
# View file
cat myfile.txt
# Edit file
nano myfile.txt # Simple editor
vim myfile.txt # Powerful editor
# Copy
cp file.txt file_copy.txt
# Move/rename
mv file.txt newname.txt
# Delete
rm file.txt
# Delete folder (careful!)
rm -r folder_name
Directory operations:
# Make directory
mkdir folder
# Go to directory
cd folder
# Go back
cd ..
# Go home
cd ~
# Go to absolute path
cd /home/issa/projects
Text processing:
# View file
cat file.txt
# First 10 lines
head file.txt
# Last 10 lines
tail file.txt
# Search
grep "pattern" file.txt
# Count lines
wc -l file.txt
# Sort
sort file.txt
# Unique lines (uniq only drops adjacent duplicates, so sort first)
sort file.txt | uniq
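These commands are most useful when chained with pipes. The following is a minimal sketch, assuming a hypothetical events.log with one event name per line; the file name and its contents are made up for illustration.

```shell
#!/bin/bash
# Hypothetical sample data: one event type per line (made up for illustration)
workdir=$(mktemp -d)
printf '%s\n' login error login purchase error login > "$workdir/events.log"

# Count how often each distinct line occurs, most frequent first.
# sort groups identical lines together so that uniq -c can count them.
sort "$workdir/events.log" | uniq -c | sort -rn

# Number of distinct event types
distinct=$(sort -u "$workdir/events.log" | wc -l)
echo "distinct events: $distinct"   # distinct events: 3

# Number of lines containing "error"
errors=$(grep -c "error" "$workdir/events.log")
echo "error lines: $errors"         # error lines: 2
```

The same pattern (sort, then uniq -c, then sort -rn) works on any column you first extract with cut or awk.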
Permissions:
# View permissions
ls -la
# Change permissions (read, write, execute)
chmod 755 script.sh
# Make executable
chmod +x script.sh
System info:
# Who am I?
whoami
# Disk space
df -h
# Memory usage
free -h
# Running processes
ps aux
# Kill process (replace 1234 with the PID shown by ps aux)
kill 1234
Package Management (Installing Software)
Ubuntu uses apt to install software.
Update package list:
sudo apt update
Install software:
sudo apt install python3
sudo apt install postgresql
sudo apt install git
Remove software:
sudo apt remove postgresql # Avoid removing python3: Ubuntu system tools depend on it
Search for software:
apt search postgresql
Check version:
python3 --version
psql --version
git --version
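To check several tools at once, a small loop over command -v can report which ones are on the PATH. This is only a sketch: the tool list is illustrative, and psql is the PostgreSQL client command.

```shell
#!/bin/bash
# Report which tools are available on the PATH.
# The list below is illustrative; adjust it to your project.
missing=0
for tool in python3 psql git docker; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool ($(command -v "$tool"))"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```

Anything reported as missing can then be installed with sudo apt install.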
Install from source (when apt doesn’t have it):
sudo apt install build-essential # Compiler and make, needed to build
wget https://example.com/software.tar.gz
tar -xzf software.tar.gz
cd software
./configure
make
sudo make install
Users and Permissions
Current user:
whoami
# issa
Become admin (sudo):
sudo apt install python3
# Prompts for password
Create new user:
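sudo useradd -m -s /bin/bash john # -m creates the home directory
sudo passwd john # Set password
Change to another user:
su - john # The dash starts a login shell with john's environment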
Edit sudo permissions (advanced):
sudo visudo
# Be very careful here: a bad rule can lock you out of sudo
Scripting in Ubuntu
Write scripts to automate tasks.
Create script:
nano backup.sh
Write script:
#!/bin/bash
# Backup script
set -e # Stop if any command fails
echo "Starting backup..."
mkdir -p /backup # Needs write access to /backup
cp -r /home/issa/data "/backup/data_$(date +%Y%m%d)"
echo "Backup complete"
Make executable:
chmod +x backup.sh
Run it:
./backup.sh
# Starting backup...
# Backup complete
Schedule it (run daily at 2 AM):
crontab -e
# Add this line:
0 2 * * * /home/issa/backup.sh
# Save and exit
Boom. Automated daily backups.
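Here is a variation on the backup script, wrapped in a function that checks the source directory before copying. It is a sketch under assumptions: backup_dir is a hypothetical helper name, and the demo uses throwaway directories rather than /home/issa/data and /backup so it can be tried safely.

```shell
#!/bin/bash
# Stop on errors, unset variables, and failed pipeline stages
set -euo pipefail

# backup_dir SOURCE DEST_ROOT (hypothetical helper):
# copy SOURCE into DEST_ROOT/data_YYYYMMDD and print the target path
backup_dir() {
    local source_dir="$1" dest_root="$2"
    local target="$dest_root/data_$(date +%Y%m%d)"

    if [ ! -d "$source_dir" ]; then
        echo "Source directory not found: $source_dir" >&2
        return 1
    fi

    mkdir -p "$dest_root"
    # -a preserves permissions and timestamps; the trailing /. copies the contents
    cp -a "$source_dir/." "$target"
    echo "$target"
}

# Demo with throwaway directories (replace with real paths in practice)
src=$(mktemp -d)
dest=$(mktemp -d)
echo "sample" > "$src/data.txt"

echo "Starting backup..."
backup_path=$(backup_dir "$src" "$dest")
echo "Backup complete: $backup_path"
```

Because the script exits with a non-zero status on failure, cron can report the failure instead of leaving a silent half-finished backup.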
Real Ubuntu Data Engineering Setup
# Update system
sudo apt update
sudo apt upgrade
# Install Python (python3-venv is needed for virtual environments)
sudo apt install python3 python3-pip python3-venv
# Install PostgreSQL
sudo apt install postgresql postgresql-contrib
# Install Git
sudo apt install git
# Install Docker
sudo apt install docker.io
# Start services
sudo systemctl start postgresql
sudo systemctl start docker
# Optional: run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER
# Create project
mkdir my_data_project
cd my_data_project
# Create Python virtual environment
python3 -m venv venv
source venv/bin/activate
# Install libraries
pip install pandas sqlalchemy requests
# Write pipeline
nano pipeline.py
# Run it
python pipeline.py
That’s a professional data engineering setup. Free. Powerful.
System Services
Start and stop services:
# Start PostgreSQL
sudo systemctl start postgresql
# Stop PostgreSQL
sudo systemctl stop postgresql
# Restart PostgreSQL
sudo systemctl restart postgresql
# Check if running
sudo systemctl status postgresql
# Start on boot
sudo systemctl enable postgresql
# Disable on boot
sudo systemctl disable postgresql
# See all services
systemctl list-units --type=service
File Permissions Explained
ls -la
# -rw-r--r-- 1 issa issa 1024 Dec 4 10:00 file.txt
Breaking it down:
- "-" = file ("d" = directory)
- "rw-" = owner can read and write
- "r--" = group can read
- "r--" = others can read
- first "issa" = owner
- second "issa" = group
# 755 = rwxr-xr-x (owner full, others read+execute)
chmod 755 script.sh
# 644 = rw-r--r-- (owner read+write, others read)
chmod 644 file.txt
# 600 = rw------- (owner only, secret files)
chmod 600 ~/.ssh/id_rsa
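Each octal digit is the sum of read (4), write (2), and execute (1), given for owner, group, and others in that order. To see the mapping on real files, here is a small sketch assuming GNU stat, which Ubuntu ships; the file names are hypothetical and live in a throwaway directory.

```shell
#!/bin/bash
# Hypothetical files in a temporary directory
workdir=$(mktemp -d)
touch "$workdir/script.sh" "$workdir/secret.key"

chmod 755 "$workdir/script.sh"    # 7=rwx, 5=r-x, 5=r-x
chmod 600 "$workdir/secret.key"   # 6=rw-, 0=---, 0=---

# %a prints the octal mode, %A the symbolic form, %n the file name
stat -c '%a %A %n' "$workdir/script.sh" "$workdir/secret.key"

script_mode=$(stat -c '%a' "$workdir/script.sh")
secret_mode=$(stat -c '%A' "$workdir/secret.key")
echo "$script_mode"   # 755
echo "$secret_mode"   # -rw-------
```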
Networking
# Check IP address
ip addr
# Test connection (-c 4 sends four packets, then stops)
ping -c 4 google.com
# DNS lookup
nslookup example.com
# View network connections (ss replaces netstat, which is no longer installed by default)
ss -tuna
# Copy over network (SSH)
scp file.txt user@server:/remote/path
# Connect to server
ssh user@server
File Transfer
Copy files between computers:
# Copy from local to server
scp myfile.txt user@192.168.1.100:/home/user/
# Copy from server to local
scp user@192.168.1.100:/home/user/file.txt ./
# Copy folder
scp -r folder/ user@server:/remote/path
Environment Variables
Store configuration:
# View environment variables
env
# Set variable for this session
export API_KEY=secret123
# View one variable
echo $API_KEY
# Make permanent (add to ~/.bashrc)
nano ~/.bashrc
# Add this line:
export API_KEY=secret123
# Reload
source ~/.bashrc
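Scripts usually read these variables rather than hard-coding values. The sketch below uses two standard bash parameter expansions: a fallback default and a required-variable check. DB_HOST and GREETING are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical settings for illustration
export API_KEY=secret123
unset DB_HOST

# ${VAR:-default} uses the default when VAR is unset or empty
db_host="${DB_HOST:-localhost}"
echo "DB host: $db_host"               # DB host: localhost

# ${VAR:?message} aborts the script with the message when VAR is unset or empty
api_key="${API_KEY:?API_KEY must be set}"
echo "API key length: ${#api_key}"     # API key length: 9

# A variable set on the same line as a command is visible only to that command
greeting=$(GREETING=hello bash -c 'echo "$GREETING"')
echo "$greeting"                       # hello
```

Printing the length instead of the value keeps the secret out of logs and terminal history.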
Processes and Monitoring
# See running processes
ps aux
# Find specific process
ps aux | grep python
# Monitor in real-time
top
# Press 'q' to exit
# Memory usage
free -h
# Disk usage
du -sh /home/issa/
# CPU usage
vmstat 1
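Monitoring commands can also feed simple checks. This is a minimal sketch that warns when the root filesystem passes a threshold; the 90% value is an arbitrary assumption, not a recommendation from any tool.

```shell
#!/bin/bash
# Hypothetical threshold: warn when / is more than 90% full
threshold=90

# df -P prints one line per filesystem in a stable format;
# column 5 of the second line is the used percentage, e.g. "42%"
usage=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')

if [ "$usage" -gt "$threshold" ]; then
    echo "WARNING: / is ${usage}% full"
else
    echo "OK: / is ${usage}% full"
fi
```

Run from cron, a check like this turns a one-off command into basic alerting.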
Ubuntu Desktop vs Server
Desktop (GUI):
- Visual interface
- File manager
- Easy for learning
- Uses more resources
Server (command line):
- No GUI
- Terminal only
- Lighter weight
- Standard for production
As a data engineer, learn both: Desktop for learning, Server for production.
Real Data Engineering Workflow on Ubuntu
Day 1: Setup
# Install tools (python3-venv is needed for the virtual environment below)
sudo apt install python3 python3-pip python3-venv git postgresql docker.io
# Clone project
git clone https://github.com/company/data-pipeline.git
cd data-pipeline
# Create environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Test locally
python pipeline.py
Day 2: Development
# Make changes
nano pipeline.py
# Test
python pipeline.py
# Version control
git add pipeline.py
git commit -m "Fix SQL query"
git push
Day 3: Deployment
# Build Docker image, tagged with the registry name so it can be pushed
docker build -t registry.company.com/pipeline:v2 .
# Test in Docker
docker run registry.company.com/pipeline:v2
# Push to registry
docker push registry.company.com/pipeline:v2
# Server pulls and runs it
ssh user@server
docker pull registry.company.com/pipeline:v2
docker run registry.company.com/pipeline:v2
That’s real Ubuntu data engineering.
Ubuntu Tips and Tricks
Search for files:
find / -name "*.py" 2>/dev/null
Find and replace in files:
sed -i 's/old/new/g' file.txt
Redirect output:
python script.py > output.txt 2>&1
Pipe commands:
cat data.csv | grep "2025" | wc -l
Run in background:
python pipeline.py &
Check recent commands:
history
Run last command:
!!
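Several of these tricks can be tried together on sample data. The sketch below assumes GNU sed and a made-up data.csv created in a throwaway directory; none of the file contents come from a real pipeline.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file (contents made up for illustration)
printf '%s\n' 'date,amount' '2024-12-31,10' '2025-01-01,20' '2025-01-02,30' > data.csv

# Pipe: count the rows that mention 2025
rows_2025=$(grep "2025" data.csv | wc -l)
echo "rows for 2025: $rows_2025"   # rows for 2025: 2

# Find and replace in place, keeping a .bak copy of the original
sed -i.bak 's/amount/total/' data.csv
head -n 1 data.csv                 # date,total

# Redirect stdout and stderr into one log file
# (ls fails on the missing file, so || true keeps the script going)
ls data.csv no_such_file > output.txt 2>&1 || true
cat output.txt
```

Using -i.bak instead of plain -i leaves data.csv.bak behind, which is a cheap safety net when editing files in place.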
Ubuntu Community
Ask for help:
- Ask Ubuntu (askubuntu.com)
- Ubuntu Forums
- Stack Overflow
- Reddit (/r/Ubuntu, /r/linux)
Massive community. Someone’s already solved your problem.
Bottom Line
Ubuntu is the Linux distribution for data engineers.
Free. Powerful. Professional. Reliable.
Learn Ubuntu. Learn the command line. Learn basic system administration.
You’ll be more effective. More professional. More valuable.
Most production data infrastructure runs on Linux. Ubuntu is your entry point.
Start today. Download. Install. Learn.