Linux Fundamentals: How It Works and Why Data Engineers Use It

November 24, 2024

What is Linux?

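Linux is an operating system kernel. Free and open-source. In everyday use, "Linux" also means a complete operating system built around that kernel, such as Ubuntu or Debian.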

The operating system is the bridge between you and the computer hardware.

Linux powers:

The vast majority of data centers
Most servers in the world
Cloud infrastructure
Data engineering pipelines

If you work with data professionally, Linux is everywhere.

Unix Philosophy

Linux is based on Unix. Unix has a philosophy:

Do one thing well:

ls      # List files
grep    # Search text
cat     # Show content

Each tool does one job. Combine them:

cat data.csv | grep "2025" | wc -l
# Show file, search for 2025, count matches

Simple. Powerful. Composable.
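A minimal sketch of that pipeline on a throwaway file. The file contents and the workdir variable are made up for illustration; only the commands come from the text above.

```shell
# Build a small sample file in a temporary directory (hypothetical data)
workdir=$(mktemp -d)
printf '%s\n' \
  'id,date,amount' \
  '1,2024-12-30,10' \
  '2,2025-01-02,25' \
  '3,2025-01-03,40' > "$workdir/data.csv"

# The pipeline from the article: show file, keep lines with 2025, count them
piped=$(cat "$workdir/data.csv" | grep "2025" | wc -l)

# grep -c counts matching lines on its own and gives the same number
matches=$(grep -c "2025" "$workdir/data.csv")

echo "pipeline: $piped"
echo "grep -c:  $matches"
```

Both forms count matching lines, not individual occurrences of the text.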

Linux vs Windows vs macOS

Linux:

Free
Open-source
Powerful command line
Industry standard for servers
Often no GUI on servers
Steep learning curve (initially)

Windows:

Familiar to most people
Good GUI
Expensive
Closed source
Weak command line historically
Not standard for servers

macOS:

Built on Unix (similar to Linux)
Good for development
Expensive hardware
Closed source
Easier command line than Windows
But not standard for servers

Most professional servers run Linux. That’s the reality of the industry.

The Linux Filesystem

Linux organizes files in a tree:

/
├── bin/        # Essential commands
├── home/       # User home directories
│   ├── alice/
│   ├── bob/
│   └── issa/
├── etc/        # Configuration files
├── var/        # Variable data (logs, databases)
├── tmp/        # Temporary files
├── usr/        # Installed programs and libraries
├── opt/        # Optional software
└── root/       # Root user's home

Key paths:

/home/issa       # My home directory (~)
/home/issa/.ssh  # SSH keys
/home/issa/.bashrc # Shell configuration
/etc/postgresql  # PostgreSQL config
/var/log         # System logs
/tmp             # Temporary files

Users and Permissions

Root: Administrator. Can do anything. Dangerous.

Regular user: Limited permissions. Safe.

Groups: Users belong to groups with shared permissions.

# I am user 'issa' in groups 'issa' and 'docker'
id
# uid=1000(issa) gid=1000(issa) groups=1000(issa),999(docker)

Permission format:

ls -la myfile.txt
-rw-r--r-- 1 issa issa 1024 Dec 4 10:00 myfile.txt

Breaking it down:

- = file
rw- = owner (me) can read and write
r-- = group can read
r-- = others can read

Change permissions:

# Owner only
chmod 600 secret.txt

# Owner read+write+execute, others read+execute
chmod 755 script.sh

# Make executable
chmod +x script.sh
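To see what each chmod actually does, you can read the mode back. A small sketch, assuming GNU stat (the -c option is standard on Linux but differs on macOS); the temporary directory is only there so nothing real is touched.

```shell
workdir=$(mktemp -d)
touch "$workdir/script.sh"

# Owner read+write only
chmod 600 "$workdir/script.sh"
mode_private=$(stat -c '%a %A' "$workdir/script.sh")

# Owner read+write+execute, everyone else read+execute
chmod 755 "$workdir/script.sh"
mode_exec=$(stat -c '%a %A' "$workdir/script.sh")

echo "$mode_private"   # 600 -rw-------
echo "$mode_exec"      # 755 -rwxr-xr-x
```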

The Shell (Command Line)

The shell is how you talk to Linux.

Bash: The most common shell.

Prompts:

# Regular user
issa@ubuntu:~$

# Root user
root@ubuntu:~#

Commands:

# Syntax
command [options] [arguments]

# Examples
ls -la /home/issa
grep "error" logfile.txt
python3 script.py --verbose

Pipes (connect commands):

cat data.csv | grep "2025" | sort | uniq
# Show file, filter, sort, remove duplicates

Redirection (redirect output):

# Save to file
python3 script.py > output.txt

# Append to file
echo "log" >> log.txt

# Redirect error
python3 script.py 2> errors.txt

# Both output and error
python3 script.py > output.txt 2>&1
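A quick way to watch the two streams separate. The job function below is a hypothetical stand-in for a real script: it writes one line to standard output and one to standard error.

```shell
workdir=$(mktemp -d)

# Stand-in for a real job: one line on stdout, one on stderr
job() {
    echo "processed 100 rows"
    echo "warning: 2 rows skipped" >&2
}

# Split the streams into two files
job > "$workdir/output.txt" 2> "$workdir/errors.txt"

# Merge both streams into one file (the order of redirections matters)
job > "$workdir/combined.txt" 2>&1

cat "$workdir/output.txt"
cat "$workdir/errors.txt"
combined_lines=$(wc -l < "$workdir/combined.txt")
echo "combined file has $combined_lines lines"
```

Writing 2>&1 before the > would point stderr at the terminal instead, because redirections are applied left to right.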

File Operations

Navigate:

pwd                    # Where am I?
cd /home/issa          # Go to directory
cd ..                  # Go up
cd ~                   # Go home
ls -la                 # List with details

Create and view:

touch file.txt         # Create empty file
cat file.txt           # View content
head -20 file.txt      # First 20 lines
tail -20 file.txt      # Last 20 lines
wc -l file.txt         # Line count

Edit:

nano file.txt          # Simple editor
vim file.txt           # Powerful editor
sed -i 's/old/new/g' file.txt  # Find and replace

Copy, move, delete:

cp file.txt copy.txt       # Copy
cp -r folder/ copy_folder/ # Copy folder
mv file.txt newname.txt    # Move/rename
rm file.txt                # Delete
rm -r folder/              # Delete folder (careful!)

Permissions in Detail

rwx = read, write, execute

# 755 in binary: 111 101 101
# Owner: 111 = 7 = read+write+execute
# Group: 101 = 5 = read+execute
# Other: 101 = 5 = read+execute
chmod 755 script.sh

# 644 in binary: 110 100 100
# Owner: 110 = 6 = read+write
# Group: 100 = 4 = read
# Other: 100 = 4 = read
chmod 644 file.txt

# 600 in binary: 110 000 000
# Owner: 110 = 6 = read+write
# Group: 000 = 0 = no access
# Other: 000 = 0 = no access
chmod 600 secret.txt
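The arithmetic above can be sketched as a tiny converter: read = 4, write = 2, execute = 1, summed per triplet. perm_to_octal is a hypothetical helper name, not a standard command.

```shell
# Convert a 9-character permission string (e.g. rwxr-xr-x) to octal digits
perm_to_octal() {
    perms=$1
    result=""
    while [ -n "$perms" ]; do
        # Take the next three characters: owner, then group, then other
        triplet=$(printf '%s' "$perms" | cut -c1-3)
        perms=$(printf '%s' "$perms" | cut -c4-)
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac   # read
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write
        case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute
        result="$result$digit"
    done
    printf '%s\n' "$result"
}

perm_to_octal rwxr-xr-x   # 755
perm_to_octal rw-r--r--   # 644
perm_to_octal rw-------   # 600
```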

Environment and Variables

The environment stores configuration.

# View all variables
env

# View specific variable
echo $HOME
echo $PATH

# Set variable (this session only)
export API_KEY=secret123

# Use variable
curl -H "Authorization: $API_KEY" api.example.com

# Make permanent (add to ~/.bashrc)
nano ~/.bashrc
# Add: export API_KEY=secret123
source ~/.bashrc
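Why export matters: a plain assignment stays in the current shell, while an exported variable is passed to the programs you start. A minimal sketch; LOCAL_ONLY and PIPELINE_ENV are made-up names.

```shell
# Plain assignment: visible in this shell only
LOCAL_ONLY="not exported"

# export: also passed to child processes
export PIPELINE_ENV="staging"

# Each sh -c starts a child process and prints what it can see
child_local=$(sh -c 'echo "${LOCAL_ONLY:-unset}"')
child_exported=$(sh -c 'echo "${PIPELINE_ENV:-unset}"')

echo "child sees LOCAL_ONLY:   $child_local"
echo "child sees PIPELINE_ENV: $child_exported"
```

This is the reason a script or a Python process can read API_KEY only after it has been exported.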

Package Management

Different distributions use different package managers:

Ubuntu/Debian (apt):

sudo apt update              # Update package list
sudo apt install python3     # Install
sudo apt remove python3      # Remove (careful: system tools depend on python3)
sudo apt upgrade             # Upgrade all
apt search python            # Search (no sudo needed)

Red Hat/CentOS (yum, or dnf on newer releases):

sudo yum install python3
sudo yum remove python3
sudo yum update

Same concepts. Different syntax.

System Administration

Check disk space:

df -h              # Disk free
du -sh /home/issa/ # Folder size

View logs:

tail -f /var/log/syslog    # System log
tail -f /var/log/auth.log  # Auth log
journalctl -xe             # Journal

Monitor processes:

ps aux                     # All processes
ps aux | grep python       # Find process
top                        # Monitor live
kill 1234                  # Kill process
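A sketch of the full life cycle of a process, using sleep as a harmless stand-in for a long-running job. The PID comes from the shell itself, so nothing here depends on a real process number like 1234.

```shell
# Start a long-running stand-in process in the background
sleep 300 &
pid=$!

# kill -0 sends no signal; it only checks that the process exists
if kill -0 "$pid" 2>/dev/null; then
    before="running"
else
    before="stopped"
fi

# Plain kill sends SIGTERM, a polite request to stop
kill "$pid"

# wait collects the exit status: 128 + signal number (15 for SIGTERM)
status=0
wait "$pid" 2>/dev/null || status=$?

echo "before: $before, exit status after kill: $status"
```

If a process ignores SIGTERM, kill -9 forces it to stop, at the cost of giving it no chance to clean up.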

Manage services:

sudo systemctl start postgresql    # Start
sudo systemctl stop postgresql     # Stop
sudo systemctl restart postgresql  # Restart
sudo systemctl status postgresql   # Status
sudo systemctl enable postgresql   # Start on boot

Networking

Check connectivity:

ping google.com             # Test connection
curl https://api.example.com  # Make HTTP request
wget https://example.com/file.zip  # Download

Network info:

ip addr                    # IP address
ip route                   # Routing
ss -an                     # Connections (netstat -an on older systems)
nslookup example.com       # DNS lookup

Secure shell (SSH):

ssh user@example.com          # Connect
ssh -i key.pem user@server    # With private key
scp file.txt user@server:~/   # Copy file

Text Processing

Search:

grep "pattern" file.txt         # Find pattern
grep -i "pattern" file.txt      # Case insensitive
grep -r "pattern" /folder/      # Recursive
grep -E "regex.*pattern" file   # Regex

Process text:

cat file.txt          # Show
head -10 file.txt     # First 10 lines
tail -10 file.txt     # Last 10 lines
wc -l file.txt        # Count lines
sort file.txt         # Sort
uniq file.txt         # Drop adjacent duplicate lines (sort first)
cut -d, -f1 file.csv  # Extract column 1
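uniq only collapses duplicates that sit next to each other, which is why it is almost always paired with sort. A small sketch on made-up rows:

```shell
workdir=$(mktemp -d)
printf '%s\n' \
  'paris,2025-01-02' \
  'lyon,2025-01-02' \
  'paris,2025-01-03' \
  'paris,2025-01-04' > "$workdir/orders.csv"

# Without sort: paris, lyon, paris survive because only neighbours are merged
unsorted=$(cut -d, -f1 "$workdir/orders.csv" | uniq | wc -l)

# With sort: identical values become neighbours first
sorted=$(cut -d, -f1 "$workdir/orders.csv" | sort | uniq | wc -l)

# Count rows per city and keep the most frequent one
top_city=$(cut -d, -f1 "$workdir/orders.csv" | sort | uniq -c | sort -rn | head -1 | awk '{print $2}')

echo "uniq alone: $unsorted values"
echo "sort | uniq: $sorted values"
echo "most frequent city: $top_city"
```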

Find and replace:

sed 's/old/new/' file.txt       # Replace first
sed 's/old/new/g' file.txt      # Replace all
sed -i 's/old/new/g' file.txt   # In-place edit
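The difference between the three forms, shown on a one-line throwaway file. This sketch assumes GNU sed, as shipped on Linux; on macOS, -i needs an extra argument.

```shell
workdir=$(mktemp -d)
echo "old value, old key" > "$workdir/file.txt"

# Without -i, sed prints the result and leaves the file alone
first_only=$(sed 's/old/new/' "$workdir/file.txt")
all=$(sed 's/old/new/g' "$workdir/file.txt")
untouched=$(cat "$workdir/file.txt")

# With -i, the file itself is rewritten
sed -i 's/old/new/g' "$workdir/file.txt"
edited=$(cat "$workdir/file.txt")

echo "first only: $first_only"
echo "all:        $all"
echo "file before -i: $untouched"
echo "file after -i:  $edited"
```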

Scripting

Write reusable scripts:

#!/bin/bash
# Comment

# Variables
NAME="issa"
AGE=30

# Echo output
echo "Hello $NAME"

# Conditionals
if [ "$AGE" -gt 18 ]; then
    echo "Adult"
else
    echo "Minor"
fi

# Loops
for i in 1 2 3; do
    echo "Number $i"
done

# Functions
backup() {
    echo "Backing up..."
    cp -r /data "/backup/data_$(date +%Y%m%d)"
}

backup

Make executable and run:

chmod +x script.sh
./script.sh
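A slightly more defensive take on the backup function, as a sketch rather than a finished tool. backup_dir and its two arguments are hypothetical; the example runs against throwaway directories instead of /data.

```shell
#!/bin/bash
set -eu   # stop on errors and on unset variables

# Copy SOURCE into DEST_ROOT/<name>_YYYYMMDD and print the target path
backup_dir() {
    source_dir=$1
    dest_root=$2
    if [ ! -d "$source_dir" ]; then
        echo "missing source: $source_dir" >&2
        return 1
    fi
    target="$dest_root/$(basename "$source_dir")_$(date +%Y%m%d)"
    mkdir -p "$dest_root"
    cp -r "$source_dir" "$target"
    echo "$target"
}

# Try it on throwaway directories
workdir=$(mktemp -d)
mkdir "$workdir/data"
echo "row" > "$workdir/data/orders.csv"

created=$(backup_dir "$workdir/data" "$workdir/backup")
echo "backup written to $created"
```

Passing paths as arguments and returning a non-zero status on failure makes the function usable from cron, where nobody is watching the output.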

Cron Jobs (Scheduling)

Run scripts automatically on schedule:

# Edit cron
crontab -e

# Add jobs (keep comments on their own lines):
# Daily at 2 AM
0 2 * * * /home/issa/backup.sh
# Every 6 hours
0 */6 * * * /home/issa/clean.sh
# Weekly, Sunday at midnight
0 0 * * 0 /home/issa/weekly_report.sh

Format:

minute hour day month weekday command
0      2    *   *     *       /script.sh
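To make the field order concrete, here is a sketch that splits a crontab line into its five schedule fields plus the command. explain_cron is a hypothetical helper; it labels the fields and does not validate them.

```shell
# Label the five schedule fields of a crontab line
explain_cron() {
    # read splits on whitespace; the last variable receives the rest of the line
    read -r minute hour day month weekday command <<EOF
$1
EOF
    echo "minute=$minute hour=$hour day=$day month=$month weekday=$weekday command=$command"
}

explain_cron '0 2 * * * /home/issa/backup.sh'
explain_cron '0 */6 * * * /home/issa/clean.sh'
explain_cron '0 0 * * 0 /home/issa/weekly_report.sh'
```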

Data Engineer Workflow on Linux

Extract data:

psql -U user -d database -c "\copy (SELECT * FROM orders) TO STDOUT WITH CSV" > data.csv

Transform data (using tools):

cat data.csv | cut -d, -f1,2 | grep "2025" > filtered.csv

Transform data (using Python):

python3 transform.py < raw.csv > processed.csv

Load data:

psql -U user -d database -c "\copy processed FROM 'processed.csv' WITH CSV"

Schedule pipeline:

# Add to crontab
0 2 * * * /home/issa/data_pipeline.sh

All coordinated through Linux.
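One way a script like data_pipeline.sh might be laid out. This is a sketch under loud assumptions: the extract and load steps are stubs that use local files so it runs without a database, and the rows are made up. In a real pipeline those two functions would call psql.

```shell
#!/bin/bash
set -eu   # abort the whole pipeline if any step fails

workdir=$(mktemp -d)

# Timestamped progress messages go to stderr, away from the data
log() { echo "$(date +%H:%M:%S) $*" >&2; }

# Stub for the psql extract step: writes hypothetical rows
extract() {
    printf '%s\n' '1,2024-12-31,10' '2,2025-01-02,25' '3,2025-01-03,40' > "$workdir/data.csv"
}

# Keep columns 1 and 2 of the rows that mention 2025
transform() {
    cut -d, -f1,2 "$workdir/data.csv" | grep "2025" > "$workdir/filtered.csv"
}

# Stub for the psql load step: reports how many rows would be loaded
load() {
    wc -l < "$workdir/filtered.csv"
}

log "extract";   extract
log "transform"; transform
log "load";      rows=$(load)
log "done, $rows rows"
```

With set -e, a failed step stops the run, so cron never loads half-transformed data.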

Linux File Ownership

# View owner
ls -la myfile.txt
# -rw-r--r-- 1 issa issa 1024 Dec 4 10:00 myfile.txt
#              owner group

# Change owner
sudo chown alice:alice myfile.txt

# Change group
sudo chgrp developers myfile.txt

# Both
sudo chown alice:developers myfile.txt

The Power of Linux

Composability: Combine simple tools.

find . -name "*.log" | xargs wc -l | sort -rn | head -5
# Find log files, count lines, sort by line count, show top 5
# (with several files, the first row is the grand total)
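File names with spaces break a plain find | xargs. A sketch of the safer variant, using the -print0 and -0 options available in GNU find and xargs; the log files are made up.

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/app.log"
printf 'a\n'       > "$workdir/db errors.log"   # note the space in the name

# -print0 and -0 separate names with NUL bytes, so spaces are safe.
# grep -v drops the "total" row that wc adds for multiple files.
largest=$(find "$workdir" -name "*.log" -print0 \
    | xargs -0 wc -l \
    | grep -v ' total$' \
    | sort -rn \
    | head -1)

echo "$largest"
```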

Automation: Scripts run unattended.

# Backup runs at 2 AM every day automatically
0 2 * * * /backup.sh

Remote work: SSH into servers from anywhere.

ssh user@data.company.com
# You're on the server. Run commands. Work remotely.

Scripting: Automate everything.

# One script extracts, transforms, loads, monitors
# Runs daily automatically

Linux Learning Path

Week 1: Navigation and basic commands

cd, ls, cat, nano, cp, mv, rm

Week 2: Permissions and users

chmod, chown, sudo, useradd

Week 3: Pipes and redirection

|, >, >>, 2>, grep, sed

Week 4: Scripting and automation

Bash scripts, cron jobs

Week 5+: System administration

Services, logs, networking

Bottom Line

Linux is the foundation of data infrastructure.

Understanding Linux means:

You can work on servers
You can automate tasks
You understand how systems work
You’re more valuable

Master the fundamentals. The rest builds on that.

Start with Ubuntu. Learn the command line. You’ll become a better engineer.

Linux is not optional for data engineers. It’s essential.
