
System Engineering & DevOps




A full progression through Linux, shell scripting, networking, web server management, configuration management with Puppet, HTTPS/SSL, API scripting, and web stack debugging at ALX/Holberton.

Bash, Linux, Nginx, HAProxy, Puppet, Python, SSH, ufw, Datadog, Ruby
View repository

A structured progression through everything that happens below the application layer: shell scripting, Linux permissions and file I/O, networking concepts, web server configuration, infrastructure automation with Puppet, HTTPS termination with HAProxy, API scripting in Python, and systematic web stack debugging. Every deliverable is a script or manifest that can be run directly on a server — nothing theoretical.


Shell Basics

Before writing any server automation, the fundamentals of navigating and manipulating a Linux filesystem were drilled through focused one-liners. Each script is a single command that solves one problem exactly.

# Print the current working directory
pwd

# List all files including hidden, in long format
ls -al

# Navigate to home directory
cd ~

# Navigate back to the previous directory
cd -

# Create a nested directory tree in one command
mkdir -p welcome/to/school

# Move all files with uppercase names to /tmp/u
mv [[:upper:]]* /tmp/u

# Delete all Emacs backup files (ending in ~)
rm *~

# Create a symbolic link __ls__ that points to /bin/ls
ln -s /bin/ls __ls__

# Copy HTML files to parent directory only if newer or missing at destination
cp -u *.html ..

# List files in multiple directories in one call
ls -la . .. /boot

Shell Permissions

Linux permission bits are set with both symbolic and octal notation. Both forms were practiced so the relationship between them is intuitive.

# Switch to user betty
su betty

# Print the current username
id -un

# Add execute permission for the owner
chmod u+x hello

# Set owner execute, group execute, and other read in one call
chmod u+x,g+x,o+r hello

# Grant execute to all (user, group, other) using combined notation
chmod ugo+x hello

# Octal: only other gets full permissions — the 007 pattern
chmod 007 hello

# Octal: owner rwx, group r-x, other -wx = 753
chmod 753 hello

# Mirror permissions from one file to another
chmod --reference=olleh hello

# Recursively add execute for directories only (capital X skips regular files)
chmod -R ugo+X .

# Create a directory with permissions already set to 751
mkdir -m 751 my_dir

# Change owner
chown betty hello

# Change group
chgrp school hello

# Change owner and group recursively, including symlinks (-h)
chown -hR vincent:staff .

# Change owner only if current owner matches a specific user
chown --from=guillaume betty hello
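To confirm the two notations really do set the same bits, a small sketch (file names demo_a and demo_b are arbitrary) applies 753 both ways and compares:

```shell
cd "$(mktemp -d)"
touch demo_a demo_b
chmod 753 demo_a                  # octal: owner rwx, group r-x, other -wx
chmod u=rwx,g=rx,o=wx demo_b      # the symbolic equivalent
stat -c '%a %A' demo_a demo_b     # both lines read: 753 -rwxr-x-wx
```

stat -c '%a' prints the octal mode back, which makes it a handy way to check any symbolic chmod against the octal value you intended.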

Shell Redirections and Text Processing

The redirections module covers the full pipeline toolkit: grep, cut, sort, uniq, tr, awk, find, and rev — each composable with pipes.

# Redirect ls output to a file
ls -la > ls_cwd_content

# Append the last line of a file back to itself
tail -n 1 iacta >> iacta

# Extract the third line of a file by piping head into tail
head -n 3 iacta | tail -n 1

# Count all subdirectories (excluding . itself)
find . -type d -not -name '.' | wc -l

# List 10 newest files, most recent first
ls -t1 | head -n 10

# Print lines that appear exactly once (unique lines only)
sort | uniq -u

# Case-insensitive search for "root" in /etc/passwd
grep -i "root" /etc/passwd

# Print 3 lines after each match
grep -i "root" -A 3 /etc/passwd

# Print only lines NOT containing "bin"
grep -i -v "bin" /etc/passwd

# Match lines starting with a letter (filter comment lines in sshd_config)
grep -i "^[a-z]" /etc/ssh/sshd_config

# Translate characters: A→Z, c→e
tr "A" "Z" | tr "c" "e"

# Delete all occurrences of c and C from stdin
tr -d "cC"

# Reverse each line of input
rev

# Print username and home directory from /etc/passwd, sorted
cut -d ':' -f 1,6 /etc/passwd | sort

# Find all empty files/dirs, print just the filename (not the path)
find . -empty | rev | cut -d '/' -f 1 | rev

# Find .gif files, strip extension and path, sort case-insensitively
find . -type f -name "*.gif" | rev | cut -d "/" -f 1 | cut -d '.' -f 2- | rev | LC_ALL=C sort -f

# Recursively delete all .js files
find . -type f -name "*.js" -delete

# Extract first character of each line and join into one string (acrostic)
cut -c 1 | paste -s -d ''

The most complex pipeline in the module parses a Twitter log to find the top 11 most prolific tweeters — skip the header, extract field 1, count unique occurrences, sort by frequency descending, take the top 11, then strip the count prefix:

# 0x02-shell_redirections/103-the_biggest_fan

tail -n +2 | cut -f -1 | sort -k 1 | uniq -c | sort -rnk 1 | head -n 11 | rev | cut -d ' ' -f -1 | rev
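Run against a tiny hypothetical log (tab-separated, header first), the pipeline reduces to the ranked handles:

```shell
printf 'handle\ttweet\nalice\thi\nbob\tyo\nalice\they\n' |
    tail -n +2 | cut -f -1 | sort -k 1 | uniq -c | sort -rnk 1 |
    head -n 11 | rev | cut -d ' ' -f -1 | rev
# alice tweeted twice, bob once:
# alice
# bob
```

The rev | cut | rev trick at the end strips the leading count that uniq -c prepends, leaving only the handle.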

Shell Variables and Expansions

Bash arithmetic, environment variables, and base conversion in pure shell.

# Add /action to PATH without overwriting existing entries
export PATH=$PATH:/action

# Count path entries by counting ':/' occurrences and adding 1
echo $((`echo $PATH | grep -o ":/" | wc -l`+ 1))

# Print all environment variables
printenv

# Print all variables (local + global)
set

# Create a local variable
BEST="School"

# Create and export a global variable
export BEST=School

# Arithmetic expansion: add 128 to an environment variable
echo $(($TRUEKNOWLEDGE + 128))

# Integer division using two environment variables
echo $(($POWER / $DIVIDE))

# Exponentiation using two environment variables
echo $((BREATH**$LOVE))

# Convert binary string in $BINARY to decimal
echo "$((2#$BINARY))"

# Convert decimal in $DECIMAL to hex using printf
printf '%x\n' $DECIMAL

# Print a float with 2 decimal places
printf "%.2f\n" "$NUM"

# Generate all two-letter combinations from aa to zz, excluding "oo"
echo {a..z}{a..z} | tr " " "\n" | grep -v "oo"

# ROT13 in one line using tr character ranges
tr 'A-Za-z' 'N-ZA-Mn-za-m'

# Print only odd-numbered lines using Perl
perl -lne 'print if $. % 2 ==1'
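The base-N expansion and printf conversions round-trip; a quick sketch with fixed example values (101010 and 255 are arbitrary):

```shell
BINARY=101010
echo "$((2#$BINARY))"       # binary 101010 -> decimal 42
DECIMAL=255
printf '%x\n' "$DECIMAL"    # decimal 255 -> hex ff
echo "$((16#ff))"           # and hex parses straight back: 255
```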

The most unusual script in this module encodes two words ($WATER and $STIR) in base 5, adds them, converts to octal, then maps the digits to letters from a custom alphabet:

# 0x03-shell_variables_expansions/103-water_and_stir

echo $(printf %o $(($((5#$(echo $WATER | tr 'water' '01234'))) + $((5#$(echo $STIR | tr 'stir.' '01234'))))) | tr '01234567' 'bestchol')
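Unpacked step by step with hypothetical inputs (WATER=wat, STIR=ti), the same arithmetic is easier to follow:

```shell
WATER=wat
STIR=ti
w=$((5#$(echo "$WATER" | tr 'water' '01234')))   # "wat" -> "012" -> 7 in base 5
s=$((5#$(echo "$STIR" | tr 'stir.' '01234')))    # "ti"  -> "12"  -> 7 in base 5
printf '%o\n' $((w + s)) | tr '01234567' 'bestchol'  # 14 -> octal "16" -> "eo"
```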

Loops, Conditions, and Parsing

Three loop types (for, while, until) and case statements, all applied to real parsing tasks.

# for loop — print "Best School" 10 times
for (( c=1; c<=10; c++ ))
do
    echo "Best School"
done

# while loop equivalent
COUNTER=0
while [ $COUNTER -lt 10 ]
do
    echo "Best School"
    let COUNTER=COUNTER+1
done

# until loop — count down instead of up
COUNTER=10
until [ $COUNTER -lt 1 ]
do
    echo "Best School"
    let COUNTER-=1
done

case is cleaner than chained elif for discrete value matching:

# 0x04-loops_conditions_and_parsing/6-superstitious_numbers

COUNTER=1
while [ $COUNTER -lt 21 ]
do
    case $COUNTER in
        4)  echo $COUNTER; echo "bad luck from China" ;;
        9)  echo $COUNTER; echo "bad luck from Japan" ;;
        17) echo $COUNTER; echo "bad luck from Italy" ;;
        *)  echo $COUNTER ;;
    esac
    let COUNTER+=1
done

File existence testing with compound conditions:

# 0x04-loops_conditions_and_parsing/9-to_file_or_not_to_file

FILE="school"
if [ -e "$FILE" ]; then
    echo "school file exists"
    if [ ! -s "$FILE" ]; then
        echo "school file is empty"
    else
        echo "school file is not empty"
    fi
    if [ -f "$FILE" ]; then
        echo "school is a regular file"
    fi
else
    echo "school file does not exist"
fi

The while IFS=: read -r pattern is the correct way to parse colon-delimited files like /etc/passwd without relying on cut in a loop:

# 0x04-loops_conditions_and_parsing/101-tell_the_story_of_passwd

while IFS=: read -r f1 f2 f3 f4 f5 f6 f7
do
    echo "The user $f1 is part of the $f4 gang, lives in $f6 and rides $f7. \
$f3's place is protected by the passcode $f2, more info about the user here: $f5"
done < /etc/passwd
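A one-line illustration of the same splitting, fed a single made-up passwd entry instead of the real file:

```shell
printf 'betty:x:1000:1000:Betty:/home/betty:/bin/bash\n' |
while IFS=: read -r name pass uid gid gecos home shell
do
    echo "$name has UID $uid and logs in with $shell"
done
# betty has UID 1000 and logs in with /bin/bash
```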

Apache log parsing with awk — extract IP and HTTP status, count occurrences, sort by frequency:

# 0x04-loops_conditions_and_parsing/102-lets_parse_apache_logs
awk '{print $1,$9}' apache-access.log

# 0x04-loops_conditions_and_parsing/103-dig_the-data — with frequency count
awk '{print $1,$9}' apache-access.log | sort | uniq -c | sort -nr
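Why $9? In the combined log format the timestamp spans two whitespace-separated fields ([date tz]), which pushes the status code out to field 9. One fabricated line shows it:

```shell
printf '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.1" 200 42\n' \
    | awk '{print $1,$9}'
# 1.2.3.4 200
```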

Regular Expressions (Ruby)

The regex exercises use Ruby's .scan method — a useful contrast to grep because it returns an array of matches rather than lines.

# Match literal "School"
puts ARGV[0].scan(/School/).join

# Match "hbt" with 2–5 t's: hbttn, hbtttn, hbtttn, hbttttn, hbtttttn
puts ARGV[0].scan(/hbt{2,5}n/).join

# Match hbtn ({1}? is a lazy quantifier, though it still requires exactly one b)
puts ARGV[0].scan(/hb{1}?tn/).join

# Match one or more t's (+ quantifier)
puts ARGV[0].scan(/hbt+n/).join

# Match zero or more t's (* quantifier)
puts ARGV[0].scan(/hbt*n/).join

# Match exactly a 3-character string: starts with h, ends with n
puts ARGV[0].scan(/^h.n$/).join

# Validate a phone number: 1–10 digits, nothing else
puts ARGV[0].scan(/^\d{1,10}$/).join

# Extract only uppercase characters
puts ARGV[0].scan(/[A-Z]+/).join

The most complex regex in the module extracts sender, recipient, and flags from a raw SMS log format using lookbehind assertions:

# 0x06-regular_expressions/100-textme.rb

puts ARGV[0].scan(/(?<=from|to|flags):(\+?\w+|[-?[0-1]:?]+)/).join(',')

Networking Basics

The networking module covers OSI model concepts, TCP/UDP differences, and hands-on diagnostic commands.

# Show all listening sockets with PIDs
sudo netstat -l --program

# Ping an IP address for 5 seconds, with argument validation
if [ "$1" ]; then
    ping -w 5 "$1"
else
    echo "Usage: 5-is_the_host_on_the_network {IP_ADDRESS}"
fi

Manipulating /etc/hosts directly (useful for testing DNS before propagation):

# 0x08-networking_basics_2/0-change_your_home_IP

cp /etc/hosts ~/hosts.new
sed -i 's/127\.0\.0\.1/127.0.0.2/' ~/hosts.new   # remap localhost
echo "8.8.8.8 facebook.com" >> ~/hosts.new       # override facebook DNS
cp -f ~/hosts.new /etc/hosts

Display all active IPv4 addresses on the machine:

ifconfig | grep 'inet addr:' | cut -d: -f2 | cut -d" " -f1

Open a port and listen for incoming connections — useful for testing connectivity between machines:

nc -l 98

SSH Configuration

SSH key generation with 4096-bit RSA and a passphrase:

# 0x0B-ssh/1-create_ssh_key_pair

ssh-keygen -b 4096 -f school -t rsa -N betty

Connecting using a specific private key:

ssh -i ~/.ssh/school ubuntu@18.205.38.219

The SSH client config disables password authentication and specifies the identity file — so ssh server just works without flags:

# 0x0B-ssh/2-ssh_config

Host *
    PasswordAuthentication no
    IdentityFile ~/.ssh/school

The same config can be enforced via Puppet's file_line resource, which ensures specific lines exist in the config file without overwriting the whole thing:

# 0x0B-ssh/100-puppet_ssh_config.pp

include stdlib

file_line { 'Turn off passwd auth':
  ensure  => present,
  path    => '/etc/ssh/ssh_config',
  line    => '    PasswordAuthentication no',
  replace => true,
}

file_line { 'Declare identity file':
  ensure  => present,
  path    => '/etc/ssh/ssh_config',
  line    => '    IdentityFile ~/.ssh/school',
  replace => true,
}

Web Server Setup

Installing Nginx, writing a custom config, and adding a redirect rule — all in a single idempotent Bash script that can run on a fresh server:

# 0x0C-web_server/3-redirection (abbreviated)

apt-get update
apt-get -y install nginx
sudo ufw allow 'Nginx HTTP'
mkdir -p /var/www/html/
echo 'Hello World!' > /var/www/html/index.html

SERVER_CONFIG="server {
    listen 80 default_server;
    listen [::]:80 default_server;
    root /var/www/html;
    index index.html index.htm;
    server_name _;

    location / {
        try_files \$uri \$uri/ =404;
    }

    if (\$request_filename ~ redirect_me){
        rewrite ^ https://sketchfab.com/bluepeno/models permanent;
    }
}"

bash -c "echo -e '$SERVER_CONFIG' > /etc/nginx/sites-enabled/default"

if [ "$(pgrep -c nginx)" -le 0 ]; then
    service nginx start
else
    service nginx restart
fi

The 404 page version adds a custom error page and demonstrates error_page with an internal location block (not directly accessible via URL):

error_page 404 /404.html;
location /404.html {
    internal;
}

The same Nginx setup can be expressed as a Puppet manifest, making it reproducible and version-controlled. The manifest manages the package, config file, index file, 404 page, and service state as separate resources:

# 0x0C-web_server/7-puppet_install_nginx_web_server.pp (abbreviated)

package { 'nginx':
  ensure => 'installed',
}

file { '/var/www/html/index.html':
  content => "Hello World!\n",
}

file { 'Nginx default config file':
  ensure  => file,
  path    => '/etc/nginx/sites-enabled/default',
  content => "server {
    listen 80 default_server;
    ...
    error_page 404 /404.html;
    location /404.html { internal; }
    if (\$request_filename ~ redirect_me){
        rewrite ^ https://www.youtube.com/watch?v=QH2-TGUlwu4 permanent;
    }
}",
}

service { 'nginx':
  ensure  => running,
  require => Package['nginx'],
}

Web Stack Debugging

Each debugging exercise presents a broken server and requires a fix script that leaves the system in a working state.

Debugging 0 — Apache fails to start because ServerName is not set, causing the Could not reliably determine the server's fully qualified domain name error on startup:

# 0x0D-web_stack_debugging_0/0-give_me_a_page

echo "ServerName localhost" >> /etc/apache2.conf
service apache2 start

Debugging 1 — Nginx is configured to listen on port 8080 instead of 80. The fix uses sed to rewrite the port in-place:

# 0x0E-web_stack_debugging_1/1-debugging_made_short

sed -i "s/8080/80/g" /etc/nginx/sites-enabled/default
service nginx restart
echo "" > /run/nginx.pid

A broken symlink caused the initial issue in the same module — sites-enabled/default was pointing nowhere:

# 0x0E-web_stack_debugging_1/0-nginx_likes_port_80

rm /etc/nginx/sites-enabled/default
ln -s /etc/nginx/sites-available/default /etc/nginx/sites-enabled/default
service nginx restart

Debugging 2 — Nginx is running as root; the fix makes it run as the nginx user and listen on port 8080. The compressed version does the same job in seven lines with extended regex:

# 0x12-web_stack_debugging_2/100-fix_in_7_lines_or_less

pkill -f apache2
chmod 644 /etc/nginx/nginx.conf
sed -Ei 's/\s*#?\s*user .*/user nginx;/' /etc/nginx/nginx.conf
sed -Ei 's/(listen (\[::\]:)?80) /\180 /' /etc/nginx/sites-enabled/default
sudo -u nginx service nginx restart

Debugging 3 — A WordPress site on Apache returns 500 errors. strace reveals that PHP files are being required with a .phpp extension typo in wp-settings.php. The Puppet fix patches it with sed:

# 0x17-web_stack_debugging_3/0-strace_is_your_friend.pp

exec { 'fix-wordpress':
  command => 'sed -i s/phpp/php/g /var/www/html/wp-settings.php; sudo service apache2 restart',
  path    => ['/bin', '/usr/bin', '/usr/sbin']
}

Debugging 4 — Nginx hits file descriptor limits under load. The fix bumps the ULIMIT from 15 to 4096 in /etc/default/nginx:

# 0x1B-web_stack_debugging_4/0-the_sky_is_the_limit_not.pp

exec { 'fix-for-nginx':
  command => 'sed -i "s/15/4096/" /etc/default/nginx',
  path    => '/usr/local/bin/:/bin/'
} ->

exec { 'nginx-restart':
  command => 'nginx restart',
  path    => '/etc/init.d/'
}

The same module also fixes user file descriptor limits, raising both soft and hard limits for a specific user in /etc/security/limits.conf:

# 0x1B-web_stack_debugging_4/1-user_limit.pp

exec { 'increase-hard-file-limit-for-holberton-user':
  command => 'sed -i "/holberton hard/s/5/50000/" /etc/security/limits.conf',
  path    => '/usr/local/bin/:/bin/'
}

exec { 'increase-soft-file-limit-for-holberton-user':
  command => 'sed -i "/holberton soft/s/4/50000/" /etc/security/limits.conf',
  path    => '/usr/local/bin/:/bin/'
}

Configuration Management with Puppet

Three core Puppet resource types were practiced: file, package, and exec.

# 0x0A-configuration_management/0-create_a_file.pp
# Create a file with specific content, permissions, owner, and group

file { '/tmp/school':
  content => 'I love Puppet',
  mode    => '0744',
  owner   => 'www-data',
  group   => 'www-data',
}
# 0x0A-configuration_management/1-install_a_package.pp
# Pin a specific package version using exec (apt selects versions with pkg=version)

exec { 'puppet-lint':
  command => '/usr/bin/apt-get -y install puppet-lint=2.5.0',
}
# 0x0A-configuration_management/2-execute_a_command.pp
# Kill a process by name using pkill via the shell provider

exec { 'pkill':
  command  => 'pkill killmenow',
  provider => 'shell',
}

The provider => 'shell' attribute runs the command through a shell, which is needed for shell builtins and pipelines; note that exec has no default search path, so commands must otherwise be fully qualified or given a path attribute.


Firewall with ufw

Block all incoming traffic by default, then punch holes for SSH, HTTP, and HTTPS:

# 0x13-firewall/0-block_all_incoming_traffic_but

sudo ufw --force enable          # --force skips the confirmation prompt
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp    # SSH
sudo ufw allow 443/tcp   # HTTPS
sudo ufw allow 80/tcp    # HTTP

The ordering matters — setting defaults before adding exceptions ensures no window where the rules are in an inconsistent state.


HTTPS and HAProxy

HAProxy is configured as a TLS termination proxy in front of two backend web servers. SSL is terminated at the load balancer — backends receive plain HTTP on port 80. The config shows two stages: the initial TLS setup, then adding a 301 redirect to force all HTTP to HTTPS:

# 0x10-https_ssl/1-haproxy_ssl_termination

frontend th3gr00t-tech-frontend
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/th3gr00t.tech.pem
    http-request redirect scheme https unless { ssl_fc }
    http-request set-header X-Forwarded-Proto https
    default_backend th3gr00t-tech-backend

backend th3gr00t-tech-backend
    balance roundrobin
    server 453-web-01 35.243.128.200:80 check
    server 453-web-02 3.239.120.96:80 check

Adding code 301 makes the redirect permanent (cached by browsers):

# 0x10-https_ssl/100-redirect_http_to_https

http-request redirect scheme https code 301 unless { ssl_fc }

The subdomain inspector script uses dig to query DNS records and report the record type and IP for any subdomain:

# 0x10-https_ssl/0-world_wide_web (abbreviated)

dig_cmd () {
    INFO="$(dig "$2.$domain" | grep -A1 'ANSWER SECTION:' | awk 'NR==2')"
    IP=$(echo "$INFO" | awk '{print $5}')
    RECORD=$(echo "$INFO" | awk '{print $4}')
    echo "The subdomain $2 is a $RECORD record and points to $IP"
}

domain=$1
sub=$2

if [ $# -eq 1 ]; then
    for subs in "www" "lb-01" "web-01" "web-02"; do
        dig_cmd "$domain" "$subs"
    done
elif [ $# -eq 2 ]; then
    dig_cmd "$domain" "$sub"
fi

API Scripting in Python

Four scripts consume the JSONPlaceholder REST API, progressively building from a single user query to a full multi-user export.

Script 0 — Fetch and display a user's TODO progress

# 0x15-api/0-gather_data_from_an_API.py

import re
import sys

import requests

API = 'https://jsonplaceholder.typicode.com'

if __name__ == '__main__':
    if len(sys.argv) > 1:
        if re.fullmatch(r'\d+', sys.argv[1]):
            id = int(sys.argv[1])
            user_res  = requests.get('{}/users/{}'.format(API, id)).json()
            todos_res = requests.get('{}/todos'.format(API)).json()

            user_name  = user_res.get('name')
            todos      = list(filter(lambda x: x.get('userId') == id, todos_res))
            todos_done = list(filter(lambda x: x.get('completed'), todos))

            print('Employee {} is done with tasks({}/{}):'.format(
                user_name, len(todos_done), len(todos)
            ))
            for todo in todos_done:
                print('\t {}'.format(todo.get('title')))

Script 1 — Export to CSV

# 0x15-api/1-export_to_CSV.py

with open('{}.csv'.format(id), 'w') as file:
    for todo in todos:
        file.write('"{}","{}","{}","{}"\n'.format(
            id,
            user_name,
            todo.get('completed'),
            todo.get('title')
        ))

Script 2 — Export to JSON

# 0x15-api/2-export_to_JSON.py

user_data = list(map(
    lambda x: {
        "task":      x.get("title"),
        "completed": x.get("completed"),
        "username":  user_name
    },
    todos
))
with open("{}.json".format(id), 'w') as json_file:
    json.dump({str(id): user_data}, json_file)

Script 3 — All users, all TODOs, one JSON file

# 0x15-api/3-dictionary_of_list_of_dictionaries.py

users_res = requests.get('{}/users'.format(API)).json()
todos_res = requests.get('{}/todos'.format(API)).json()

users_data = {}
for user in users_res:
    id        = user.get('id')
    user_name = user.get('username')
    todos     = list(filter(lambda x: x.get('userId') == id, todos_res))
    users_data[str(id)] = list(map(
        lambda x: {
            'username':  user_name,
            'task':      x.get('title'),
            'completed': x.get('completed')
        },
        todos
    ))

with open('todo_all_employees.json', 'w') as file:
    json.dump(users_data, file)

Reddit API — Recursive Pagination

The advanced API module consumes the Reddit API. All three scripts use allow_redirects=False — Reddit silently redirects invalid subreddits to its search page, which would return a 200 with garbage data. Checking the status code on the original request catches invalid subreddits correctly.

# 0x16-api_advanced/0-subs.py

def number_of_subscribers(subreddit):
    sub_info = requests.get(
        "https://www.reddit.com/r/{}/about.json".format(subreddit),
        headers={"User-Agent": "My-User-Agent"},
        allow_redirects=False  # catch invalid subreddits — don't follow to search
    )
    if sub_info.status_code >= 300:
        return 0
    return sub_info.json().get("data").get("subscribers")

The recursive hot-post fetcher uses Reddit's after cursor for pagination. Each call appends to hot_list and recurses until after is null:

# 0x16-api_advanced/2-recurse.py

def recurse(subreddit, hot_list=[], count=0, after=None):
    sub_info = requests.get(
        "https://www.reddit.com/r/{}/hot.json".format(subreddit),
        params={"count": count, "after": after},
        headers={"User-Agent": "My-User-Agent"},
        allow_redirects=False
    )
    if sub_info.status_code >= 300:   # a 302 redirect means an invalid subreddit
        return None

    hot_l = hot_list + [child.get("data").get("title")
                        for child in sub_info.json().get("data").get("children")]

    info = sub_info.json()
    if not info.get("data").get("after"):
        return hot_l   # base case: no more pages

    return recurse(subreddit, hot_l,
                   info.get("data").get("count"),
                   info.get("data").get("after"))

The word counter extends the recursive pattern by accumulating keyword frequencies across all pages, then printing sorted results only when pagination is exhausted:

# 0x16-api_advanced/100-count.py (abbreviated)

def count_words(subreddit, word_list, word_count={}, after=None):
    # ... fetch page ...

    for title in hot_l:
        split_words = title.split(' ')
        for word in word_list:
            for s_word in split_words:
                if s_word.lower() == word.lower():
                    word_count[word] = word_count.get(word, 0) + 1

    if not info.get("data").get("after"):
        # Only print when all pages are consumed
        sorted_counts = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
        for k, v in sorted_counts:
            if v != 0:
                print('{}: {}'.format(k, v))
    else:
        return count_words(subreddit, word_list, word_count,
                           info.get("data").get("after"))

Postmortem

The postmortem report documents a real MongoDB connection pool exhaustion incident — a structured incident write-up following the format used in production engineering.

The timeline established that the database started auto-terminating connections at 10:50 AM due to a sudden traffic spike. The root cause was the connection pool being undersized for burst traffic. Resolution involved tuning the pool settings and adding monitoring — two actions that address different parts of the problem: the immediate fix and the detection gap that let the incident run for 40 minutes before resolution.

Corrective measures from the report:

  • Real-time monitoring of connection pool metrics to detect saturation before it becomes an outage
  • Automatic scaling to handle traffic spikes without manual intervention
  • Load testing to establish capacity limits before they're discovered in production
  • Query optimization to reduce the number of concurrent connections required per unit of throughput