VM Crash Investigation Report - posterg.erg.be

Date: 2026-03-26
Investigator: Automated Investigation (Limited Access)
Server: theophile@posterg.erg.be:3274

Executive Summary

The VM became unresponsive and required a hard reboot on March 24, 2026 at ~12:56 UTC. The investigation was limited by the lack of root/adm group access to critical system logs: initial findings show no obvious application-level issues, but a complete root-cause analysis requires reading those logs.


Timeline

Confirmed Events

  • Last known activity (previous boot): March 2, 2026 15:38:59 UTC
  • Gap period: March 2-24 (22 days) - NO BOOT LOGS AVAILABLE
  • System reboot: March 24, 2026 12:56-12:57 UTC
  • Current uptime: 2 days, 1 hour (as of investigation time)
  • Current system state: Stable, all services running normally

Critical Unknown

There is a 22-day gap in boot records between March 2 and March 24. This could indicate:

  1. System was running continuously during this period and crashed on/before March 24
  2. Multiple unrecorded reboots occurred
  3. Journal corruption or rotation issues
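
Once root access is available, the gap itself can be examined directly. These are standard journalctl/journald checks (not yet run on this host):

# List every boot the journal still records - shows whether the Mar 2-24 boots were lost to rotation
sudo journalctl --list-boots --no-pager

# Journal disk usage and the retention settings that govern rotation
sudo journalctl --disk-usage
grep -E "SystemMaxUse|SystemMaxFileSize|MaxRetentionSec" /etc/systemd/journald.conf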

What I Could Access (Non-Root Investigation)

Successfully Checked

1. Current System Health

Memory: 7.8GB total, 464MB used, 5.9GB free (HEALTHY)
Disk: 30GB, 3.2GB used (12% usage - HEALTHY)
Swap: 976MB, 0B used (not being used)
Load Average: 0.00, 0.00, 0.00 (IDLE)
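
These figures come from standard unprivileged tools (free, df, uptime, ps) and can be re-captured at any time to compare against the state before the next incident, for example:

# Re-capture the same health snapshot without root
free -h && df -h / && uptime
ps -eo pid,rss,etime,comm --sort=-rss | head -15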

2. Running Services

  • nginx: 4 worker processes running normally
  • php-fpm: Master + 2 workers (PHP 8.4)
  • mariadb: Running (155MB RSS)
  • All services appear healthy with normal memory usage

3. Nginx Configuration Analysis

Location: /etc/nginx/sites-available/posterg

Security Measures Found:

  • Rate limiting configured:
    • General requests: 30 req/min
    • Search endpoint: 30 req/min (burst=10)
    • Admin: 60 req/min (burst=20)
  • Client max body size: 100MB
  • Timeouts: 120 seconds (read/send)
  • HTTP Basic Auth on /admin/ directory
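
For reference, the directives behind these limits probably resemble the sketch below; the zone names, zone sizes, and location paths are assumptions reconstructed from the observed values, not a copy of the live config:

# Hypothetical reconstruction of the observed limits (zone names and sizes are assumptions)
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=search:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=admin:10m rate=60r/m;

# Inside the server block
client_max_body_size 100M;
fastcgi_read_timeout 120;
fastcgi_send_timeout 120;

# Per-location enforcement (exact paths are assumptions)
location /search { limit_req zone=search burst=10; }
location /admin/ { limit_req zone=admin burst=20; }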

Potential Issues:

  • ⚠️ Rate limits are per-IP and relatively permissive (30 req/min per client; many clients in parallel could still drive heavy resource consumption)
  • ⚠️ Large upload size (100MB) combined with multiple concurrent uploads could exhaust memory
  • ⚠️ 120-second timeouts on PHP processing could lead to worker process accumulation

4. Application Architecture

Type: PHP-based thesis repository
Database: SQLite (located in /var/www/posterg/storage/posterg.db)
Framework: Custom PHP with:

  • Database.php (SQLite handler)
  • AdminAuth.php (authentication)
  • RateLimit.php (custom rate limiting)
  • Parsedown.php (markdown parser - 52KB, could be memory-intensive)

Endpoints:

  • Public: index, search, thesis view (tfe.php), media, licenses
  • Admin: CRUD operations, import, logs, maintenance mode
  • File uploads: Media files and thesis PDFs
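
Since several of the hypotheses below involve large payloads, the on-disk size of the database and uploads is worth recording; a quick unprivileged check (only storage/posterg.db is confirmed, the rest of the layout is an assumption):

# Database and application sizes (may return nothing if the app directories are not world-readable)
du -sh /var/www/posterg/storage/posterg.db 2>/dev/null
du -sh /var/www/posterg/ 2>/dev/null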

5. Log File Status

Nginx Access Logs:

  • Current: posterg_access.log (133KB)
  • Last rotation: March 25, 2026 15:47

Nginx Error Logs:

  • Current: posterg_error.log (234KB) ⚠️ LARGE SIZE
  • Previous: posterg_error.log.1 (732B - from Mar 25)

Critical: The current error log has reached 234KB in the ~1 day since the March 25 rotation, while the previous rotated log holds only 732B. This points to a recent burst of error activity.
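
File sizes and modification times remain readable without root even when the contents are not, so the error log's growth can be tracked while waiting for access (assumes GNU coreutils stat):

# Size and last-write time of the error logs (metadata only, no content access needed)
stat -c '%n: %s bytes, modified %y' /var/log/nginx/posterg_error.log*

# Re-run periodically to see whether the log is still growing
watch -n 60 'ls -l /var/log/nginx/posterg_error.log'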


What I CANNOT Access (Requires Root/Sudo)

🔒 Blocked Investigations

1. Nginx Error Logs

Permission denied: /var/log/nginx/posterg_error.log

WHY CRITICAL: This 234KB error log likely contains the root cause; the previous rotated error log held only 732B, so this volume is far above normal.

Commands to run (as root):

# View recent errors before crash
sudo tail -1000 /var/log/nginx/posterg_error.log

# Check for PHP-FPM errors, memory exhaustion, timeouts
sudo grep -E "memory|exhausted|timeout|fatal|error" /var/log/nginx/posterg_error.log

# Look for patterns (repeated errors from specific IP/endpoint)
sudo awk '{print $1}' /var/log/nginx/posterg_error.log | sort | uniq -c | sort -rn | head -20

2. System Journal Logs

journalctl: Users in groups 'adm', 'systemd-journal' can see all messages

WHY CRITICAL: Contains kernel messages, OOM killer events, service crashes, and the exact crash reason.

Commands to run (as root):

# Check last boot messages for crash indicators
sudo journalctl -b -1 --no-pager | grep -E "Out of memory|OOM|killed|panic|segfault"

# View kernel messages around crash time
sudo journalctl -k -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00"

# Check for PHP-FPM/nginx crashes
sudo journalctl -u php8.4-fpm -b -1 --since "2026-03-24 11:00"
sudo journalctl -u nginx -b -1 --since "2026-03-24 11:00"

# Look for repeated service restarts
sudo journalctl -b -1 | grep -E "Started|Stopped|Failed" | tail -100

3. Kernel Messages (dmesg)

dmesg: Operation not permitted

WHY CRITICAL: Shows hardware errors, OOM kills, kernel panics, disk issues.

Commands to run (as root):

# Check for OOM killer activity
sudo dmesg -T | grep -i "out of memory"

# Check for hardware/disk errors
sudo dmesg -T | grep -i "error\|fail\|critical"

# Review last 200 kernel messages
sudo dmesg -T | tail -200

4. PHP-FPM Logs

Permission denied: /var/log/php8.4-fpm.log

WHY CRITICAL: Shows PHP memory exhaustion, fatal errors, slow requests.

Commands to run (as root):

# Check for PHP memory errors
sudo grep -E "memory|fatal|error|segfault" /var/log/php8.4-fpm.log*

# Look for slow request logs
sudo find /var/log -name "*php*slow*" -exec cat {} \;

5. System Logs Archive

Location: /var/log/journal/9a57a2432f96427a80e97d1d269e6a58/
Contains binary journal files from previous boots, but they are not readable without root.
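
Once root access is available, the archive itself should be checked, both to read the pre-crash boots and to test the journal-corruption possibility raised in the timeline:

# Verify archived journal files for corruption (relevant to the 22-day boot-log gap)
sudo journalctl --verify

# Read a specific archived file directly if the default view misses it
# (system.journal is the usual filename; the exact name on this host is an assumption)
sudo journalctl --file=/var/log/journal/9a57a2432f96427a80e97d1d269e6a58/system.journal --no-pager | tail -100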


Hypotheses (Ranked by Likelihood)

1. 🔥 Memory Exhaustion / OOM Killer (HIGH PROBABILITY)

Evidence:

  • Large 100MB upload limit
  • Multiple PHP-FPM workers could accumulate
  • 234KB error log suggests many errors occurred
  • System became completely unresponsive (classic OOM symptom)

Attack Vectors:

  • Multiple concurrent large file uploads (thesis PDFs)
  • Search endpoint abuse despite rate limiting
  • SQLite database operations on large datasets
  • Parsedown.php processing large markdown files

How to Confirm:

# Check for OOM killer evidence
sudo journalctl -b -1 | grep -i "oom"
sudo dmesg -T | grep -i "killed process"
sudo grep "Out of memory" /var/log/syslog* 2>/dev/null

2. ⚠️ PHP-FPM Process Accumulation (MEDIUM PROBABILITY)

Evidence:

  • 120-second timeout allows long-running requests
  • Slow SQLite queries could pile up
  • If workers get stuck, new connections queue

How to Confirm:

# Check PHP-FPM configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -E "pm\.max_children|pm\.start_servers|pm\.min_spare|pm\.max_spare"

# Review PHP-FPM slow log
sudo cat /var/log/php8.4-fpm-slow.log 2>/dev/null

3. Database Lock Contention (MEDIUM PROBABILITY)

Evidence:

  • SQLite with multiple concurrent writers
  • Admin import operations + public searches simultaneously
  • SQLite has limited concurrency (write locks entire database)

How to Confirm:

# Check error logs for "database is locked" messages
sudo grep -i "database.*lock" /var/log/nginx/posterg_error.log

# Check SQLite journal files (abandoned transactions)
ls -la /var/www/posterg/storage/*.db-journal 2>/dev/null
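
It is also worth checking which journal mode the database uses: in the default rollback-journal mode a write locks out readers, whereas WAL mode lets readers proceed during a write. The pragma below is read-only; sudo is shown on the assumption the file is owned by www-data:

# "wal" means concurrent readers are allowed during writes; "delete" is the locking default
sudo sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA journal_mode;"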

4. 🌐 Brute Force / DDoS Attack (LOW-MEDIUM PROBABILITY)

Evidence:

  • Rate limiting exists but is permissive (30 req/min = 1 every 2 seconds)
  • Admin panel with HTTP Basic Auth (target for brute force)
  • Public search endpoint

How to Confirm:

# Check for attack patterns in access logs
sudo zcat /var/log/nginx/posterg_access.log*.gz | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# Look for 401/403 patterns (brute force attempts)
sudo grep -E " (401|403) " /var/log/nginx/posterg_access.log* | \
  awk '{print $1}' | sort | uniq -c | sort -rn

# Check for high request rates
sudo awk '{print $4}' /var/log/nginx/posterg_access.log | cut -d: -f1-2 | \
  uniq -c | sort -rn | head -20

5. 🐛 Application Bug (LOW PROBABILITY)

Evidence:

  • Database.php recently updated (Mar 24 14:49)
  • 234KB error log indicates errors occurred

How to Confirm:

# Review nginx errors for PHP fatal errors
sudo grep "PHP Fatal" /var/log/nginx/posterg_error.log

# Check for infinite loops or memory leaks
sudo grep -E "Maximum execution time|memory limit" /var/log/nginx/posterg_error.log

Recommended Investigation Plan (Requires Root Access)

Phase 1: Immediate Analysis (5 minutes)

# 1. Check the smoking gun - nginx error log
sudo tail -500 /var/log/nginx/posterg_error.log | less

# 2. Look for OOM killer
sudo journalctl -b -1 | grep -i "oom\|killed" | tail -50

# 3. Check journal around crash time
sudo journalctl -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00" | less

Phase 2: Deeper Analysis (15 minutes)

# 4. Export last boot journal to file for analysis
sudo journalctl -b -1 --no-pager > /tmp/last_boot_journal.log
# (the redirection runs in the invoking shell, so the file is already owned by theophile)

# 5. Check PHP-FPM errors
sudo cat /var/log/php8.4-fpm.log* | grep -E "NOTICE|WARNING|ERROR"

# 6. Analyze access patterns before crash (zcat -f also passes through uncompressed rotated logs)
sudo zcat -f /var/log/nginx/posterg_access.log* 2>/dev/null | \
  awk '$4 >= "[24/Mar/2026:11:00:" && $4 <= "[24/Mar/2026:13:00:"' | \
  awk '{print $1}' | sort | uniq -c | sort -rn > /tmp/crash_access_analysis.txt

# 7. Check for database corruption
sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA integrity_check;"

Phase 3: System Health Check (10 minutes)

# 8. Review PHP-FPM pool configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -v "^;" | grep -v "^$"

# 9. Check system resource limits
ulimit -a

# 10. Review systemd service limits (MemoryMax is the cgroup v2 name for the memory cap)
systemctl show php8.4-fpm | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit|MemoryMax"
systemctl show nginx | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit|MemoryMax"

Preventive Measures to Implement

Immediate (Before Next Investigation)

  1. Add user to adm group for log access:

    sudo usermod -aG adm theophile
    sudo usermod -aG systemd-journal theophile
    # Group changes take effect on the next login (or immediately via `newgrp adm`)
    
  2. Enable detailed error logging (temporarily):

    # In /etc/nginx/sites-available/posterg
    error_log /var/log/nginx/posterg_error.log debug;
    sudo systemctl reload nginx
    
  3. Enable PHP-FPM slow log:

    # In /etc/php/8.4/fpm/pool.d/www.conf
    slowlog = /var/log/php8.4-fpm-slow.log
    request_slowlog_timeout = 10s
    sudo systemctl restart php8.4-fpm
    

Short-term (This Week)

  1. Tighten rate limits in nginx config:

    limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m;  # Was 30r/m
    limit_req_zone $binary_remote_addr zone=search:10m rate=5r/m;    # Was 30r/m
    
  2. Add connection limits:

    limit_conn_zone $binary_remote_addr zone=addr:10m;
    limit_conn addr 10;  # Max 10 concurrent connections per IP
    
  3. Reduce PHP-FPM timeout:

    fastcgi_read_timeout 60;  # Was 120
    
  4. Monitor memory usage:

    # Add to root's crontab (writing under /var/log needs root)
    */5 * * * * free -m >> /var/log/memory-monitor.log
    

Long-term (This Month)

  1. Migrate from SQLite to PostgreSQL/MySQL for better concurrency
  2. Implement application-level logging (not just nginx/PHP-FPM)
  3. Add monitoring: Prometheus + Grafana or similar
  4. Configure log rotation more aggressively (see the sketch after this list)
  5. Set up automated alerts for high memory/CPU usage
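
For item 4, a sketch of tighter rotation values; on Debian-like systems the stock /etc/logrotate.d/nginx already matches /var/log/nginx/*.log, so these values would be edited there rather than duplicated in a new file (daily rotation and 14 kept archives are illustrative choices, not measured requirements):

/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Ask nginx to reopen its log files after rotation
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}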

Files to Review (When Root Access Available)

Priority 1 (Most Likely to Show Cause)

  • /var/log/nginx/posterg_error.log (234KB - abnormally large)
  • Journal logs for boot -1: journalctl -b -1
  • Kernel messages: dmesg -T

Priority 2 (Supporting Evidence)

  • /var/log/php8.4-fpm.log*
  • /var/log/nginx/posterg_access.log* (attack pattern analysis)
  • Systemd service logs: journalctl -u php8.4-fpm -b -1, journalctl -u nginx -b -1

Priority 3 (Configuration Review)

  • /etc/php/8.4/fpm/pool.d/www.conf (worker limits, timeouts)
  • /etc/security/limits.conf (system resource limits)
  • /etc/systemd/system/php8.4-fpm.service.d/ (service overrides)

Questions to Answer

  1. What filled the 234KB error log? (Compare to normal ~1KB size)
  2. Was there an OOM killer event? (Check journalctl and dmesg)
  3. What happened between March 2-24? (22-day boot gap is suspicious)
  4. Were there repeated service crashes/restarts? (Check systemd journals)
  5. What was the last request before the crash? (Check nginx access logs)
  6. Is there evidence of an attack? (IP analysis, rate limit hits)

Next Steps

For theophile (with sudo access):

  1. Run Phase 1 commands immediately
  2. Export journal logs to /tmp/ for detailed review
  3. Review nginx error log and identify patterns
  4. Share findings from logs to determine if application is at fault
  5. Implement immediate preventive measures (user to adm group, slow logging)

For automated monitoring (recommended):

  • Set up fail2ban for admin panel protection (see the jail sketch after this list)
  • Configure monit or similar for service health checks
  • Enable automatic log forwarding to external system (prevent data loss on crash)
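
A minimal fail2ban jail for the admin panel could look like the sketch below. The nginx-http-auth filter ships with stock fail2ban and matches failed Basic Auth attempts in the nginx error log; the log path, retry count, and ban times here are assumptions to tune:

# Hypothetical /etc/fail2ban/jail.d/posterg.local
[nginx-http-auth]
enabled  = true
port     = http,https
filter   = nginx-http-auth
logpath  = /var/log/nginx/posterg_error.log
maxretry = 5
findtime = 10m
bantime  = 1h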

Investigation Status: ⏸️ PAUSED - Awaiting root access to critical logs
Risk Level: 🔴 HIGH - Cause unknown, could recur anytime
Recommended Priority: 🚨 URGENT - Next crash could cause data loss