VM Crash Investigation Report - posterg.erg.be

Date: 2026-03-26
Investigator: Automated Investigation (Limited Access)
Server: theophile@posterg.erg.be:3274

Executive Summary

The VM became unresponsive and required a hard reboot on March 24, 2026 at ~12:56 UTC. The investigation was limited by the lack of root/adm group access to critical system logs: initial findings show no obvious application-level issues, but a complete root-cause analysis requires reading those logs.


Timeline

Confirmed Events

  • Last known activity (previous boot): March 2, 2026 15:38:59 UTC
  • Gap period: March 2-24 (22 days) - NO BOOT LOGS AVAILABLE
  • System reboot: March 24, 2026 12:56-12:57 UTC
  • Current uptime: 2 days, 1 hour (as of investigation time)
  • Current system state: Stable, all services running normally

Critical Unknown

There is a 22-day gap in boot records between March 2 and March 24. This could indicate:

  1. System was running continuously during this period and crashed on/before March 24
  2. Multiple unrecorded reboots occurred
  3. Journal corruption or rotation issues
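
Once root access is available, the gap itself can be examined directly. These are standard journalctl/journald checks (not yet run on this host):

# List every boot the journal still records - shows whether the Mar 2-24 boots were lost to rotation
sudo journalctl --list-boots --no-pager

# Journal disk usage and the retention settings that govern rotation
sudo journalctl --disk-usage
grep -E "SystemMaxUse|SystemMaxFileSize|MaxRetentionSec" /etc/systemd/journald.conf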

What I Could Access (Non-Root Investigation)

Successfully Checked

1. Current System Health

Memory: 7.8GB total, 464MB used, 5.9GB free (HEALTHY)
Disk: 30GB, 3.2GB used (12% usage - HEALTHY)
Swap: 976MB, 0B used (not being used)
Load Average: 0.00, 0.00, 0.00 (IDLE)
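
These figures come from standard unprivileged tools (free, df, uptime, ps) and can be re-captured at any time to compare against the state before the next incident, for example:

# Re-capture the same health snapshot without root
free -h && df -h / && uptime
ps -eo pid,rss,etime,comm --sort=-rss | head -15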

2. Running Services

  • nginx: 4 worker processes running normally
  • php-fpm: Master + 2 workers (PHP 8.4)
  • mariadb: Running (155MB RSS)
  • All services appear healthy with normal memory usage

3. Nginx Configuration Analysis

Location: /etc/nginx/sites-available/posterg

Security Measures Found:

  • Rate limiting configured:
    • General requests: 30 req/min
    • Search endpoint: 30 req/min (burst=10)
    • Admin: 60 req/min (burst=20)
  • Client max body size: 100MB
  • Timeouts: 120 seconds (read/send)
  • HTTP Basic Auth on /admin/ directory
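
For reference, the directives behind these limits probably resemble the sketch below; the zone names, zone sizes, and location paths are assumptions reconstructed from the observed values, not a copy of the live config:

# Hypothetical reconstruction of the observed limits (zone names and sizes are assumptions)
limit_req_zone $binary_remote_addr zone=general:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=search:10m rate=30r/m;
limit_req_zone $binary_remote_addr zone=admin:10m rate=60r/m;

# Inside the server block
client_max_body_size 100M;
fastcgi_read_timeout 120;
fastcgi_send_timeout 120;

# Per-location enforcement (exact paths are assumptions)
location /search { limit_req zone=search burst=10; }
location /admin/ { limit_req zone=admin burst=20; }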

Potential Issues:

  • ⚠️ Rate limits are per-IP and relatively permissive (30 req/min per client; many clients in parallel could still drive heavy resource consumption)
  • ⚠️ Large upload size (100MB) combined with multiple concurrent uploads could exhaust memory
  • ⚠️ 120-second timeouts on PHP processing could lead to worker process accumulation

4. Application Architecture

Type: PHP-based thesis repository
Database: SQLite (located in /var/www/posterg/storage/posterg.db)
Framework: Custom PHP with:

  • Database.php (SQLite handler)
  • AdminAuth.php (authentication)
  • RateLimit.php (custom rate limiting)
  • Parsedown.php (markdown parser - 52KB, could be memory-intensive)

Endpoints:

  • Public: index, search, thesis view (tfe.php), media, licenses
  • Admin: CRUD operations, import, logs, maintenance mode
  • File uploads: Media files and thesis PDFs
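
Since several of the hypotheses below involve large payloads, the on-disk size of the database and uploads is worth recording; a quick unprivileged check (only storage/posterg.db is confirmed, the rest of the layout is an assumption):

# Database and application sizes (may return nothing if the app directories are not world-readable)
du -sh /var/www/posterg/storage/posterg.db 2>/dev/null
du -sh /var/www/posterg/ 2>/dev/null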

5. Log File Status

Nginx Access Logs:

  • Current: posterg_access.log (133KB)
  • Last rotation: March 25, 2026 15:47

Nginx Error Logs:

  • Current: posterg_error.log (234KB) ⚠️ LARGE SIZE
  • Previous: posterg_error.log.1 (732B - from Mar 25)

Critical: The current error log has reached 234KB in the ~1 day since the March 25 rotation, while the previous rotated log holds only 732B. This points to a recent burst of error activity.
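
File sizes and modification times remain readable without root even when the contents are not, so the error log's growth can be tracked while waiting for access (assumes GNU coreutils stat):

# Size and last-write time of the error logs (metadata only, no content access needed)
stat -c '%n: %s bytes, modified %y' /var/log/nginx/posterg_error.log*

# Re-run periodically to see whether the log is still growing
watch -n 60 'ls -l /var/log/nginx/posterg_error.log'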


What I CANNOT Access (Requires Root/Sudo)

🔒 Blocked Investigations

1. Nginx Error Logs

Permission denied: /var/log/nginx/posterg_error.log

WHY CRITICAL: This 234KB error log likely contains the root cause; the previous rotated error log held only 732B, so this volume is far above normal.

Commands to run (as root):

# View recent errors before crash
sudo tail -1000 /var/log/nginx/posterg_error.log

# Check for PHP-FPM errors, memory exhaustion, timeouts
sudo grep -E "memory|exhausted|timeout|fatal|error" /var/log/nginx/posterg_error.log

# Look for patterns (repeated errors from specific IP/endpoint)
sudo awk '{print $1}' /var/log/nginx/posterg_error.log | sort | uniq -c | sort -rn | head -20

2. System Journal Logs

journalctl: Users in groups 'adm', 'systemd-journal' can see all messages

WHY CRITICAL: Contains kernel messages, OOM killer events, service crashes, and the exact crash reason.

Commands to run (as root):

# Check last boot messages for crash indicators
sudo journalctl -b -1 --no-pager | grep -E "Out of memory|OOM|killed|panic|segfault"

# View kernel messages around crash time
sudo journalctl -k -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00"

# Check for PHP-FPM/nginx crashes
sudo journalctl -u php8.4-fpm -b -1 --since "2026-03-24 11:00"
sudo journalctl -u nginx -b -1 --since "2026-03-24 11:00"

# Look for repeated service restarts
sudo journalctl -b -1 | grep -E "Started|Stopped|Failed" | tail -100

3. Kernel Messages (dmesg)

dmesg: Operation not permitted

WHY CRITICAL: Shows hardware errors, OOM kills, kernel panics, disk issues.

Commands to run (as root):

# Check for OOM killer activity
sudo dmesg -T | grep -i "out of memory"

# Check for hardware/disk errors
sudo dmesg -T | grep -i "error\|fail\|critical"

# Review last 200 kernel messages
sudo dmesg -T | tail -200

4. PHP-FPM Logs

Permission denied: /var/log/php8.4-fpm.log

WHY CRITICAL: Shows PHP memory exhaustion, fatal errors, slow requests.

Commands to run (as root):

# Check for PHP memory errors
sudo grep -E "memory|fatal|error|segfault" /var/log/php8.4-fpm.log*

# Look for slow request logs
sudo find /var/log -name "*php*slow*" -exec cat {} \;

5. System Logs Archive

Location: /var/log/journal/9a57a2432f96427a80e97d1d269e6a58/
Contains binary journal files from previous boots, but they are not readable without root.
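
Once root access is available, the archive itself should be checked, both to read the pre-crash boots and to test the journal-corruption possibility raised in the timeline:

# Verify archived journal files for corruption (relevant to the 22-day boot-log gap)
sudo journalctl --verify

# Read a specific archived file directly if the default view misses it
# (system.journal is the usual filename; the exact name on this host is an assumption)
sudo journalctl --file=/var/log/journal/9a57a2432f96427a80e97d1d269e6a58/system.journal --no-pager | tail -100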


Hypotheses (Ranked by Likelihood)

1. 🔥 Memory Exhaustion / OOM Killer (HIGH PROBABILITY)

Evidence:

  • Large 100MB upload limit
  • Multiple PHP-FPM workers could accumulate
  • 234KB error log suggests many errors occurred
  • System became completely unresponsive (classic OOM symptom)

Attack Vectors:

  • Multiple concurrent large file uploads (thesis PDFs)
  • Search endpoint abuse despite rate limiting
  • SQLite database operations on large datasets
  • Parsedown.php processing large markdown files

How to Confirm:

# Check for OOM killer evidence
sudo journalctl -b -1 | grep -i "oom"
sudo dmesg -T | grep -i "killed process"
sudo grep "Out of memory" /var/log/syslog* 2>/dev/null

2. ⚠️ PHP-FPM Process Accumulation (MEDIUM PROBABILITY)

Evidence:

  • 120-second timeout allows long-running requests
  • Slow SQLite queries could pile up
  • If workers get stuck, new connections queue

How to Confirm:

# Check PHP-FPM configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -E "pm\.max_children|pm\.start_servers|pm\.min_spare|pm\.max_spare"

# Review PHP-FPM slow log
sudo cat /var/log/php8.4-fpm-slow.log 2>/dev/null

3. Database Lock Contention (MEDIUM PROBABILITY)

Evidence:

  • SQLite with multiple concurrent writers
  • Admin import operations + public searches simultaneously
  • SQLite has limited concurrency (write locks entire database)

How to Confirm:

# Check error logs for "database is locked" messages
sudo grep -i "database.*lock" /var/log/nginx/posterg_error.log

# Check SQLite journal files (abandoned transactions)
ls -la /var/www/posterg/storage/*.db-journal 2>/dev/null
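
It is also worth checking which journal mode the database uses: in the default rollback-journal mode a write locks out readers, whereas WAL mode lets readers proceed during a write. The pragma below is read-only; sudo is shown on the assumption the file is owned by www-data:

# "wal" means concurrent readers are allowed during writes; "delete" is the locking default
sudo sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA journal_mode;"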

4. 🌐 Brute Force / DDoS Attack (LOW-MEDIUM PROBABILITY)

Evidence:

  • Rate limiting exists but is permissive (30 req/min = 1 every 2 seconds)
  • Admin panel with HTTP Basic Auth (target for brute force)
  • Public search endpoint

How to Confirm:

# Check for attack patterns in access logs
sudo zcat /var/log/nginx/posterg_access.log*.gz | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# Look for 401/403 patterns (brute force attempts)
sudo grep -E " (401|403) " /var/log/nginx/posterg_access.log* | \
  awk '{print $1}' | sort | uniq -c | sort -rn

# Check for high request rates
sudo awk '{print $4}' /var/log/nginx/posterg_access.log | cut -d: -f1-2 | \
  uniq -c | sort -rn | head -20

5. 🐛 Application Bug (LOW PROBABILITY)

Evidence:

  • Database.php recently updated (Mar 24 14:49)
  • 234KB error log indicates errors occurred

How to Confirm:

# Review nginx errors for PHP fatal errors
sudo grep "PHP Fatal" /var/log/nginx/posterg_error.log

# Check for infinite loops or memory leaks
sudo grep -E "Maximum execution time|memory limit" /var/log/nginx/posterg_error.log

Recommended Investigation Plan (Requires Root Access)

Phase 1: Immediate Analysis (5 minutes)

# 1. Check the smoking gun - nginx error log
sudo tail -500 /var/log/nginx/posterg_error.log | less

# 2. Look for OOM killer
sudo journalctl -b -1 | grep -i "oom\|killed" | tail -50

# 3. Check journal around crash time
sudo journalctl -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00" | less

Phase 2: Deeper Analysis (15 minutes)

# 4. Export last boot journal to file for analysis
sudo journalctl -b -1 --no-pager > /tmp/last_boot_journal.log
# (the redirection runs in the invoking shell, so the file is already owned by theophile)

# 5. Check PHP-FPM errors
sudo cat /var/log/php8.4-fpm.log* | grep -E "NOTICE|WARNING|ERROR"

# 6. Analyze access patterns before crash (zcat -f also passes through uncompressed rotated logs)
sudo zcat -f /var/log/nginx/posterg_access.log* 2>/dev/null | \
  awk '$4 >= "[24/Mar/2026:11:00:" && $4 <= "[24/Mar/2026:13:00:"' | \
  awk '{print $1}' | sort | uniq -c | sort -rn > /tmp/crash_access_analysis.txt

# 7. Check for database corruption
sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA integrity_check;"

Phase 3: System Health Check (10 minutes)

# 8. Review PHP-FPM pool configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -v "^;" | grep -v "^$"

# 9. Check system resource limits
ulimit -a

# 10. Review systemd service limits (MemoryMax is the cgroup v2 name for the memory cap)
systemctl show php8.4-fpm | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit|MemoryMax"
systemctl show nginx | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit|MemoryMax"

Preventive Measures to Implement

Immediate (Before Next Investigation)

  1. Add user to adm group for log access:

    sudo usermod -aG adm theophile
    sudo usermod -aG systemd-journal theophile
    # Group changes take effect on the next login (or immediately via `newgrp adm`)
    
  2. Enable detailed error logging (temporarily):

    # In /etc/nginx/sites-available/posterg
    error_log /var/log/nginx/posterg_error.log debug;
    sudo systemctl reload nginx
    
  3. Enable PHP-FPM slow log:

    # In /etc/php/8.4/fpm/pool.d/www.conf
    slowlog = /var/log/php8.4-fpm-slow.log
    request_slowlog_timeout = 10s
    sudo systemctl restart php8.4-fpm
    

Short-term (This Week)

  1. Tighten rate limits in nginx config:

    limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m;  # Was 30r/m
    limit_req_zone $binary_remote_addr zone=search:10m rate=5r/m;    # Was 30r/m
    
  2. Add connection limits:

    limit_conn_zone $binary_remote_addr zone=addr:10m;
    limit_conn addr 10;  # Max 10 concurrent connections per IP
    
  3. Reduce PHP-FPM timeout:

    fastcgi_read_timeout 60;  # Was 120
    
  4. Monitor memory usage:

    # Add to root's crontab (writing under /var/log needs root)
    */5 * * * * free -m >> /var/log/memory-monitor.log
    

Long-term (This Month)

  1. Migrate from SQLite to PostgreSQL/MySQL for better concurrency
  2. Implement application-level logging (not just nginx/PHP-FPM)
  3. Add monitoring: Prometheus + Grafana or similar
  4. Configure log rotation more aggressively (see the sketch after this list)
  5. Set up automated alerts for high memory/CPU usage
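
For item 4, a sketch of tighter rotation values; on Debian-like systems the stock /etc/logrotate.d/nginx already matches /var/log/nginx/*.log, so these values would be edited there rather than duplicated in a new file (daily rotation and 14 kept archives are illustrative choices, not measured requirements):

/var/log/nginx/*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Ask nginx to reopen its log files after rotation
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}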

Files to Review (When Root Access Available)

Priority 1 (Most Likely to Show Cause)

  • /var/log/nginx/posterg_error.log (234KB - abnormally large)
  • Journal logs for boot -1: journalctl -b -1
  • Kernel messages: dmesg -T

Priority 2 (Supporting Evidence)

  • /var/log/php8.4-fpm.log*
  • /var/log/nginx/posterg_access.log* (attack pattern analysis)
  • Systemd service logs: journalctl -u php8.4-fpm -b -1, journalctl -u nginx -b -1

Priority 3 (Configuration Review)

  • /etc/php/8.4/fpm/pool.d/www.conf (worker limits, timeouts)
  • /etc/security/limits.conf (system resource limits)
  • /etc/systemd/system/php8.4-fpm.service.d/ (service overrides)

Questions to Answer

  1. What filled the 234KB error log? (Compare to normal ~1KB size)
  2. Was there an OOM killer event? (Check journalctl and dmesg)
  3. What happened between March 2-24? (22-day boot gap is suspicious)
  4. Were there repeated service crashes/restarts? (Check systemd journals)
  5. What was the last request before the crash? (Check nginx access logs)
  6. Is there evidence of an attack? (IP analysis, rate limit hits)

Next Steps

For theophile (with sudo access):

  1. Run Phase 1 commands immediately
  2. Export journal logs to /tmp/ for detailed review
  3. Review nginx error log and identify patterns
  4. Share findings from logs to determine if application is at fault
  5. Implement immediate preventive measures (user to adm group, slow logging)

For automated monitoring (recommended):

  • Set up fail2ban for admin panel protection (see the jail sketch after this list)
  • Configure monit or similar for service health checks
  • Enable automatic log forwarding to external system (prevent data loss on crash)
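
A minimal fail2ban jail for the admin panel could look like the sketch below. The nginx-http-auth filter ships with stock fail2ban and matches failed Basic Auth attempts in the nginx error log; the log path, retry count, and ban times here are assumptions to tune:

# Hypothetical /etc/fail2ban/jail.d/posterg.local
[nginx-http-auth]
enabled  = true
port     = http,https
filter   = nginx-http-auth
logpath  = /var/log/nginx/posterg_error.log
maxretry = 5
findtime = 10m
bantime  = 1h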

Investigation Status: ⏸️ PAUSED - Awaiting root access to critical logs
Risk Level: 🔴 HIGH - Cause unknown, could recur anytime
Recommended Priority: 🚨 URGENT - Next crash could cause data loss