VM Crash Investigation Report - posterg.erg.be
Date: 2026-03-26
Investigator: Automated Investigation (Limited Access)
Server: theophile@posterg.erg.be:3274
Executive Summary
The VM became unresponsive and required a hard reboot on March 24, 2026 at ~12:56 UTC. The investigation was limited by the lack of root/adm group access to critical system logs. Initial findings show no obvious application-level issues, but the nginx error log and system journal must be reviewed with root access before the root cause can be determined.
Timeline
Confirmed Events
- Last known activity (previous boot): March 2, 2026 15:38:59 UTC
- Gap period: March 2-24 (22 days) - NO BOOT LOGS AVAILABLE
- System reboot: March 24, 2026 12:56-12:57 UTC
- Current uptime: 2 days, 1 hour (as of investigation time)
- Current system state: Stable, all services running normally
Critical Unknown
There is a 22-day gap in boot records between March 2 and March 24. This could indicate:
- System was running continuously during this period and crashed on/before March 24
- Multiple unrecorded reboots occurred
- Journal corruption or rotation issues
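A quick way to narrow down these scenarios is to enumerate the boots the journal and wtmp actually recorded; a minimal check (may need sudo if the persistent journal is not readable by this user):
# List all boots known to the journal (index, boot ID, first/last timestamps)
journalctl --list-boots --no-pager
# Cross-check against wtmp reboot/shutdown records
last -x reboot shutdown | head -20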
What I Could Access (Non-Root Investigation)
✅ Successfully Checked
1. Current System Health
Memory: 7.8GB total, 464MB used, 5.9GB free (HEALTHY)
Disk: 30GB, 3.2GB used (12% usage - HEALTHY)
Swap: 976MB, 0B used (not being used)
Load Average: 0.00, 0.00, 0.00 (IDLE)
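The figures above come from standard tools and can be re-checked at any time without elevated privileges:
# Memory and swap usage
free -h
# Disk usage on the root filesystem
df -h /
# Load average and uptime
uptime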
2. Running Services
- nginx: 4 worker processes running normally
- php-fpm: Master + 2 workers (PHP 8.4)
- mariadb: Running (155MB RSS)
- All services appear healthy with normal memory usage
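Service state was checked with the usual process and systemd tooling; the same checks can be repeated as a normal user (process names are assumed to match the Debian-style binaries and may need adjusting):
# High-level service status (active/failed, recent log lines if readable)
systemctl status nginx php8.4-fpm mariadb --no-pager
# Per-process memory footprint (RSS in KB); adjust command names if they differ on this system
ps -o pid,rss,etime,cmd -C nginx,php-fpm8.4,mariadbd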
3. Nginx Configuration Analysis
Location: /etc/nginx/sites-available/posterg
Security Measures Found:
- Rate limiting configured:
- General requests: 30 req/min
- Search endpoint: 30 req/min (burst=10)
- Admin: 60 req/min (burst=20)
- Client max body size: 100MB
- Timeouts: 120 seconds (read/send)
- HTTP Basic Auth on the /admin/ directory
Potential Issues:
- ⚠️ Rate limits are relatively permissive (30 req/min could allow rapid resource consumption)
- ⚠️ Large upload size (100MB) combined with multiple concurrent uploads could exhaust memory
- ⚠️ 120-second timeouts on PHP processing could lead to worker process accumulation
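These directives can be confirmed against the running configuration (dumping the effective config generally requires sudo to read included files):
# Dump the full effective nginx configuration and pull out the relevant directives
sudo nginx -T 2>/dev/null | grep -E "limit_req|limit_conn|client_max_body_size|fastcgi_read_timeout|send_timeout"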
4. Application Architecture
Type: PHP-based thesis repository
Database: SQLite (located in /var/www/posterg/storage/posterg.db)
Framework: Custom PHP with:
- Database.php (SQLite handler)
- AdminAuth.php (authentication)
- RateLimit.php (custom rate limiting)
- Parsedown.php (markdown parser - 52KB, could be memory-intensive)
Endpoints:
- Public: index, search, thesis view (tfe.php), media, licenses
- Admin: CRUD operations, import, logs, maintenance mode
- File uploads: Media files and thesis PDFs
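Since uploads and imports go through PHP, the PHP-side limits matter as much as the nginx ones; a quick check, assuming the default Debian layout for PHP 8.4:
# Effective limits for the CLI SAPI (FPM values may differ if overridden per pool)
php -i | grep -E "memory_limit|upload_max_filesize|post_max_size|max_execution_time"
# FPM-specific ini, if present and readable
grep -E "memory_limit|upload_max_filesize|post_max_size" /etc/php/8.4/fpm/php.ini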
5. Log File Status
Nginx Access Logs:
- Current: posterg_access.log (133KB) - Last rotation: March 25, 2026 15:47
Nginx Error Logs:
- Current: posterg_error.log (234KB) ⚠️ LARGE SIZE
- Previous: posterg_error.log.1 (732B - from Mar 25)
Critical: Error log grew from 732B to 234KB in ~1 day. This suggests recent error activity.
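Even without permission to read the error log, its growth can be tracked from the directory listing (which is how the sizes above were obtained):
# Sizes and modification times of the posterg logs; the directory itself is readable
ls -lh --time-style=long-iso /var/log/nginx/posterg_*.log*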
What I CANNOT Access (Requires Root/Sudo)
🔒 Blocked Investigations
1. Nginx Error Logs ❌
Permission denied: /var/log/nginx/posterg_error.log
WHY CRITICAL: This 234KB error log likely contains the root cause. Typical error logs are <10KB.
Commands to run (as root):
# View recent errors before crash
sudo tail -1000 /var/log/nginx/posterg_error.log
# Check for PHP-FPM errors, memory exhaustion, timeouts
sudo grep -E "memory|exhausted|timeout|fatal|error" /var/log/nginx/posterg_error.log
# Look for patterns (repeated errors from specific IP/endpoint)
sudo awk '{print $1}' /var/log/nginx/posterg_error.log | sort | uniq -c | sort -rn | head -20
2. System Journal Logs ❌
journalctl: Users in groups 'adm', 'systemd-journal' can see all messages
WHY CRITICAL: Contains kernel messages, OOM killer events, service crashes, and the exact crash reason.
Commands to run (as root):
# Check last boot messages for crash indicators
sudo journalctl -b -1 --no-pager | grep -E "Out of memory|OOM|killed|panic|segfault"
# View kernel messages around crash time
sudo journalctl -k -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00"
# Check for PHP-FPM/nginx crashes
sudo journalctl -u php8.4-fpm -b -1 --since "2026-03-24 11:00"
sudo journalctl -u nginx -b -1 --since "2026-03-24 11:00"
# Look for repeated service restarts
sudo journalctl -b -1 | grep -E "Started|Stopped|Failed" | tail -100
3. Kernel Messages (dmesg) ❌
dmesg: Operation not permitted
WHY CRITICAL: Shows hardware errors, OOM kills, kernel panics, disk issues.
Commands to run (as root):
# Check for OOM killer activity
sudo dmesg -T | grep -i "out of memory"
# Check for hardware/disk errors
sudo dmesg -T | grep -i "error\|fail\|critical"
# Review last 200 kernel messages
sudo dmesg -T | tail -200
4. PHP-FPM Logs ❌
Permission denied: /var/log/php8.4-fpm.log
WHY CRITICAL: Shows PHP memory exhaustion, fatal errors, slow requests.
Commands to run (as root):
# Check for PHP memory errors
sudo grep -E "memory|fatal|error|segfault" /var/log/php8.4-fpm.log*
# Look for slow request logs
sudo find /var/log -name "*php*slow*" -exec cat {} \;
5. System Logs Archive
Location: /var/log/journal/9a57a2432f96427a80e97d1d269e6a58/
Contains binary journal files from previous boots but not readable without root.
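Once root access is available, those archived journal files can be read directly by pointing journalctl at the directory; a minimal sketch:
# List every boot recorded in the persistent journal, including the March 2-24 gap period
sudo journalctl -D /var/log/journal/9a57a2432f96427a80e97d1d269e6a58 --list-boots --no-pager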
Hypotheses (Ranked by Likelihood)
1. 🔥 Memory Exhaustion / OOM Killer (HIGH PROBABILITY)
Evidence:
- Large 100MB upload limit
- Multiple PHP-FPM workers could accumulate
- 234KB error log suggests many errors occurred
- System became completely unresponsive (classic OOM symptom)
Attack Vectors:
- Multiple concurrent large file uploads (thesis PDFs)
- Search endpoint abuse despite rate limiting
- SQLite database operations on large datasets
- Parsedown.php processing large markdown files
How to Confirm:
# Check for OOM killer evidence
sudo journalctl -b -1 | grep -i "oom"
sudo dmesg -T | grep -i "killed process"
sudo grep "Out of memory" /var/log/syslog* 2>/dev/null
2. ⚠️ PHP-FPM Process Accumulation (MEDIUM PROBABILITY)
Evidence:
- 120-second timeout allows long-running requests
- Slow SQLite queries could pile up
- If workers get stuck, new connections queue
How to Confirm:
# Check PHP-FPM configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -E "pm\.max_children|pm\.start_servers|pm\.min_spare|pm\.max_spare"
# Review PHP-FPM slow log
sudo cat /var/log/php8.4-fpm-slow.log 2>/dev/null
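If this hypothesis needs to be tested under live load, worker accumulation is visible from the process list; a sketch (the cmdline pattern assumes the standard "php-fpm: pool" worker naming):
# Count PHP-FPM pool workers every 5 seconds; a steadily rising count suggests stuck requests
watch -n 5 "pgrep -c -f 'php-fpm: pool'"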
3. ⚡ Database Lock Contention (MEDIUM PROBABILITY)
Evidence:
- SQLite with multiple concurrent writers
- Admin import operations + public searches simultaneously
- SQLite has limited concurrency (write locks entire database)
How to Confirm:
# Check error logs for "database is locked" messages
sudo grep -i "database.*lock" /var/log/nginx/posterg_error.log
# Check SQLite journal files (abandoned transactions)
ls -la /var/www/posterg/storage/*.db-journal 2>/dev/null
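If lock contention is confirmed, SQLite's write-ahead-log mode usually relieves it by letting readers proceed while a writer holds the lock; a minimal sketch (stop writers first; the mode change is persistent for the database file):
# Check the current journal mode, then switch to WAL
sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA journal_mode;"
sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA journal_mode=WAL;"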
4. 🌐 Brute Force / DDoS Attack (LOW-MEDIUM PROBABILITY)
Evidence:
- Rate limiting exists but is permissive (30 req/min = 1 every 2 seconds)
- Admin panel with HTTP Basic Auth (target for brute force)
- Public search endpoint
How to Confirm:
# Check for attack patterns in access logs
sudo zcat /var/log/nginx/posterg_access.log*.gz | \
awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# Look for 401/403 patterns (brute force attempts)
sudo grep -E " (401|403) " /var/log/nginx/posterg_access.log* | \
awk '{print $1}' | sort | uniq -c | sort -rn
# Check for high request rates
sudo awk '{print $4}' /var/log/nginx/posterg_access.log | cut -d: -f1-2 | \
uniq -c | sort -rn | head -20
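nginx also records rate-limit hits ("limiting requests") in the error log, which directly answers whether the configured limits were being tripped before the crash:
# Count rate-limit rejections per client IP
sudo grep "limiting requests" /var/log/nginx/posterg_error.log* | grep -oE "client: [0-9a-fA-F.:]+" | sort | uniq -c | sort -rn | head -20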
5. 🐛 Application Bug (LOW PROBABILITY)
Evidence:
- Database.php recently updated (Mar 24 14:49)
- 234KB error log indicates errors occurred
How to Confirm:
# Review nginx errors for PHP fatal errors
sudo grep "PHP Fatal" /var/log/nginx/posterg_error.log
# Check for infinite loops or memory leaks
sudo grep -E "Maximum execution time|memory limit" /var/log/nginx/posterg_error.log
Recommended Investigation Steps (For Root User)
Phase 1: Immediate Analysis (5 minutes)
# 1. Check the smoking gun - nginx error log
sudo tail -500 /var/log/nginx/posterg_error.log | less
# 2. Look for OOM killer
sudo journalctl -b -1 | grep -i "oom\|killed" | tail -50
# 3. Check journal around crash time
sudo journalctl -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00" | less
Phase 2: Deeper Analysis (15 minutes)
# 4. Export last boot journal to file for analysis
sudo journalctl -b -1 --no-pager > /tmp/last_boot_journal.log
chown theophile:theophile /tmp/last_boot_journal.log
# 5. Check PHP-FPM errors
sudo cat /var/log/php8.4-fpm.log* | grep -E "NOTICE|WARNING|ERROR"
# 6. Analyze access patterns before crash
sudo zcat /var/log/nginx/posterg_access.log*.gz 2>/dev/null | \
awk '$4 >= "[24/Mar/2026:11:00:" && $4 <= "[24/Mar/2026:13:00:"' | \
awk '{print $1}' | sort | uniq -c | sort -rn > /tmp/crash_access_analysis.txt
# 7. Check for database corruption
sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA integrity_check;"
Phase 3: System Health Check (10 minutes)
# 8. Review PHP-FPM pool configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -v "^;" | grep -v "^$"
# 9. Check system resource limits
ulimit -a
# 10. Review systemd service limits
systemctl show php8.4-fpm | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit"
systemctl show nginx | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit"
Preventive Measures to Implement
Immediate (Before Next Investigation)
- Add user to adm group for log access:
  sudo usermod -aG adm theophile
  sudo usermod -aG systemd-journal theophile
- Enable detailed error logging (temporarily):
  # In /etc/nginx/sites-available/posterg
  error_log /var/log/nginx/posterg_error.log debug;
  sudo systemctl reload nginx
- Enable PHP-FPM slow log:
  # In /etc/php/8.4/fpm/pool.d/www.conf
  slowlog = /var/log/php8.4-fpm-slow.log
  request_slowlog_timeout = 10s
  sudo systemctl restart php8.4-fpm
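Note that the group change (first item above) only takes effect in a new login session; a quick verification:
# Confirm the new groups are listed
id -nG theophile
# From a fresh session, confirm the error log is now readable
tail -n 5 /var/log/nginx/posterg_error.log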
Short-term (This Week)
- Tighten rate limits in nginx config:
  limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m;  # Was 30r/m
  limit_req_zone $binary_remote_addr zone=search:10m rate=5r/m;    # Was 30r/m
- Add connection limits:
  limit_conn_zone $binary_remote_addr zone=addr:10m;
  limit_conn addr 10;  # Max 10 concurrent connections per IP
- Reduce PHP-FPM timeout:
  fastcgi_read_timeout 60;  # Was 120
- Monitor memory usage:
  # Add to crontab
  */5 * * * * free -m >> /var/log/memory-monitor.log
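After editing the nginx site config for the rate and connection limits above, validate and reload before moving on:
# Syntax check, then graceful reload (no dropped connections)
sudo nginx -t && sudo systemctl reload nginx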
Long-term (This Month)
- Migrate from SQLite to PostgreSQL/MySQL for better concurrency
- Implement application-level logging (not just nginx/PHP-FPM)
- Add monitoring: Prometheus + Grafana or similar
- Configure log rotation more aggressively
- Set up automated alerts for high memory/CPU usage
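For the automated memory alert, a small cron-driven check is enough until proper monitoring exists; a minimal sketch (the script path, threshold, and log file are placeholders to adjust, run from root's crontab):
#!/bin/bash
# /usr/local/bin/mem-alert.sh (hypothetical path) - run every 5 minutes from cron;
# logs a warning when available memory drops below 500 MB (arbitrary threshold)
avail=$(free -m | awk '/^Mem:/ {print $7}')
if [ "$avail" -lt 500 ]; then
    echo "$(date -Is) LOW MEMORY: ${avail}MB available" >> /var/log/memory-alerts.log
fi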
Files to Review (When Root Access Available)
Priority 1 (Most Likely to Show Cause)
- /var/log/nginx/posterg_error.log (234KB - abnormally large)
- Journal logs for boot -1: journalctl -b -1
- Kernel messages: dmesg -T
Priority 2 (Supporting Evidence)
- /var/log/php8.4-fpm.log*
- /var/log/nginx/posterg_access.log* (attack pattern analysis)
- Systemd service logs: journalctl -u php8.4-fpm -b -1, journalctl -u nginx -b -1
Priority 3 (Configuration Review)
- /etc/php/8.4/fpm/pool.d/www.conf (worker limits, timeouts)
- /etc/security/limits.conf (system resource limits)
- /etc/systemd/system/php8.4-fpm.service.d/ (service overrides)
Questions to Answer
- What filled the 234KB error log? (Compare to normal ~1KB size)
- Was there an OOM killer event? (Check journalctl and dmesg)
- What happened between March 2-24? (22-day boot gap is suspicious)
- Were there repeated service crashes/restarts? (Check systemd journals)
- What was the last request before the crash? (Check nginx access logs)
- Is there evidence of an attack? (IP analysis, rate limit hits)
Next Steps
For theophile (with sudo access):
- Run Phase 1 commands immediately
- Export journal logs to /tmp/ for detailed review
- Review the nginx error log and identify patterns
- Share findings from the logs to determine whether the application is at fault
- Implement immediate preventive measures (user added to adm group, slow logging enabled)
For automated monitoring (recommended):
- Set up fail2ban for admin panel protection (see the sketch below)
- Configure monit or a similar tool for service health checks
- Enable automatic log forwarding to an external system (to prevent data loss on a crash)
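For the fail2ban item, the stock nginx-http-auth filter matches Basic Auth failures in an nginx error log; a minimal jail sketch, assuming fail2ban is installed and the default filter ships with this distribution's package:
# Hypothetical jail file; adjust logpath, maxretry, and bantime to taste
sudo tee /etc/fail2ban/jail.d/posterg-admin.local >/dev/null <<'EOF'
[nginx-http-auth]
enabled  = true
port     = http,https
logpath  = /var/log/nginx/posterg_error.log
maxretry = 5
bantime  = 3600
EOF
sudo systemctl restart fail2ban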
Investigation Status: ⏸️ PAUSED - Awaiting root access to critical logs
Risk Level: 🔴 HIGH - Cause unknown, could recur anytime
Recommended Priority: 🚨 URGENT - Next crash could cause data loss