xamxam/docs/VM_Crash_Reports.md
Théophile Gervreau-Mercier 5c5054d744 Investigating VM crash
2026-04-13 11:12:12 +02:00

# VM Crash Investigation Report - posterg.erg.be
**Date:** 2026-03-26
**Investigator:** Automated Investigation (Limited Access)
**Server:** theophile@posterg.erg.be:3274
## Executive Summary
The VM experienced an unresponsive state requiring a hard reboot on **March 24, 2026 at ~12:56 UTC**. Investigation was limited by lack of root/adm group access to critical system logs. Initial findings show no obvious application-level issues, but **critical system logs require root access for complete analysis**.
---
## Timeline
### Confirmed Events
- **Last known activity (previous boot):** March 2, 2026 15:38:59 UTC
- **Gap period:** March 2-24 (22 days) - **NO BOOT LOGS AVAILABLE**
- **System reboot:** March 24, 2026 12:56-12:57 UTC
- **Current uptime:** 2 days, 1 hour (as of investigation time)
- **Current system state:** Stable, all services running normally
### Critical Unknown
**There is a 22-day gap in boot records** between March 2 and March 24. This could indicate:
1. System was running continuously during this period and crashed on/before March 24
2. Multiple unrecorded reboots occurred
3. Journal corruption or rotation issues
---
## What I Could Access (Non-Root Investigation)
### ✅ Successfully Checked
#### 1. Current System Health
```
Memory: 7.8GB total, 464MB used, 5.9GB free (HEALTHY)
Disk: 30GB, 3.2GB used (12% usage - HEALTHY)
Swap: 976MB, 0B used (not being used)
Load Average: 0.00, 0.00, 0.00 (IDLE)
```
#### 2. Running Services
- **nginx:** 4 worker processes running normally
- **php-fpm:** Master + 2 workers (PHP 8.4)
- **mariadb:** Running (155MB RSS)
- All services appear healthy with normal memory usage
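The snapshot above can be re-captured at any time with a few non-root commands (a sketch; `mariadbd` is the modern MariaDB process name, older installs use `mysqld`):

```bash
# Non-root health snapshot: memory, disk, load, and per-service memory.
free -h                          # memory and swap usage
df -h /                          # root filesystem usage
cut -d' ' -f1-3 /proc/loadavg    # 1/5/15-minute load averages
# Resident set size (kB) per suspect process, largest first
ps -C nginx,php-fpm,mariadbd -o pid=,rss=,comm= 2>/dev/null | sort -k2 -rn
```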
#### 3. Nginx Configuration Analysis
**Location:** `/etc/nginx/sites-available/posterg`
**Security Measures Found:**
- Rate limiting configured:
  - General requests: 30 req/min
  - Search endpoint: 30 req/min (burst=10)
  - Admin: 60 req/min (burst=20)
- Client max body size: 100MB
- Timeouts: 120 seconds (read/send)
- HTTP Basic Auth on `/admin/` directory
**Potential Issues:**
- ⚠️ Rate limits are relatively **permissive** (30 req/min could allow rapid resource consumption)
- ⚠️ Large upload size (100MB) combined with multiple concurrent uploads could **exhaust memory**
- ⚠️ 120-second timeouts on PHP processing could lead to **worker process accumulation**
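Whether workers are actually piling up can be checked from this account with process counts alone (a sketch; the bracketed patterns stop `pgrep -f` from matching its own invocation):

```bash
# Worker head-count: a php-fpm count pinned at pm.max_children while
# requests queue is the accumulation symptom described above.
printf 'nginx workers:   %s\n' "$(pgrep -c -f '[n]ginx: worker')"
printf 'php-fpm workers: %s\n' "$(pgrep -c -f '[p]hp-fpm: pool')"
# Established connections on the web ports (ss is part of iproute2)
ss -tn state established '( sport = :80 or sport = :443 )' 2>/dev/null | tail -n +2 | wc -l
```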
#### 4. Application Architecture
**Type:** PHP-based thesis repository
**Database:** SQLite (located in `/var/www/posterg/storage/posterg.db`)
**Framework:** Custom PHP with:
- Database.php (SQLite handler)
- AdminAuth.php (authentication)
- RateLimit.php (custom rate limiting)
- Parsedown.php (markdown parser - 52KB, could be memory-intensive)
**Endpoints:**
- Public: index, search, thesis view (tfe.php), media, licenses
- Admin: CRUD operations, import, logs, maintenance mode
- File uploads: Media files and thesis PDFs
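A quick availability probe of the public side (a sketch; only `tfe.php` is named in the listing above, the other paths are hypothetical guesses, and the base URL assumes nginx answers locally):

```bash
# HTTP status per endpoint; after a crash, anything outside 2xx/3xx here
# points at an application-level problem rather than the VM itself.
base="http://localhost"                               # assumption: local vhost
for path in / /tfe.php /search.php /licenses.php; do  # last two paths are guesses
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$base$path")
    printf '%-16s %s\n' "$path" "$code"
done
```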
#### 5. Log File Status
**Nginx Access Logs:**
- Current: `posterg_access.log` (133KB)
- Last rotation: March 25, 2026 15:47
**Nginx Error Logs:**
- Current: `posterg_error.log` (234KB) ⚠️ **LARGE SIZE**
- Previous: `posterg_error.log.1` (732B - from Mar 25)
**Critical:** Error log grew from 732B to 234KB in ~1 day. **This suggests recent error activity.**
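Even without read permission on the log, its size and mtime remain visible as long as `/var/log/nginx` is traversable, so the growth rate can be tracked from this account (a sketch using GNU `stat`; run it from cron or under `watch -n 300` to build a time series):

```bash
# One timestamped size sample per run; a steady climb confirms ongoing
# error activity without needing to read the log itself.
log=/var/log/nginx/posterg_error.log
printf '%s  %s bytes\n' "$(date -u +%FT%TZ)" "$(stat -c %s "$log")"
```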
---
## What I CANNOT Access (Requires Root/Sudo)
### 🔒 Blocked Investigations
#### 1. **Nginx Error Logs** ❌
```bash
Permission denied: /var/log/nginx/posterg_error.log
```
**WHY CRITICAL:** This 234KB error log likely contains the root cause. Typical error logs are <10KB.
**Commands to run (as root):**
```bash
# View recent errors before crash
sudo tail -1000 /var/log/nginx/posterg_error.log
# Check for PHP-FPM errors, memory exhaustion, timeouts
sudo grep -E "memory|exhausted|timeout|fatal|error" /var/log/nginx/posterg_error.log
# Look for patterns (repeated errors from specific client IPs; error-log lines
# start with a timestamp, so extract the "client: ..." field instead of $1)
sudo grep -oE 'client: [0-9a-fA-F.:]+' /var/log/nginx/posterg_error.log | sort | uniq -c | sort -rn | head -20
```
#### 2. **System Journal Logs** ❌
```bash
journalctl: Users in groups 'adm', 'systemd-journal' can see all messages
```
**WHY CRITICAL:** Contains kernel messages, OOM killer events, service crashes, and the exact crash reason.
**Commands to run (as root):**
```bash
# Check last boot messages for crash indicators
sudo journalctl -b -1 --no-pager | grep -E "Out of memory|OOM|killed|panic|segfault"
# View kernel messages around crash time
sudo journalctl -k -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00"
# Check for PHP-FPM/nginx crashes
sudo journalctl -u php8.4-fpm -b -1 --since "2026-03-24 11:00"
sudo journalctl -u nginx -b -1 --since "2026-03-24 11:00"
# Look for repeated service restarts
sudo journalctl -b -1 | grep -E "Started|Stopped|Failed" | tail -100
```
#### 3. **Kernel Messages (dmesg)** ❌
```bash
dmesg: Operation not permitted
```
**WHY CRITICAL:** Shows hardware errors, OOM kills, kernel panics, disk issues. Note that `dmesg` only covers the current boot's ring buffer, so kernel messages from before the reboot must come from the persistent journal instead (`journalctl -k -b -1`).
**Commands to run (as root):**
```bash
# Check for OOM killer activity
sudo dmesg -T | grep -i "out of memory"
# Check for hardware/disk errors
sudo dmesg -T | grep -i "error\|fail\|critical"
# Review last 200 kernel messages
sudo dmesg -T | tail -200
```
#### 4. **PHP-FPM Logs** ❌
```bash
Permission denied: /var/log/php8.4-fpm.log
```
**WHY CRITICAL:** Shows PHP memory exhaustion, fatal errors, slow requests.
**Commands to run (as root):**
```bash
# Check for PHP memory errors
sudo grep -E "memory|fatal|error|segfault" /var/log/php8.4-fpm.log*
# Look for slow request logs
sudo find /var/log -name "*php*slow*" -exec cat {} \;
```
#### 5. **System Logs Archive**
**Location:** `/var/log/journal/9a57a2432f96427a80e97d1d269e6a58/`
Contains binary journal files from previous boots but **not readable without root**.
---
## Hypotheses (Ranked by Likelihood)
### 1. 🔥 **Memory Exhaustion / OOM Killer** (HIGH PROBABILITY)
**Evidence:**
- Large 100MB upload limit
- Multiple PHP-FPM workers could accumulate
- 234KB error log suggests many errors occurred
- System became completely unresponsive (classic OOM symptom)
**Attack Vectors:**
- Multiple concurrent large file uploads (thesis PDFs)
- Search endpoint abuse despite rate limiting
- SQLite database operations on large datasets
- Parsedown.php processing large markdown files
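Rough arithmetic behind this hypothesis (a sketch; the worker count and PHP `memory_limit` are guessed defaults, the real values live in `www.conf` and `php.ini`):

```bash
# Worst case if every PHP worker simultaneously handles a max-size upload:
upload_mb=100       # client_max_body_size from the nginx config above
workers=5           # assumption: pm.max_children default, verify in www.conf
php_limit_mb=128    # assumption: PHP memory_limit default
echo "$(( workers * (upload_mb + php_limit_mb) )) MB potentially in flight"
```

Even this conservative estimate ties up over a gigabyte; a raised `pm.max_children` or concurrent admin imports push it further toward the VM's 7.8 GB.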
**How to Confirm:**
```bash
# Check for OOM killer evidence
sudo journalctl -b -1 | grep -i "oom"
sudo dmesg -T | grep -i "killed process"
sudo grep "Out of memory" /var/log/syslog* 2>/dev/null
```
### 2. ⚠️ **PHP-FPM Process Accumulation** (MEDIUM PROBABILITY)
**Evidence:**
- 120-second timeout allows long-running requests
- Slow SQLite queries could pile up
- If workers get stuck, new connections queue
**How to Confirm:**
```bash
# Check PHP-FPM configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -E "pm\.max_children|pm\.start_servers|pm\.min_spare|pm\.max_spare"
# Review PHP-FPM slow log
sudo cat /var/log/php8.4-fpm-slow.log 2>/dev/null
```
### 3. ⚡ **Database Lock Contention** (MEDIUM PROBABILITY)
**Evidence:**
- SQLite with multiple concurrent writers
- Admin import operations + public searches simultaneously
- SQLite has limited concurrency (write locks entire database)
**How to Confirm:**
```bash
# Check error logs for "database is locked" messages
sudo grep -i "database.*lock" /var/log/nginx/posterg_error.log
# Check SQLite journal files (abandoned transactions)
ls -la /var/www/posterg/storage/*.db-journal 2>/dev/null
```
### 4. 🌐 **Brute Force / DDoS Attack** (LOW-MEDIUM PROBABILITY)
**Evidence:**
- Rate limiting exists but is permissive (30 req/min = 1 every 2 seconds)
- Admin panel with HTTP Basic Auth (target for brute force)
- Public search endpoint
**How to Confirm:**
```bash
# Check for attack patterns in access logs
sudo zcat /var/log/nginx/posterg_access.log*.gz | \
awk '{print $1}' | sort | uniq -c | sort -rn | head -20
# Look for 401/403 patterns (brute force attempts)
sudo grep -E " (401|403) " /var/log/nginx/posterg_access.log* | \
awk '{print $1}' | sort | uniq -c | sort -rn
# Check for high request rates
sudo awk '{print $4}' /var/log/nginx/posterg_access.log | cut -d: -f1-2 | \
uniq -c | sort -rn | head -20
```
### 5. 🐛 **Application Bug** (LOW PROBABILITY)
**Evidence:**
- Database.php recently updated (Mar 24 14:49)
- 234KB error log indicates errors occurred
**How to Confirm:**
```bash
# Review nginx errors for PHP fatal errors
sudo grep "PHP Fatal" /var/log/nginx/posterg_error.log
# Check for infinite loops or memory leaks
sudo grep -E "Maximum execution time|memory limit" /var/log/nginx/posterg_error.log
```
---
## Recommended Investigation Steps (For Root User)
### Phase 1: Immediate Analysis (5 minutes)
```bash
# 1. Check the smoking gun - nginx error log
sudo tail -500 /var/log/nginx/posterg_error.log | less
# 2. Look for OOM killer
sudo journalctl -b -1 | grep -i "oom\|killed" | tail -50
# 3. Check journal around crash time
sudo journalctl -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00" | less
```
### Phase 2: Deeper Analysis (15 minutes)
```bash
# 4. Export last boot journal to file for analysis
sudo journalctl -b -1 --no-pager > /tmp/last_boot_journal.log
# (the redirect is performed by the invoking shell, so the file is already owned by theophile)
# 5. Check PHP-FPM errors
sudo cat /var/log/php8.4-fpm.log* | grep -E "NOTICE|WARNING|ERROR"
# 6. Analyze access patterns before crash
sudo zcat /var/log/nginx/posterg_access.log*.gz 2>/dev/null | \
awk '$4 >= "[24/Mar/2026:11:00:" && $4 <= "[24/Mar/2026:13:00:"' | \
awk '{print $1}' | sort | uniq -c | sort -rn > /tmp/crash_access_analysis.txt
# 7. Check for database corruption (as the web user, to avoid root-owned WAL/journal files)
sudo -u www-data sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA integrity_check;"
```
### Phase 3: System Health Check (10 minutes)
```bash
# 8. Review PHP-FPM pool configuration
cat /etc/php/8.4/fpm/pool.d/www.conf | grep -v "^;" | grep -v "^$"
# 9. Check system resource limits
ulimit -a
# 10. Review systemd service limits
systemctl show php8.4-fpm | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit"
systemctl show nginx | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit"
```
---
## Preventive Measures to Implement
### Immediate (Before Next Investigation)
1. **Add user to adm group** for log access:
```bash
sudo usermod -aG adm theophile
sudo usermod -aG systemd-journal theophile
# group changes take effect on the next login (or immediately via `newgrp adm`)
```
2. **Enable detailed error logging** (temporarily):
```bash
# In /etc/nginx/sites-available/posterg
error_log /var/log/nginx/posterg_error.log debug;
sudo systemctl reload nginx
```
3. **Enable PHP-FPM slow log:**
```bash
# In /etc/php/8.4/fpm/pool.d/www.conf
slowlog = /var/log/php8.4-fpm-slow.log
request_slowlog_timeout = 10s
sudo systemctl restart php8.4-fpm
```
### Short-term (This Week)
1. **Tighten rate limits** in nginx config:
```nginx
limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m; # Was 30r/m
limit_req_zone $binary_remote_addr zone=search:10m rate=5r/m; # Was 30r/m
```
2. **Add connection limits:**
```nginx
limit_conn_zone $binary_remote_addr zone=addr:10m;
limit_conn addr 10; # Max 10 concurrent connections per IP
```
3. **Reduce PHP-FPM timeout:**
```nginx
fastcgi_read_timeout 60; # Was 120
```
4. **Monitor memory usage:**
```bash
# Add to root's crontab (prepend a timestamp; free's output has none of its own)
*/5 * * * * (date; free -m) >> /var/log/memory-monitor.log
```
### Long-term (This Month)
1. **Migrate from SQLite to PostgreSQL/MySQL** for better concurrency
2. **Implement application-level logging** (not just nginx/PHP-FPM)
3. **Add monitoring:** Prometheus + Grafana or similar
4. **Configure log rotation** more aggressively
5. **Set up automated alerts** for high memory/CPU usage
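The log-rotation item could start from a policy like this (a sketch written to a temp file; the paths and the USR1 reopen signal follow the Debian nginx convention, review it before moving it into `/etc/logrotate.d/` as root):

```bash
# Daily rotation with two weeks of compressed history for the posterg logs.
conf=$(mktemp)
cat > "$conf" <<'EOF'
/var/log/nginx/posterg_*.log {
    daily
    rotate 14
    compress
    delaycompress
    missingok
    notifempty
    sharedscripts
    postrotate
        [ -f /run/nginx.pid ] && kill -USR1 "$(cat /run/nginx.pid)"
    endscript
}
EOF
echo "wrote draft policy to $conf"
```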
---
## Files to Review (When Root Access Available)
### Priority 1 (Most Likely to Show Cause)
- [ ] `/var/log/nginx/posterg_error.log` (234KB - abnormally large)
- [ ] Journal logs for boot -1: `journalctl -b -1`
- [ ] Kernel messages: `dmesg -T`
### Priority 2 (Supporting Evidence)
- [ ] `/var/log/php8.4-fpm.log*`
- [ ] `/var/log/nginx/posterg_access.log*` (attack pattern analysis)
- [ ] Systemd service logs: `journalctl -u php8.4-fpm -b -1`, `journalctl -u nginx -b -1`
### Priority 3 (Configuration Review)
- [ ] `/etc/php/8.4/fpm/pool.d/www.conf` (worker limits, timeouts)
- [ ] `/etc/security/limits.conf` (system resource limits)
- [ ] `/etc/systemd/system/php8.4-fpm.service.d/` (service overrides)
---
## Questions to Answer
1. **What filled the 234KB error log?** (Compare to normal ~1KB size)
2. **Was there an OOM killer event?** (Check journalctl and dmesg)
3. **What happened between March 2-24?** (22-day boot gap is suspicious)
4. **Were there repeated service crashes/restarts?** (Check systemd journals)
5. **What was the last request before the crash?** (Check nginx access logs)
6. **Is there evidence of an attack?** (IP analysis, rate limit hits)
---
## Next Steps
**For theophile (with sudo access):**
1. Run Phase 1 commands immediately
2. Export journal logs to `/tmp/` for detailed review
3. Review nginx error log and identify patterns
4. Share findings from logs to determine if application is at fault
5. Implement immediate preventive measures (user to adm group, slow logging)
**For automated monitoring (recommended):**
- Set up `fail2ban` for admin panel protection
- Configure `monit` or similar for service health checks
- Enable automatic log forwarding to external system (prevent data loss on crash)
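The `fail2ban` item could look like this jail fragment (a sketch written to a temp file; it assumes fail2ban's stock `nginx-http-auth` filter, which matches basic-auth failures in the nginx error log — check that the filter exists in `/etc/fail2ban/filter.d/` before moving this into `jail.d/` as root and reloading fail2ban):

```bash
# Ban clients that fail HTTP Basic Auth on /admin/ five times in 10 minutes.
jail=$(mktemp)
cat > "$jail" <<'EOF'
[nginx-http-auth]
enabled  = true
port     = http,https
logpath  = /var/log/nginx/posterg_error.log
maxretry = 5
findtime = 10m
bantime  = 1h
EOF
echo "wrote draft jail to $jail"
```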
---
**Investigation Status:** ⏸️ PAUSED - Awaiting root access to critical logs
**Risk Level:** 🔴 HIGH - Cause unknown, could recur anytime
**Recommended Priority:** 🚨 URGENT - Next crash could cause data loss