# VM Crash Investigation Report - posterg.erg.be
**Date:** 2026-03-26

**Investigator:** Automated Investigation (Limited Access)

**Server:** theophile@posterg.erg.be:3274
## Executive Summary
The VM became unresponsive and required a hard reboot on **March 24, 2026 at ~12:56 UTC**. The investigation was limited by the lack of root/`adm`-group access to critical system logs. Initial findings show no obvious application-level issues, but **the critical system logs require root access for a complete analysis**.

---
## Timeline
### Confirmed Events

- **Last known activity (previous boot):** March 2, 2026 15:38:59 UTC
- **Gap period:** March 2-24 (22 days) - **NO BOOT LOGS AVAILABLE**
- **System reboot:** March 24, 2026 12:56-12:57 UTC
- **Current uptime:** 2 days, 1 hour (as of investigation time)
- **Current system state:** Stable, all services running normally
### Critical Unknown

**There is a 22-day gap in boot records** between March 2 and March 24. This could indicate:

1. The system was running continuously during this period and crashed on/before March 24
2. Multiple unrecorded reboots occurred
3. Journal corruption or rotation issues
---
## What I Could Access (Non-Root Investigation)

### ✅ Successfully Checked

#### 1. Current System Health

```
Memory: 7.8GB total, 464MB used, 5.9GB free (HEALTHY)
Disk: 30GB, 3.2GB used (12% usage - HEALTHY)
Swap: 976MB, 0B used (not being used)
Load Average: 0.00, 0.00, 0.00 (IDLE)
```
#### 2. Running Services
- **nginx:** 4 worker processes running normally
- **php-fpm:** master + 2 workers (PHP 8.4)
- **mariadb:** running (155MB RSS)
- All services appear healthy, with normal memory usage
#### 3. Nginx Configuration Analysis
**Location:** `/etc/nginx/sites-available/posterg`

**Security Measures Found:**

- Rate limiting configured:
  - General requests: 30 req/min
  - Search endpoint: 30 req/min (burst=10)
  - Admin: 60 req/min (burst=20)
- Client max body size: 100MB
- Timeouts: 120 seconds (read/send)
- HTTP Basic Auth on `/admin/` directory
**Potential Issues:**

- ⚠️ Rate limits are relatively **permissive** (30 req/min still allows steady resource consumption)
- ⚠️ The large upload size (100MB), combined with multiple concurrent uploads, could **exhaust memory**
- ⚠️ 120-second timeouts on PHP processing could lead to **worker process accumulation**
#### 4. Application Architecture
**Type:** PHP-based thesis repository

**Database:** SQLite (located at `/var/www/posterg/storage/posterg.db`)

**Framework:** Custom PHP with:

- Database.php (SQLite handler)
- AdminAuth.php (authentication)
- RateLimit.php (custom rate limiting)
- Parsedown.php (Markdown parser - 52KB, could be memory-intensive)

**Endpoints:**

- Public: index, search, thesis view (tfe.php), media, licenses
- Admin: CRUD operations, import, logs, maintenance mode
- File uploads: media files and thesis PDFs
#### 5. Log File Status
**Nginx Access Logs:**

- Current: `posterg_access.log` (133KB)
- Last rotation: March 25, 2026 15:47

**Nginx Error Logs:**

- Current: `posterg_error.log` (234KB) ⚠️ **LARGE SIZE**
- Previous: `posterg_error.log.1` (732B - from Mar 25)

**Critical:** The error log grew from 732B to 234KB in about one day. **This suggests recent error activity.**
---
## What I CANNOT Access (Requires Root/Sudo)

### 🔒 Blocked Investigations
#### 1. **Nginx Error Logs** ❌
```
Permission denied: /var/log/nginx/posterg_error.log
```

**WHY CRITICAL:** This 234KB error log likely contains the root cause. Typical error logs here are under 10KB.

**Commands to run (as root):**

```bash
# View recent errors before the crash
sudo tail -1000 /var/log/nginx/posterg_error.log

# Check for PHP-FPM errors, memory exhaustion, timeouts
sudo grep -E "memory|exhausted|timeout|fatal|error" /var/log/nginx/posterg_error.log

# Look for patterns (repeated errors from a specific client IP)
sudo grep -oE "client: [0-9.]+" /var/log/nginx/posterg_error.log | sort | uniq -c | sort -rn | head -20
```
#### 2. **System Journal Logs** ❌
```
journalctl: Users in groups 'adm', 'systemd-journal' can see all messages
```

**WHY CRITICAL:** Contains kernel messages, OOM-killer events, service crashes, and the exact crash reason.

**Commands to run (as root):**

```bash
# List recorded boots (directly answers the 22-day-gap question)
sudo journalctl --list-boots

# Check last boot's messages for crash indicators
sudo journalctl -b -1 --no-pager | grep -E "Out of memory|OOM|killed|panic|segfault"

# View kernel messages around the crash time
sudo journalctl -k -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00"

# Check for PHP-FPM/nginx crashes
sudo journalctl -u php8.4-fpm -b -1 --since "2026-03-24 11:00"
sudo journalctl -u nginx -b -1 --since "2026-03-24 11:00"

# Look for repeated service restarts
sudo journalctl -b -1 | grep -E "Started|Stopped|Failed" | tail -100
```
#### 3. **Kernel Messages (dmesg)** ❌
```
dmesg: Operation not permitted
```

**WHY CRITICAL:** Shows hardware errors, OOM kills, kernel panics, and disk issues.

**Commands to run (as root):**

```bash
# Check for OOM-killer activity
sudo dmesg -T | grep -i "out of memory"

# Check for hardware/disk errors
sudo dmesg -T | grep -iE "error|fail|critical"

# Review the last 200 kernel messages
sudo dmesg -T | tail -200
```
#### 4. **PHP-FPM Logs** ❌
```
Permission denied: /var/log/php8.4-fpm.log
```

**WHY CRITICAL:** Shows PHP memory exhaustion, fatal errors, and slow requests.

**Commands to run (as root):**

```bash
# Check for PHP memory errors
sudo grep -E "memory|fatal|error|segfault" /var/log/php8.4-fpm.log*

# Look for slow request logs
sudo find /var/log -name "*php*slow*" -exec cat {} \;
```
#### 5. **System Logs Archive**
**Location:** `/var/log/journal/9a57a2432f96427a80e97d1d269e6a58/`

Contains binary journal files from previous boots, but these are **not readable without root**.

---
## Hypotheses (Ranked by Likelihood)
### 1. 🔥 **Memory Exhaustion / OOM Killer** (HIGH PROBABILITY)

**Evidence:**

- Large 100MB upload limit
- Multiple PHP-FPM workers could accumulate
- The 234KB error log suggests many errors occurred
- The system became completely unresponsive (a classic OOM symptom)

**Attack Vectors:**

- Multiple concurrent large file uploads (thesis PDFs)
- Search endpoint abuse despite rate limiting
- SQLite database operations on large datasets
- Parsedown.php processing large Markdown files

**How to Confirm:**

```bash
# Check for OOM-killer evidence
sudo journalctl -b -1 | grep -i "oom"
sudo dmesg -T | grep -i "killed process"
sudo grep "Out of memory" /var/log/syslog* 2>/dev/null
```
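A quick back-of-envelope check shows why this hypothesis is plausible: if PHP-FPM is allowed more workers than the RAM can hold, worst-case usage exceeds physical memory. The numbers below are illustrative assumptions, not measurements from this VM (the real values come from `pm.max_children` in www.conf and the observed worker RSS):

```shell
# Worst-case PHP-FPM memory footprint vs. available RAM.
# All values are illustrative assumptions, not measurements.
max_children=50      # hypothetical pm.max_children
per_worker_mb=120    # hypothetical peak RSS per worker, in MB
ram_mb=7800          # ~7.8GB total, from the health check above

worst_case=$(( max_children * per_worker_mb ))
echo "worst case: ${worst_case} MB of ${ram_mb} MB"
```

With these example numbers the pool alone could consume 6000 MB, leaving little headroom once MariaDB, nginx, and the page cache are added, which is exactly the scenario where the OOM killer fires.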
### 2. ⚠️ **PHP-FPM Process Accumulation** (MEDIUM PROBABILITY)

**Evidence:**

- The 120-second timeout allows long-running requests
- Slow SQLite queries could pile up
- If workers get stuck, new connections queue

**How to Confirm:**

```bash
# Check the PHP-FPM pool configuration
grep -E "pm\.max_children|pm\.start_servers|pm\.min_spare|pm\.max_spare" /etc/php/8.4/fpm/pool.d/www.conf

# Review the PHP-FPM slow log
sudo cat /var/log/php8.4-fpm-slow.log 2>/dev/null
```
### 3. ⚡ **Database Lock Contention** (MEDIUM PROBABILITY)

**Evidence:**

- SQLite with multiple concurrent writers
- Admin import operations and public searches running simultaneously
- SQLite has limited concurrency (a write locks the entire database)

**How to Confirm:**

```bash
# Check error logs for "database is locked" messages
sudo grep -i "database.*lock" /var/log/nginx/posterg_error.log

# Check for leftover SQLite journal files (abandoned transactions)
ls -la /var/www/posterg/storage/*.db-journal 2>/dev/null
```
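If lock contention is confirmed, switching SQLite to WAL (write-ahead logging) mode lets readers proceed while a single writer works, which usually eases this class of problem. A sketch on a throwaway database, assuming the `sqlite3` CLI is installed (the production path would be `/var/www/posterg/storage/posterg.db`):

```shell
# Demonstrate switching a SQLite database to WAL mode, on a scratch file.
db=$(mktemp /tmp/wal-demo.XXXXXX)
sqlite3 "$db" "CREATE TABLE t(x);"
sqlite3 "$db" "PRAGMA journal_mode=WAL;"   # prints the resulting mode: wal
rm -f "$db" "$db"-wal "$db"-shm
```

The mode persists in the database file, so one change applies to all connections; note that the application's busy timeout (`PRAGMA busy_timeout` or PDO's equivalent) also matters, so treat this as a mitigation sketch, not a drop-in fix.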
### 4. 🌐 **Brute Force / DDoS Attack** (LOW-MEDIUM PROBABILITY)

**Evidence:**

- Rate limiting exists but is permissive (30 req/min = one request every 2 seconds)
- Admin panel with HTTP Basic Auth (a brute-force target)
- Public search endpoint

**How to Confirm:**

```bash
# Check for attack patterns in access logs (requests per client IP)
sudo zcat /var/log/nginx/posterg_access.log*.gz | \
  awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# Look for 401/403 patterns (brute-force attempts; zgrep handles the .gz rotations too)
sudo zgrep -E " (401|403) " /var/log/nginx/posterg_access.log* | \
  awk '{print $1}' | sort | uniq -c | sort -rn

# Check for high request rates per hour
sudo awk '{print $4}' /var/log/nginx/posterg_access.log | cut -d: -f1-2 | \
  sort | uniq -c | sort -rn | head -20
```
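To make the per-IP pipeline above concrete, here is how it behaves on a few synthetic access-log lines (the IPs are documentation addresses, not real traffic from this server):

```shell
# Count requests per client IP (field 1) on an inline sample.
printf '%s\n' \
  '203.0.113.5 - - [24/Mar/2026:12:01:01 +0000] "GET /search.php?q=x HTTP/1.1" 200 512' \
  '203.0.113.5 - - [24/Mar/2026:12:01:02 +0000] "GET /search.php?q=y HTTP/1.1" 200 512' \
  '198.51.100.7 - - [24/Mar/2026:12:05:10 +0000] "GET /index.php HTTP/1.1" 200 1024' \
  | awk '{print $1}' | sort | uniq -c | sort -rn
# First column is the request count, busiest IP first:
#   2 203.0.113.5
#   1 198.51.100.7
```

On real logs, one IP dwarfing the rest (especially against `/search.php` or `/admin/`) is the signature to look for.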
### 5. 🐛 **Application Bug** (LOW PROBABILITY)

**Evidence:**

- Database.php recently updated (Mar 24 14:49)
- The 234KB error log indicates errors occurred

**How to Confirm:**

```bash
# Review nginx errors for PHP fatal errors
sudo grep "PHP Fatal" /var/log/nginx/posterg_error.log

# Check for runaway execution time or memory-limit hits
sudo grep -E "Maximum execution time|memory limit" /var/log/nginx/posterg_error.log
```
---

## Recommended Investigation Steps (For Root User)
### Phase 1: Immediate Analysis (5 minutes)
```bash
# 1. Check the smoking gun - the nginx error log
sudo tail -500 /var/log/nginx/posterg_error.log | less

# 2. Look for the OOM killer
sudo journalctl -b -1 | grep -iE "oom|killed" | tail -50

# 3. Check the journal around the crash time
sudo journalctl -b -1 --since "2026-03-24 12:00" --until "2026-03-24 13:00" | less
```
### Phase 2: Deeper Analysis (15 minutes)

```bash
# 4. Export the last boot's journal to a file for analysis
#    (the redirection runs as the invoking user, so the file is owned by theophile)
sudo journalctl -b -1 --no-pager > /tmp/last_boot_journal.log

# 5. Check PHP-FPM errors
sudo grep -E "NOTICE|WARNING|ERROR" /var/log/php8.4-fpm.log*

# 6. Analyze access patterns before the crash
sudo zcat /var/log/nginx/posterg_access.log*.gz 2>/dev/null | \
  awk '$4 >= "[24/Mar/2026:11:00:" && $4 <= "[24/Mar/2026:13:00:"' | \
  awk '{print $1}' | sort | uniq -c | sort -rn > /tmp/crash_access_analysis.txt

# 7. Check for database corruption
sqlite3 /var/www/posterg/storage/posterg.db "PRAGMA integrity_check;"
```
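Step 6 relies on a subtlety worth noting: the `$4` range test compares strings lexicographically, which matches chronological order only within a single day (the `[24/Mar/2026:` prefix is identical, so the comparison falls through to the `HH:MM` digits). A self-contained check of that filter on sample lines:

```shell
# Keep only requests between 11:00 and 13:00 on 24/Mar/2026.
printf '%s\n' \
  'a - - [24/Mar/2026:10:59:59 +0000] "GET / HTTP/1.1" 200 1' \
  'b - - [24/Mar/2026:12:30:00 +0000] "GET / HTTP/1.1" 200 1' \
  'c - - [24/Mar/2026:14:00:00 +0000] "GET / HTTP/1.1" 200 1' \
  | awk '$4 >= "[24/Mar/2026:11:00:" && $4 <= "[24/Mar/2026:13:00:"' \
  | awk '{print $1}'
# Prints only: b
```

If the window ever spans midnight or a month boundary, this trick breaks; split the query per day instead.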
### Phase 3: System Health Check (10 minutes)

```bash
# 8. Review the PHP-FPM pool configuration (active settings only)
grep -vE "^;|^$" /etc/php/8.4/fpm/pool.d/www.conf

# 9. Check system resource limits
ulimit -a

# 10. Review systemd service limits
systemctl show php8.4-fpm | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit|MemoryMax"
systemctl show nginx | grep -E "LimitNOFILE|LimitNPROC|MemoryLimit|MemoryMax"
```
---

## Preventive Measures to Implement
### Immediate (Before Next Investigation)
1. **Add the user to the adm group** for log access (takes effect at the next login):

   ```bash
   sudo usermod -aG adm theophile
   sudo usermod -aG systemd-journal theophile
   ```

2. **Enable detailed error logging** (temporarily):

   ```bash
   # In /etc/nginx/sites-available/posterg, set:
   #   error_log /var/log/nginx/posterg_error.log debug;
   sudo systemctl reload nginx
   ```

3. **Enable the PHP-FPM slow log:**

   ```bash
   # In /etc/php/8.4/fpm/pool.d/www.conf, set:
   #   slowlog = /var/log/php8.4-fpm-slow.log
   #   request_slowlog_timeout = 10s
   sudo systemctl restart php8.4-fpm
   ```
### Short-term (This Week)
1. **Tighten rate limits** in the nginx config:

   ```nginx
   limit_req_zone $binary_remote_addr zone=general:10m rate=10r/m; # Was 30r/m
   limit_req_zone $binary_remote_addr zone=search:10m rate=5r/m;   # Was 30r/m
   ```

2. **Add connection limits:**

   ```nginx
   limit_conn_zone $binary_remote_addr zone=addr:10m;
   limit_conn addr 10; # Max 10 concurrent connections per IP
   ```

3. **Reduce the PHP-FPM timeout:**

   ```nginx
   fastcgi_read_timeout 60; # Was 120
   ```

4. **Monitor memory usage:**

   ```bash
   # Add to root's crontab (crontab -e as root); the date call timestamps each sample
   */5 * * * * { date; free -m; } >> /var/log/memory-monitor.log
   ```
### Long-term (This Month)
1. **Migrate from SQLite to PostgreSQL/MySQL** for better concurrency
2. **Implement application-level logging** (not just nginx/PHP-FPM)
3. **Add monitoring:** Prometheus + Grafana or similar
4. **Configure log rotation** more aggressively
5. **Set up automated alerts** for high memory/CPU usage
---
## Files to Review (When Root Access Available)
### Priority 1 (Most Likely to Show Cause)
- [ ] `/var/log/nginx/posterg_error.log` (234KB - abnormally large)
- [ ] Journal logs for boot -1: `journalctl -b -1`
- [ ] Kernel messages: `dmesg -T`
### Priority 2 (Supporting Evidence)
- [ ] `/var/log/php8.4-fpm.log*`
- [ ] `/var/log/nginx/posterg_access.log*` (attack-pattern analysis)
- [ ] Systemd service logs: `journalctl -u php8.4-fpm -b -1`, `journalctl -u nginx -b -1`
### Priority 3 (Configuration Review)
- [ ] `/etc/php/8.4/fpm/pool.d/www.conf` (worker limits, timeouts)
- [ ] `/etc/security/limits.conf` (system resource limits)
- [ ] `/etc/systemd/system/php8.4-fpm.service.d/` (service overrides)

---
## Questions to Answer
1. **What filled the 234KB error log?** (Compare to the normal ~1KB size)
2. **Was there an OOM-killer event?** (Check journalctl and dmesg)
3. **What happened between March 2 and 24?** (The 22-day boot gap is suspicious)
4. **Were there repeated service crashes/restarts?** (Check the systemd journals)
5. **What was the last request before the crash?** (Check the nginx access logs)
6. **Is there evidence of an attack?** (IP analysis, rate-limit hits)

---
## Next Steps
**For theophile (with sudo access):**

1. Run the Phase 1 commands immediately
2. Export journal logs to `/tmp/` for detailed review
3. Review the nginx error log and identify patterns
4. Share findings from the logs to determine whether the application is at fault
5. Implement the immediate preventive measures (adm group membership, slow logging)

**For automated monitoring (recommended):**

- Set up `fail2ban` for admin-panel protection
- Configure `monit` or similar for service health checks
- Enable automatic log forwarding to an external system (prevents data loss on a crash)
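For the log-forwarding item, the classic minimal approach on a Debian-style host is an rsyslog forwarding rule. This is a sketch only; the file name and `logs.example.org` are placeholders for a real collector:

```
# /etc/rsyslog.d/90-forward.conf (hypothetical file name)
# Forward all messages to a remote collector; @@ = TCP, a single @ = UDP.
*.* @@logs.example.org:514
```

Even this crude rule would have preserved the March 24 kernel messages off-box, so the cause of the crash would not depend on local journal survival.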
---

**Investigation Status:** ⏸️ PAUSED - Awaiting root access to critical logs

**Risk Level:** 🔴 HIGH - Cause unknown, could recur at any time

**Recommended Priority:** 🚨 URGENT - The next crash could cause data loss