# POSTERG Migration Analysis: YAML to SQLite

## Executive Summary

This analysis examines the migration of the POSTERG thesis archive system from static YAML files to a SQLite database. The goal is to improve performance, scalability, and usability while maintaining the project's core mission of preserving and sharing master's theses from the ERG art school.

---

## Current System Architecture

### Repository Structure

The project consists of two interconnected repositories:

1. **posterg-formulaire**: Submission interface
   - PHP-based web form for students to submit thesis metadata
   - Handles file uploads (PDFs, images, videos, ZIP archives)
   - Generates YAML files with unique identifiers
   - Stores uploaded content in structured directories

2. **posterg-website**: Public display interface
   - PHP-based browsing and viewing application
   - Reads YAML files on each request
   - Presents paginated gallery view
   - Individual thesis detail pages with embedded media

### Data Model (Current)

Each thesis entry contains:
- **auteurice**: Author name
- **année**: Graduation year
- **email**: Contact email
- **titre**: Thesis title
- **tag**: Array of keywords/tags
- **promoteurice**: Thesis supervisor(s)
- **problématique**: Problem statement
- **description**: Full description/abstract
- **orientation**: Program/department
- **ap**: Additional program designation
- **couverture**: Cover image path
- **files**: Array of associated file paths

### Storage Structure

```
data/
├── yaml/           # Metadata files (13 currently)
├── content/        # Uploaded thesis files
│   └── {year}/
│       └── {author}/
│           └── [files]
└── cover/          # Cover images
```

Current data volume: 69 MB

---

## Performance Bottlenecks Identified

### Critical Issues

1. **File System Scanning on Every Request**
   - `glob("data/yaml/*.yaml")` scans the directory on every page load
   - No caching mechanism
   - Linear time complexity O(n) for file discovery

2. **Complete Dataset Loading**
   - All YAML files parsed on every request
   - Memory consumption scales linearly with dataset size
   - Even a single paginated view requires loading the complete dataset

3. **Sorting After Loading**
   - All records loaded into memory before sorting
   - Year-based sorting happens at runtime
   - No pre-computed order

4. **No Query Optimization**
   - Cannot filter at the data source
   - No indexes for common access patterns
   - Tag searches would require parsing all files

### Scalability Concerns

With the current architecture at 100 theses:
- 100 file reads per request, on top of the directory scan
- 100 YAML parse operations per request
- Full dataset held in memory for sorting

With projected growth to 1000+ theses:
- System becomes unusable
- Memory exhaustion likely
- Response times measured in seconds

---

## SQLite Migration Benefits

### Performance Improvements

1. **Query Optimization**
   - Direct pagination with LIMIT/OFFSET
   - Index-based lookups (year, author, tags)
   - Sorted results without in-memory processing
   - Query only needed fields

2. **Search Capabilities**
   - Full-text search on titles and descriptions
   - Tag-based filtering with JOIN tables
   - Year range queries
   - Author/supervisor lookups
   - Multi-criteria searches

3. **Caching Potential**
   - Database query results cacheable
   - Prepared statements for repeated queries
   - Connection reuse (SQLite runs in-process, so opening a connection is cheap and no server-side pool is needed)

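
The pagination point above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module rather than the project's PHP/PDO code (the SQL is the same either way); the three-column table, the `fetch_page` helper, and the generated rows are assumptions made for the example.

```python
import sqlite3

def fetch_page(conn, page, page_size=12):
    """Return one gallery page, newest year first, fetching only that page's rows."""
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT id, title, year FROM theses "
        "ORDER BY year DESC, id DESC LIMIT ? OFFSET ?",
        (page_size, offset),
    ).fetchall()

# Hypothetical cut-down table and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE theses (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_theses_year ON theses(year DESC)")
conn.executemany(
    "INSERT INTO theses (title, year) VALUES (?, ?)",
    [(f"Thesis {i}", 2000 + i % 25) for i in range(100)],
)

first_page = fetch_page(conn, page=1)
```

Only the requested page reaches application memory. `OFFSET` still steps over the skipped rows inside SQLite, so very deep pages might eventually call for keyset pagination instead.
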
### Functional Enhancements

1. **Advanced Filtering**
   - Filter by year, program, supervisor
   - Multi-tag selection
   - Combined criteria (AND/OR logic)
   - Exclusion filters

2. **Statistical Views**
   - Theses per year graphs
   - Most common tags
   - Popular programs/orientations
   - Supervisor contribution counts

3. **Related Content**
   - "Similar theses" based on tags
   - Same supervisor works
   - Year cohort browsing

4. **Data Validation**
   - Schema enforcement
   - Required field validation
   - Foreign key constraints
   - Data type checking

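
Several of the views listed above reduce to short aggregate queries. The sketch below, written with Python's built-in `sqlite3` module instead of the project's PHP, shows three of them: theses per year, most common tags, and "similar theses" ranked by the number of shared tags. The cut-down tables and the sample rows are invented for the example, and ranking by shared-tag count is only one possible definition of similarity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE theses (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year INTEGER NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE thesis_tags (
    thesis_id INTEGER REFERENCES theses(id),
    tag_id INTEGER REFERENCES tags(id),
    PRIMARY KEY (thesis_id, tag_id)
);
-- Invented sample data.
INSERT INTO theses VALUES (1, 'A', 2023), (2, 'B', 2023), (3, 'C', 2024);
INSERT INTO tags VALUES (1, 'typography'), (2, 'video'), (3, 'sound');
INSERT INTO thesis_tags VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 3);
""")

# Theses per year.
per_year = conn.execute(
    "SELECT year, COUNT(*) FROM theses GROUP BY year ORDER BY year"
).fetchall()

# Most common tags, most used first.
top_tags = conn.execute("""
    SELECT t.name, COUNT(*) AS uses
    FROM tags t JOIN thesis_tags tt ON tt.tag_id = t.id
    GROUP BY t.id ORDER BY uses DESC, t.name LIMIT 10
""").fetchall()

# "Similar theses": other theses ranked by how many tags they share with thesis 1.
similar = conn.execute("""
    SELECT other.thesis_id, COUNT(*) AS shared
    FROM thesis_tags mine
    JOIN thesis_tags other
      ON other.tag_id = mine.tag_id AND other.thesis_id != mine.thesis_id
    WHERE mine.thesis_id = ?
    GROUP BY other.thesis_id
    ORDER BY shared DESC, other.thesis_id
""", (1,)).fetchall()
```
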
---

## Proposed Database Schema

### Core Tables

**theses**
- id: INTEGER PRIMARY KEY AUTOINCREMENT
- author: TEXT NOT NULL
- year: INTEGER NOT NULL
- email: TEXT
- title: TEXT NOT NULL
- supervisor: TEXT
- problem_statement: TEXT
- description: TEXT NOT NULL
- orientation: TEXT
- additional_program: TEXT
- cover_image_path: TEXT
- external_link: TEXT
- created_at: DATETIME DEFAULT CURRENT_TIMESTAMP
- updated_at: DATETIME DEFAULT CURRENT_TIMESTAMP

**tags**
- id: INTEGER PRIMARY KEY AUTOINCREMENT
- name: TEXT UNIQUE NOT NULL

**thesis_tags** (many-to-many junction)
- thesis_id: INTEGER FOREIGN KEY → theses(id)
- tag_id: INTEGER FOREIGN KEY → tags(id)
- PRIMARY KEY (thesis_id, tag_id)

**files**
- id: INTEGER PRIMARY KEY AUTOINCREMENT
- thesis_id: INTEGER FOREIGN KEY → theses(id)
- file_path: TEXT NOT NULL
- file_type: TEXT NOT NULL (pdf, image, video, archive)
- file_size: INTEGER
- mime_type: TEXT
- uploaded_at: DATETIME DEFAULT CURRENT_TIMESTAMP

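
The table descriptions above translate into DDL roughly as follows. This is a sketch run through Python's built-in `sqlite3` module (the project would issue the same SQL via PDO). Three details are assumptions beyond the listed columns: the `ON DELETE CASCADE` clauses, the `CHECK` constraint on `file_type`, and the `open_db` helper. Note also that SQLite only enforces foreign keys on connections that set `PRAGMA foreign_keys = ON`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE theses (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    author TEXT NOT NULL,
    year INTEGER NOT NULL,
    email TEXT,
    title TEXT NOT NULL,
    supervisor TEXT,
    problem_statement TEXT,
    description TEXT NOT NULL,
    orientation TEXT,
    additional_program TEXT,
    cover_image_path TEXT,
    external_link TEXT,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE tags (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE thesis_tags (
    -- ON DELETE CASCADE is an assumption, not part of the proposal.
    thesis_id INTEGER NOT NULL REFERENCES theses(id) ON DELETE CASCADE,
    tag_id INTEGER NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (thesis_id, tag_id)
);
CREATE TABLE files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thesis_id INTEGER NOT NULL REFERENCES theses(id) ON DELETE CASCADE,
    file_path TEXT NOT NULL,
    -- The CHECK list mirrors the four file types named in the proposal.
    file_type TEXT NOT NULL CHECK (file_type IN ('pdf', 'image', 'video', 'archive')),
    file_size INTEGER,
    mime_type TEXT,
    uploaded_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
"""

def open_db(path=":memory:"):
    """Open a connection with foreign keys enforced and the schema created."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
    return conn

conn = open_db()
try:
    # thesis 999 does not exist, so the foreign key should reject this row
    conn.execute(
        "INSERT INTO files (thesis_id, file_path, file_type) VALUES (999, 'x.pdf', 'pdf')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

`DEFAULT CURRENT_TIMESTAMP` only applies on insert, so keeping `updated_at` current would need an `UPDATE` trigger or an explicit assignment in the application.
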
### Indexes for Performance

```sql
CREATE INDEX idx_theses_year ON theses(year DESC);
CREATE INDEX idx_theses_author ON theses(author);
CREATE INDEX idx_theses_orientation ON theses(orientation);
-- Redundant: the UNIQUE constraint on tags.name already creates an index.
CREATE INDEX idx_tags_name ON tags(name);
-- Redundant: the composite primary key (thesis_id, tag_id) already serves lookups by thesis_id.
CREATE INDEX idx_thesis_tags_thesis ON thesis_tags(thesis_id);
CREATE INDEX idx_thesis_tags_tag ON thesis_tags(tag_id);
CREATE INDEX idx_files_thesis ON files(thesis_id);
```

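### Full-Text Search

```sql
-- External-content table: the index reads its text from theses, so it needs
-- triggers (or periodic rebuilds) to stay in sync with that table.
CREATE VIRTUAL TABLE theses_fts USING fts5(
    title,
    description,
    problem_statement,
    content='theses',
    content_rowid='id'
);
```
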
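
An external-content FTS5 table does not follow changes to its source table by itself. The sketch below adds the insert/delete/update triggers described in the SQLite FTS5 documentation and runs a `MATCH` query, using Python's built-in `sqlite3` module in place of the project's PHP. It assumes an SQLite build with FTS5 enabled; the trigger names, the `search` helper, the reduced `theses` table, and the sample rows are all illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE theses (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT NOT NULL,
    problem_statement TEXT
);
CREATE VIRTUAL TABLE theses_fts USING fts5(
    title, description, problem_statement,
    content='theses', content_rowid='id'
);
-- Keep the index in sync with the source table.
CREATE TRIGGER theses_ai AFTER INSERT ON theses BEGIN
    INSERT INTO theses_fts(rowid, title, description, problem_statement)
    VALUES (new.id, new.title, new.description, new.problem_statement);
END;
CREATE TRIGGER theses_ad AFTER DELETE ON theses BEGIN
    INSERT INTO theses_fts(theses_fts, rowid, title, description, problem_statement)
    VALUES ('delete', old.id, old.title, old.description, old.problem_statement);
END;
CREATE TRIGGER theses_au AFTER UPDATE ON theses BEGIN
    INSERT INTO theses_fts(theses_fts, rowid, title, description, problem_statement)
    VALUES ('delete', old.id, old.title, old.description, old.problem_statement);
    INSERT INTO theses_fts(rowid, title, description, problem_statement)
    VALUES (new.id, new.title, new.description, new.problem_statement);
END;
""")

def search(conn, query):
    """Return (id, title) of matching theses, best match first."""
    return conn.execute(
        "SELECT t.id, t.title FROM theses_fts "
        "JOIN theses t ON t.id = theses_fts.rowid "
        "WHERE theses_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()

# Invented sample rows.
conn.executemany(
    "INSERT INTO theses (title, description, problem_statement) VALUES (?, ?, ?)",
    [
        ("Typography in motion", "Kinetic lettering for screens", None),
        ("Sound walks", "Field recordings of a city", "How does a city sound?"),
    ],
)
hits = search(conn, "lettering")

# After an update, the old title terms should no longer match.
conn.execute("UPDATE theses SET title = 'Listening routes' WHERE id = 2")
old_term_hits = search(conn, "walks")
new_term_hits = search(conn, "listening")
```
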
---

## Migration Strategy

### Phase 1: Database Setup

1. Create SQLite database file
2. Implement schema with all tables
3. Add indexes and constraints
4. Set up full-text search tables
5. Create database access layer (PDO wrapper)

### Phase 2: Data Migration

1. Build migration script to:
   - Read all existing YAML files
   - Parse and validate data
   - Insert into database tables
   - Maintain file path references
   - Create tag associations
   - Generate migration log

2. Verification process:
   - Count validation (YAML count = DB count)
   - Sample data comparison
   - File path integrity check
   - Tag relationship validation

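
The insertion and count-verification steps of Phase 2 could look roughly like this. The sketch uses Python's built-in `sqlite3` module for self-containment (the actual script would more likely be PHP) and assumes the YAML files have already been parsed into dictionaries keyed by the French field names. `FIELD_MAP` follows the two field lists in this document; the `migrate` helper, the simplified tables, the lower-casing of tags, and the sample record are hypothetical.

```python
import sqlite3

# YAML key -> theses column, following the current and proposed field lists.
FIELD_MAP = {
    "auteurice": "author", "année": "year", "email": "email", "titre": "title",
    "promoteurice": "supervisor", "problématique": "problem_statement",
    "description": "description", "orientation": "orientation",
    "ap": "additional_program", "couverture": "cover_image_path",
}

def migrate(conn, records):
    """Insert parsed YAML records in one transaction; return how many were written."""
    with conn:  # commit on success, roll back if any record fails
        for rec in records:
            keys = [k for k in FIELD_MAP if k in rec]
            cols = ", ".join(FIELD_MAP[k] for k in keys)  # names come from FIELD_MAP only
            marks = ", ".join("?" * len(keys))
            cur = conn.execute(
                f"INSERT INTO theses ({cols}) VALUES ({marks})", [rec[k] for k in keys]
            )
            thesis_id = cur.lastrowid
            for name in rec.get("tag", []):
                name = name.strip().lower()  # assumed normalization rule
                conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
                tag_id = conn.execute(
                    "SELECT id FROM tags WHERE name = ?", (name,)
                ).fetchone()[0]
                conn.execute(
                    "INSERT OR IGNORE INTO thesis_tags VALUES (?, ?)", (thesis_id, tag_id)
                )
            for path in rec.get("files", []):
                conn.execute(
                    "INSERT INTO files (thesis_id, file_path) VALUES (?, ?)", (thesis_id, path)
                )
    return len(records)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE theses (
    id INTEGER PRIMARY KEY, author TEXT NOT NULL, year INTEGER NOT NULL, email TEXT,
    title TEXT NOT NULL, supervisor TEXT, problem_statement TEXT, description TEXT NOT NULL,
    orientation TEXT, additional_program TEXT, cover_image_path TEXT
);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE thesis_tags (thesis_id INTEGER, tag_id INTEGER, PRIMARY KEY (thesis_id, tag_id));
CREATE TABLE files (id INTEGER PRIMARY KEY, thesis_id INTEGER, file_path TEXT NOT NULL);
""")

# Placeholder record standing in for one parsed YAML file.
records = [{
    "auteurice": "A. Example", "année": 2024, "titre": "Sample thesis",
    "description": "Placeholder abstract", "tag": ["Video", "video ", "sound"],
    "files": ["content/2024/a-example/thesis.pdf"],
}]

written = migrate(conn, records)
db_count = conn.execute("SELECT COUNT(*) FROM theses").fetchone()[0]
tag_count = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
```

Comparing `written` with `db_count` is the "YAML count = DB count" check. The proposed `files.file_type` column is omitted here because the YAML stores paths only, so a real script would have to derive it, for example from the file extension.
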
### Phase 3: Application Refactoring

**Formulaire Changes:**
1. Replace YAML generation with database INSERT
2. Maintain file storage structure (unchanged)
3. Add database transaction handling
4. Implement tag normalization
5. Add duplicate detection

**Website Changes:**
1. Replace glob/parse operations with SQL queries
2. Implement prepared statements for security
3. Add search functionality
4. Create filtering UI components
5. Implement pagination with SQL LIMIT/OFFSET

### Phase 4: Enhanced Features

1. Search interface (full-text + filters)
2. Tag cloud visualization
3. Year-based browsing
4. Statistics dashboard
5. Related theses suggestions
6. Advanced filtering controls

### Phase 5: Backward Compatibility

1. Keep YAML export functionality
2. Generate YAML on thesis update (optional)
3. Maintain file structure unchanged
4. Preserve URL patterns
5. Create database backup mechanism

---

## Implementation Considerations

### Data Integrity

1. **Validation Rules**
   - Required fields: author, year, title, description
   - Year range validation (1900 to the current year)
   - Email format validation
   - File path existence verification

2. **Constraint Handling**
   - Unique file paths within a thesis
   - Orphan file prevention (foreign keys)
   - Tag name uniqueness (case-insensitive)

3. **Transaction Safety**
   - Atomic thesis creation (metadata + files + tags)
   - Rollback on failure
   - File cleanup on database failure

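
A sketch of how these rules might fit together, again using Python's built-in `sqlite3` module as a stand-in for the project's PHP. It covers the required-field and year-range checks, case-insensitive tag uniqueness via `COLLATE NOCASE`, per-thesis unique file paths, and an all-or-nothing insert. The `validate` and `create_thesis` helpers, the reduced tables, and the sample data are assumptions; email validation, file-existence checks, and file cleanup are left out.

```python
import sqlite3
from datetime import date

REQUIRED = ("author", "year", "title", "description")

def validate(thesis):
    """Return a list of problems; an empty list means the record passes."""
    errors = [f"missing {field}" for field in REQUIRED if not thesis.get(field)]
    year = thesis.get("year")
    if isinstance(year, int) and not 1900 <= year <= date.today().year:
        errors.append("year out of range")
    return errors

def create_thesis(conn, thesis, tags, files):
    """Insert metadata, tags, and file rows atomically; return the new thesis id."""
    errors = validate(thesis)
    if errors:
        raise ValueError("; ".join(errors))
    with conn:  # commits on success, rolls everything back if any statement raises
        cur = conn.execute(
            "INSERT INTO theses (author, year, title, description) VALUES (?, ?, ?, ?)",
            tuple(thesis[field] for field in REQUIRED),
        )
        thesis_id = cur.lastrowid
        for name in tags:
            name = name.strip()
            conn.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (name,))
            tag_id = conn.execute("SELECT id FROM tags WHERE name = ?", (name,)).fetchone()[0]
            conn.execute("INSERT OR IGNORE INTO thesis_tags VALUES (?, ?)", (thesis_id, tag_id))
        for path in files:
            conn.execute(
                "INSERT INTO files (thesis_id, file_path) VALUES (?, ?)", (thesis_id, path)
            )
    return thesis_id

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE theses (id INTEGER PRIMARY KEY, author TEXT NOT NULL, year INTEGER NOT NULL,
                     title TEXT NOT NULL, description TEXT NOT NULL);
-- COLLATE NOCASE makes 'Video' and 'video' collide on the UNIQUE constraint.
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE COLLATE NOCASE);
CREATE TABLE thesis_tags (thesis_id INTEGER, tag_id INTEGER, PRIMARY KEY (thesis_id, tag_id));
CREATE TABLE files (id INTEGER PRIMARY KEY, thesis_id INTEGER, file_path TEXT NOT NULL,
                    UNIQUE (thesis_id, file_path));
""")

sample = {"author": "A. Example", "year": 2024, "title": "Sample", "description": "Placeholder"}
create_thesis(conn, sample, ["Video", "video"], ["thesis.pdf"])
try:
    # The duplicate path violates the UNIQUE constraint, so nothing from this call is kept.
    create_thesis(conn, sample, ["sound"], ["a.pdf", "a.pdf"])
except sqlite3.IntegrityError:
    pass

thesis_count = conn.execute("SELECT COUNT(*) FROM theses").fetchone()[0]
tag_names = [row[0] for row in conn.execute("SELECT name FROM tags")]
```

Note that `NOCASE` only folds ASCII letters, so accented tag names would need normalization in the application before insertion.
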
### Security Improvements

1. **SQL Injection Prevention**
   - PDO prepared statements throughout
   - Never concatenate user input into queries
   - Parameterized queries only

2. **File System Security**
   - Maintain current upload restrictions
   - Database stores sanitized paths only
   - Path traversal prevention

3. **Data Exposure**
   - Email address handling policy
   - Optional field visibility controls
   - Query result limiting

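
The difference between a bound parameter and string concatenation can be shown in a few lines. The sketch uses Python's built-in `sqlite3` module; in PDO the same pattern is `prepare()` followed by `execute()` with bound values. The `by_author` helper, the two-row table, and the hostile input are made up for the demonstration, and the concatenated query appears only as a counter-example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE theses (id INTEGER PRIMARY KEY, author TEXT NOT NULL, title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO theses (author, title) VALUES (?, ?)",
    [("Ada", "First"), ("Bob", "Second")],
)

def by_author(conn, author):
    """Look up titles by author; the value is bound, never parsed as SQL."""
    return conn.execute(
        "SELECT title FROM theses WHERE author = ?", (author,)
    ).fetchall()

hostile = "x' OR '1'='1"

# Bound parameter: the input is compared as a literal string and matches nobody.
safe_rows = by_author(conn, hostile)

# Counter-example, do not do this: concatenation lets the input rewrite the WHERE clause.
leaked_rows = conn.execute(
    "SELECT title FROM theses WHERE author = '" + hostile + "'"
).fetchall()
```

With the bound parameter the hostile string returns no rows, while the concatenated version returns every row in the table.
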
### Performance Optimization

1. **Connection Management**
   - Persistent database connections
   - Connection reuse across concurrent requests (SQLite runs in-process, so there is no server-side pool to manage)
   - Prepared statement caching

2. **Query Optimization**
   - Avoid SELECT * queries
   - Use pagination efficiently
   - Leverage indexes for all searches
   - Cache expensive aggregate queries

3. **File Handling**
   - Database stores paths only (not binary data)
   - File serving remains filesystem-based
   - Consider CDN for static assets

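
Whether a search actually uses an index can be checked with `EXPLAIN QUERY PLAN`. The sketch below, in Python's built-in `sqlite3` module, compares a filter on the indexed `year` column with one on a column that has no index in this reduced table. The `plan` helper and the table are assumptions, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE theses (
    id INTEGER PRIMARY KEY, title TEXT NOT NULL,
    year INTEGER NOT NULL, orientation TEXT
);
CREATE INDEX idx_theses_year ON theses(year DESC);
""")

def plan(conn, sql, params=()):
    """Return the planner's description of how it would run a trusted query string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the detail text

# Selecting named columns instead of SELECT * keeps the result rows small.
indexed = plan(conn, "SELECT id, title FROM theses WHERE year = ?", (2024,))
unindexed = plan(conn, "SELECT id, title FROM theses WHERE orientation = ?", ("x",))
```

A plan that mentions `idx_theses_year` indicates an index search, while a plan starting with `SCAN` indicates a full table scan and a candidate for a new index.
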
### Maintenance Requirements

1. **Database Maintenance**
   - Regular VACUUM operations for database compaction
   - Backup strategy (daily automated backups)
   - Full-text search index updates
   - Periodic ANALYZE for query optimization

2. **Migration Tools**
   - YAML export functionality for portability
   - Database schema versioning
   - Update scripts for schema evolution

3. **Monitoring**
   - Query performance logging
   - Slow query identification
   - Database size monitoring
   - Error logging and alerting

---

## Usability Enhancements

### New User Features

1. **Search Functionality**
   - Free-text search across titles and descriptions
   - Tag-based filtering (AND/OR logic)
   - Year range selection
   - Program/orientation filters
   - Supervisor search
   - Combined multi-criteria searches

2. **Navigation Improvements**
   - Alphabetical author index
   - Chronological timeline view
   - Tag cloud with frequency indicators
   - "Browse by" menus (year, program, supervisor)

3. **Discovery Features**
   - "Similar theses" recommendations
   - Tag-based related content
   - Same supervisor works
   - Recent additions feed
   - Random thesis discovery

4. **Submission Experience**
   - Duplicate detection warnings
   - Tag suggestions from existing tags
   - Form validation with helpful messages
   - Upload progress indicators
   - Preview before submission

### Administrative Features

1. **Content Management**
   - Edit thesis metadata post-submission
   - Bulk tag operations
   - Merge duplicate tags
   - Thesis approval workflow
   - Content moderation tools

2. **Analytics Dashboard**
   - Total theses count
   - Submissions per year graph
   - Most popular tags
   - Program distribution
   - File type statistics
   - Storage usage metrics

3. **Data Export**
   - CSV export for analysis
   - JSON API for integrations
   - YAML backup generation
   - Bulk download capabilities

---

## Performance Projections

### Current System (YAML)
- 13 theses: ~50 ms response time
- 100 theses: ~400 ms estimated
- 1000 theses: ~4000 ms estimated (unusable)
- Memory: Linear growth O(n)

The 100- and 1000-thesis estimates scale the 13-thesis figure linearly, at roughly 4 ms per thesis.

### SQLite System (Projected)
- 13 theses: ~5 ms response time
- 100 theses: ~8 ms response time
- 1000 theses: ~15 ms response time
- 10,000 theses: ~30 ms response time (with proper indexes)
- Memory: Constant per request O(page_size)

### Scalability Metrics

- Current ceiling: about 100 theses before major performance issues
- Post-migration capacity: 10,000+ theses with acceptable performance

---

## Risk Analysis

### Migration Risks

1. **Data Loss**
   - Mitigation: Comprehensive backup before migration
   - Verification: Automated comparison scripts
   - Rollback: Keep YAML files as archive

2. **Downtime**
   - Mitigation: Migrate offline database first
   - Testing: Staging environment validation
   - Deployment: Blue-green deployment strategy

3. **Bug Introduction**
   - Mitigation: Extensive testing suite
   - Validation: User acceptance testing
   - Monitoring: Error tracking system

### Operational Risks

1. **Database Corruption**
   - Mitigation: Regular automated backups
   - Recovery: Point-in-time restore capability
   - Prevention: Write-ahead logging enabled

2. **Concurrency Issues**
   - Mitigation: SQLite WAL mode for better concurrency
   - Testing: Load testing before production
   - Monitoring: Lock timeout tracking

3. **Schema Evolution**
   - Mitigation: Migration script versioning
   - Documentation: Schema change log
   - Testing: Migration testing on copies

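
Two of the operational mitigations above, WAL mode and versioned schema migrations, can be combined in the code that opens the database. The sketch uses Python's built-in `sqlite3` module as a stand-in for the project's PHP. It enables write-ahead logging, sets a busy timeout so that a locked database is retried instead of failing at once, and tracks the applied schema version in `PRAGMA user_version`. The `MIGRATIONS` list, the `open_db` helper, the 5-second timeout, and the file name are assumptions for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical ordered migration steps; the list position is the schema version.
MIGRATIONS = [
    "CREATE TABLE theses (id INTEGER PRIMARY KEY, title TEXT NOT NULL)",
    "ALTER TABLE theses ADD COLUMN external_link TEXT",
]

def open_db(path):
    """Open the database in WAL mode and apply any migrations not yet recorded."""
    # isolation_level=None leaves transaction control to the explicit BEGIN/COMMIT below.
    conn = sqlite3.connect(path, timeout=5.0, isolation_level=None)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for step, sql in enumerate(MIGRATIONS[version:], start=version + 1):
        conn.execute("BEGIN")
        try:
            conn.execute(sql)
            conn.execute(f"PRAGMA user_version = {step}")  # step is an int from enumerate
            conn.execute("COMMIT")
        except sqlite3.Error:
            conn.execute("ROLLBACK")
            raise
    return conn, mode

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "posterg.db")
    conn, mode = open_db(path)   # first open: applies both steps
    conn.close()
    conn, _ = open_db(path)      # second open: nothing left to apply
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    columns = [row[1] for row in conn.execute("PRAGMA table_info(theses)")]
    conn.close()
```
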
---

## Timeline Estimation

### Development Phases

**Phase 1: Database Design & Setup** (1 week)
- Schema finalization
- Database creation scripts
- Index design
- Testing framework setup

**Phase 2: Migration Script Development** (1 week)
- YAML parser improvements
- Database insertion logic
- Validation and verification
- Error handling and logging

**Phase 3: Application Refactoring** (2 weeks)
- Database access layer
- Query optimization
- Form submission updates
- Display page refactoring
- URL structure preservation

**Phase 4: Feature Enhancement** (2 weeks)
- Search implementation
- Filter UI components
- Statistics dashboard
- Related content logic
- Tag management interface

**Phase 5: Testing & Deployment** (1 week)
- Integration testing
- Performance testing
- User acceptance testing
- Staging deployment
- Production migration

**Total Estimated Timeline: 7 weeks**

---

## Success Criteria

### Technical Metrics

1. **Performance**
   - Page load time < 100 ms for gallery view
   - Search response time < 200 ms
   - Support 1000+ theses without degradation

2. **Reliability**
   - 99.9% uptime
   - Zero data loss during migration
   - All URLs preserved or redirected

3. **Functionality**
   - All existing features maintained
   - Search accuracy > 95%
   - Filter combinations work correctly

### User Experience Metrics

1. **Usability**
   - Submission form completion rate > 90%
   - Search usage > 30% of visits
   - Increase in average session duration

2. **Content Discovery**
   - Increase in pages per session
   - Tag-based navigation usage
   - Related thesis click-through rate

3. **Satisfaction**
   - Positive responses in user feedback surveys
   - Error rate < 1%
   - Decrease in form abandonment rate

---

## Post-Migration Roadmap

### Immediate Next Steps

1. Monitor performance metrics
2. Gather user feedback
3. Address any bugs or issues
4. Optimize slow queries
5. Document new workflows

### Future Enhancements

1. **API Development**
   - RESTful API for external access
   - JSON endpoints for integrations
   - Authentication for write operations

2. **Advanced Features**
   - Citation generation tools
   - DOI/persistent identifier integration
   - Version control for thesis updates
   - Collaborative annotations

3. **Integration Opportunities**
   - Library catalog integration (BAIU)
   - ERG website integration
   - Academic search engines
   - Social media sharing

4. **Archival Features**
   - Long-term preservation planning
   - Format migration for obsolete files
   - Redundant backup locations
   - Metadata standards compliance (Dublin Core)

---

## Conclusion

The migration from static YAML files to a SQLite database represents a critical infrastructure improvement for the POSTERG project. The current system, while functional for the initial prototype phase, faces severe scalability limitations that will prevent the project from fulfilling its mission of comprehensive thesis archival and accessibility.

### Key Benefits Summary

1. **Performance**: 10-50x faster page loads, sustainable with 1000+ theses
2. **Functionality**: Full-text search, advanced filtering, statistics, related content
3. **Maintainability**: Structured data, validation, easier debugging and updates
4. **User Experience**: Better discovery, faster browsing, richer information access
5. **Sustainability**: Scalable architecture supporting long-term growth

### Recommendation

Proceed with the SQLite migration as outlined in this analysis. The investment in migration and refactoring will pay immediate dividends in performance and enable the feature enhancements that will make POSTERG a truly valuable resource for the ERG community. The proposed timeline of 7 weeks is realistic and provides adequate time for thorough testing and quality assurance.

The mission of preserving and democratizing access to ERG theses, challenging the traditional library model that prioritizes grades over accessibility, deserves a technical foundation that can grow with the archive. This migration provides that foundation.

---

*This analysis was prepared for the POSTERG project migration initiative, January 2026.*