Test Data & Databases
Sample databases for testing all 5 UBIOS capabilities across every vertical.
UBIOS needs production-quality test data to validate text-to-SQL, RAG retrieval, document extraction, behavioral scoring, and proactive agents. Each database below serves a specific vertical or capability.
Primary: TPC-DS (Multi-Channel Retail)
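TPC-DS is the standard decision-support benchmark and the primary dataset for BI testing. It has 24 tables (7 fact, 17 dimension; 25 if you count the generator's dbgen_version metadata table) in a star schema, plus 99 predefined queries. It maps to IntelliRetail, IntelliTravel, and IntelliSupply by relabeling.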
| Scale Factor | Rows | Size | Use For |
|---|---|---|---|
| SF 1 | ~7M | ~1 GB | Development |
| SF 10 | ~70M | ~10 GB | Customer demos |
| SF 100 | ~570M | ~100 GB | Performance testing |
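Setup: two statements in PostgreSQL, once the tpcds extension is available on the server: CREATE EXTENSION tpcds; CALL tpcds.run(10, 4); The first argument is the scale factor (10 here; the development baseline below uses 1).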
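The scale-factor table maps naturally onto a small helper. The sketch below is illustrative only: the function names tpcds_scale_factor and tpcds_setup_sql and the stage labels dev, demo, and perf are assumptions, and the second argument to tpcds.run (4) is copied unchanged from the call shown above.

```shell
# Hypothetical helper: map a test stage to the TPC-DS scale factor
# listed in the table above (dev = SF 1, demo = SF 10, perf = SF 100).
tpcds_scale_factor() {
  case "$1" in
    dev)  echo 1 ;;
    demo) echo 10 ;;
    perf) echo 100 ;;
    *)    echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

# Print the setup statements for a stage. The second argument to
# tpcds.run is kept at 4, as in the documented call.
tpcds_setup_sql() {
  local sf
  sf="$(tpcds_scale_factor "$1")" || return 1
  printf 'CREATE EXTENSION IF NOT EXISTS tpcds;\nCALL tpcds.run(%s, 4);\n' "$sf"
}
```

Calling tpcds_setup_sql demo prints the two statements for SF 10, which can then be piped into psql.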
All Test Databases
| Database | Domain | Rows | Tables | Best For |
|---|---|---|---|---|
| TPC-DS | Multi-channel retail | 7M–570M | 25 | Text-to-SQL, dashboards, anomalies |
| Synthea | Healthcare (OMOP CDM) | Up to millions of patients | 70+ | IntelliClinic vertical |
| Stack Exchange | Q&A communities | Up to 60M posts | 8 | RAG at scale, user analytics |
| Pagila | DVD rental | ~45K | 15 | IntelliTravel prototyping |
| Northwind | Food import | ~10K | 14 | IntelliSupply prototyping |
| Chinook | Media sales | ~15K | 11 | IntelliMedia, simple joins |
| IMDB | Movie metadata | ~10M titles | 7 | Streaming analytics |
| MovieLens | User ratings | Up to 32M ratings | 3 | Recommendation, behavioral |
| Employee DB | HR | ~3.9M | 6 | IntelliHR, time-series queries |
Quick Setup: Development Baseline
For a fully functional dev environment with meaningful data:
# 1. TPC-DS (primary — most capabilities)
docker exec -it ubios_api bash -c '
PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c "CREATE EXTENSION IF NOT EXISTS tpcds;"
PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c "CALL tpcds.run(1, 4);"
'
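# 2. Chinook (quick media data), loaded into its own chinook schema
wget -q https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_PostgreSql.sql -O /tmp/chinook.sql
# Create the schema first, then set search_path so every unqualified statement
# (CREATE TABLE, ALTER TABLE ... REFERENCES, INSERT) resolves to chinook.
# Prefixing only CREATE TABLE and INSERT INTO with sed would leave the
# foreign-key ALTER TABLE statements pointing at the default schema.
{ echo 'CREATE SCHEMA IF NOT EXISTS chinook; SET search_path TO chinook;'; cat /tmp/chinook.sql; } | \
docker exec -i ubios_api bash -c 'PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios'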
# 3. Employee DB (large HR dataset)
# See docs/testing/12-employee-setup.md for full instructions
By UBIOS Capability
Text-to-SQL
| Database | Why | Complexity |
|---|---|---|
| TPC-DS | Star schema, 99 benchmark queries | High |
| Chinook | Simple joins, aggregations | Low |
| Employee DB | Temporal queries, window functions | Medium |
| IMDB | Text search + ratings joins | Medium |
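One way to use these databases for text-to-SQL validation is to pair a natural-language question with a reference query and compare result sets. The sketch below assumes that approach; the source does not say how UBIOS scores generated SQL. QUESTION, reference_sql, and results_match are hypothetical names. The table and column names are standard TPC-DS ones (store_sales, date_dim, ss_net_paid, d_year, d_moy).

```shell
# Hypothetical test case: a question plus a reference query on TPC-DS.
QUESTION="What was total store revenue for each month?"

# Print the reference SQL that a correct generated query should match.
reference_sql() {
  cat <<'SQL'
SELECT d.d_year, d.d_moy, SUM(ss.ss_net_paid) AS revenue
FROM store_sales AS ss
JOIN date_dim AS d ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_year, d.d_moy
ORDER BY d.d_year, d.d_moy;
SQL
}

# Compare two result files while ignoring row order.
# Succeeds (exit status 0) only when the sorted contents are identical.
results_match() {
  [ "$(sort "$1")" = "$(sort "$2")" ]
}
```

In practice both the generated and the reference query would be run through psql with unaligned, tuples-only output (psql -At) into two files, and results_match would decide pass or fail.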
Knowledge RAG
| Database | Why | Scale |
|---|---|---|
| Stack Exchange | Real Q&A content, structured tags | Up to 60M posts |
| IMDB plot summaries | Unstructured text + structured metadata | ~10M titles |
Document Extraction
Use the document sets in docs/testing/05-document-sets.md: CORD invoices, SEC 10-K filings, GDPR text.
Behavioral Scoring
| Database | Why |
|---|---|
| TPC-DS (returns data) | Customer return patterns |
| MovieLens | User rating behavior over time |
| Employee DB | Salary progression, title changes |
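As an illustration of the salary-progression signal, the sketch below prints a window-function query over the Employee DB. It assumes the original dataset's naming (a salaries table with emp_no, salary, and from_date columns); PostgreSQL ports may rename or schema-qualify these. The function name salary_progression_sql is hypothetical.

```shell
# Print a query that computes each employee's raise between
# consecutive salary records, using LAG over the salary history.
# The first record per employee has no predecessor, so raise is NULL.
salary_progression_sql() {
  cat <<'SQL'
SELECT emp_no,
       from_date,
       salary,
       salary - LAG(salary) OVER (PARTITION BY emp_no ORDER BY from_date) AS raise
FROM salaries
ORDER BY emp_no, from_date;
SQL
}
```

The output can be piped into psql the same way as the setup commands; aggregating the raise column per employee gives a simple progression feature for scoring.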
Proactive Agents
| Database | Why |
|---|---|
| TPC-DS (seasonal data) | Holiday spikes, promotion effectiveness |
| Stack Exchange | Reputation trends, answer quality drift |
Recommended Test Data Plan
Phase 1: Development (this week)
- TPC-DS SF 1 (~7M rows, 1GB) — primary test database
- Chinook (~15K rows) — quick media join testing
- 10 PDF invoices — extraction pipeline test
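After loading the Phase 1 data, a quick row-count check confirms the tables are populated. The helper below is a sketch: row_count_sql is a hypothetical name, and the usage comment assumes the TPC-DS tables are reachable through the default search_path, which depends on how the extension loads them.

```shell
# Build one UNION ALL query that reports the row count of each
# table name passed as an argument.
row_count_sql() {
  local sql="" t
  for t in "$@"; do
    if [ -n "$sql" ]; then
      sql="${sql} UNION ALL "
    fi
    sql="${sql}SELECT '${t}' AS table_name, count(*) AS row_count FROM ${t}"
  done
  printf '%s;\n' "$sql"
}

# Usage (assumes the containers from the quick setup are running):
#   row_count_sql store_sales customer item | \
#     docker exec -i ubios_api bash -c \
#       'PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios'
```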
Phase 2: Customer Demo (before first demo)
- TPC-DS SF 10 (~70M rows, 10GB) — production-scale demo
- DocLayNet-small (804 pages) — extraction showcase
- Synthetic events — behavioral scoring demo
Phase 3: Production Validation
- TPC-DS SF 100 (~570M rows) — full performance benchmark
- Synthea — IntelliClinic vertical validation
- Stack Exchange — large-scale RAG stress test
Detailed setup instructions for each database are in docs/testing/:
- 01-pagila-setup.md: IntelliTravel prototyping
- 02-northwind-setup.md: IntelliSupply prototyping
- 03-synthea-setup.md: IntelliClinic (healthcare)
- 04-tpc-ds-setup.md: Primary production test DB
- 07-stackexchange-setup.md: Knowledge/community analytics
- 10-chinook-setup.md: IntelliMedia (music sales)
- 11-netflix-movie-setup.md: IntelliMedia (streaming)
- 12-employee-setup.md: IntelliHR (large dataset)