Test Data & Databases

Sample databases for testing all 5 UBIOS capabilities across every vertical.

UBIOS needs production-quality test data to validate text-to-SQL, RAG retrieval, document extraction, behavioral scoring, and proactive agents. Each database below serves a specific vertical or capability.

Primary: TPC-DS (Multi-Channel Retail)

A widely used decision-support benchmark and the primary dataset for BI testing here: 25 tables in a star schema with 99 predefined queries. The same schema maps to IntelliRetail, IntelliTravel, and IntelliSupply by relabeling.

| Scale Factor | Rows | Size | Use For |
|---|---|---|---|
| SF 1 | ~7M | ~1 GB | Development |
| SF 10 | ~70M | ~10 GB | Customer demos |
| SF 100 | ~570M | ~100 GB | Performance testing |

Setup in PostgreSQL takes two statements once the tpcds extension is installed on the server: CREATE EXTENSION tpcds; CALL tpcds.run(10, 4);
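
The scale-factor table and the setup statements can be wrapped in a small helper. This is a minimal sketch, not part of UBIOS: the function names `tpcds_scale_for` and `tpcds_setup_sql` are hypothetical, and it assumes that the first argument of tpcds.run is the scale factor (inferred from the SF 1 and SF 10 calls in this guide). The second argument is kept at 4, as in the guide, without interpreting it.

```shell
# Hypothetical helpers: map a use case from the scale-factor table to a
# scale factor and print the matching setup SQL. Nothing is executed here.

tpcds_scale_for() {
  case "$1" in
    development) echo 1 ;;
    demo)        echo 10 ;;
    performance) echo 100 ;;
    *) echo "unknown use case: $1" >&2; return 1 ;;
  esac
}

tpcds_setup_sql() {
  local sf
  sf="$(tpcds_scale_for "$1")" || return 1
  # Assumption: the first argument of tpcds.run is the scale factor.
  # The second argument (4) is copied from this guide unchanged.
  printf 'CREATE EXTENSION IF NOT EXISTS tpcds;\nCALL tpcds.run(%s, 4);\n' "$sf"
}
```

The printed SQL can be piped into psql, for example `tpcds_setup_sql demo | psql -U postgres -d ubios`, which keeps the scale factor choice in one place.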

All Test Databases

| Database | Domain | Rows | Tables | Best For |
|---|---|---|---|---|
| TPC-DS | Multi-channel retail | 7M–570M | 25 | Text-to-SQL, dashboards, anomalies |
| Synthea | Healthcare (OMOP CDM) | Up to millions of patients | 70+ | IntelliClinic vertical |
| Stack Exchange | Q&A communities | Up to 60M posts | 8 | RAG at scale, user analytics |
| Pagila | DVD rental | ~45K | 15 | IntelliTravel prototyping |
| Northwind | Food import | ~10K | 14 | IntelliSupply prototyping |
| Chinook | Media sales | ~15K | 11 | IntelliMedia, simple joins |
| IMDB | Movie metadata | ~10M titles | 7 | Streaming analytics |
| MovieLens | User ratings | Up to 32M ratings | 3 | Recommendation, behavioral |
| Employee DB | HR | ~3.9M | 6 | IntelliHR, time-series queries |

Quick Setup: Development Baseline

For a fully functional dev environment with meaningful data:

# 1. TPC-DS (primary — most capabilities)
docker exec -it ubios_api bash -c '
  PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c "CREATE EXTENSION IF NOT EXISTS tpcds;"
  PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios -c "CALL tpcds.run(1, 4);"
'

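# 2. Chinook (quick media data)
wget -q https://raw.githubusercontent.com/lerocha/chinook-database/master/ChinookDatabase/DataSources/Chinook_PostgreSql.sql -O /tmp/chinook.sql
# Create the chinook schema first; the rewritten CREATE TABLE statements fail without it.
# Only CREATE TABLE and INSERT INTO are rewritten; other statements that name tables are left as-is.
{ echo 'CREATE SCHEMA IF NOT EXISTS chinook;'
  sed 's/CREATE TABLE /CREATE TABLE chinook./g; s/INSERT INTO /INSERT INTO chinook./g' /tmp/chinook.sql; } | \
  docker exec -i ubios_api bash -c 'PGPASSWORD=331331331 psql -U postgres -h ubios_postgres -d ubios'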

# 3. Employee DB (large HR dataset)
# See docs/testing/12-employee-setup.md for full instructions
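
After the load steps above, a quick row-count check helps confirm that data actually arrived. The following is a sketch under assumptions: it reuses the container, host, and database names from the commands above, the function names `ubios_rowcount_sql` and `ubios_check_load` are hypothetical, and the password is read from a PGPASSWORD variable exported in your shell. It queries PostgreSQL's pg_stat_user_tables statistics view, whose n_live_tup values are estimates rather than exact counts.

```shell
# Hypothetical helper: print a query listing the largest user tables.
ubios_rowcount_sql() {
  cat <<'SQL'
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC
LIMIT 20;
SQL
}

# Runs the query through the ubios_api container used in this guide.
# Assumes PGPASSWORD is exported in the calling shell.
ubios_check_load() {
  ubios_rowcount_sql | docker exec -i -e PGPASSWORD="$PGPASSWORD" ubios_api \
    psql -U postgres -h ubios_postgres -d ubios
}
```

Calling `ubios_check_load` after step 1 should list the TPC-DS tables near the top; an empty result suggests the load did not run against this database.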

By UBIOS Capability

Text-to-SQL

| Database | Why | Complexity |
|---|---|---|
| TPC-DS | Star schema, 99 benchmark queries | High |
| Chinook | Simple joins, aggregations | Low |
| Employee DB | Temporal queries, window functions | Medium |
| IMDB | Text search + ratings joins | Medium |
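
One common way to exercise text-to-SQL is to pair a natural-language question with a reference query and compare result sets. The sketch below shows one such pair against TPC-DS. The question wording and the function names are illustrative, not part of UBIOS. The table and column names (store_sales, date_dim, ss_sold_date_sk, ss_net_paid, d_date_sk, d_year) follow the TPC-DS specification; the schema the extension loads them into is not stated in this guide, so search_path may need adjusting.

```shell
# Hypothetical test case: one question plus a reference query for it.
t2s_question() {
  echo "What was total store net revenue per year?"
}

t2s_reference_sql() {
  cat <<'SQL'
SELECT d.d_year, SUM(ss.ss_net_paid) AS net_revenue
FROM store_sales AS ss
JOIN date_dim AS d ON ss.ss_sold_date_sk = d.d_date_sk
GROUP BY d.d_year
ORDER BY d.d_year;
SQL
}

# Compare two result files while ignoring row order.
# Returns 0 when the sorted contents match, non-zero otherwise.
t2s_same_result() {
  [ "$(sort "$1")" = "$(sort "$2")" ]
}
```

Comparing sorted output keeps the check tolerant of generated SQL that omits ORDER BY; a question that explicitly asks for ordering would need a stricter comparison.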

Knowledge RAG

| Database | Why | Scale |
|---|---|---|
| Stack Exchange | Real Q&A content, structured tags | Up to 60M posts |
| IMDB plot summaries | Unstructured text + structured metadata | ~10M titles |

Document Extraction

Use the document sets in docs/testing/05-document-sets.md: CORD invoices, SEC 10-K filings, GDPR text.

Behavioral Scoring

| Database | Why |
|---|---|
| TPC-DS (returns data) | Customer return patterns |
| MovieLens | User rating behavior over time |
| Employee DB | Salary progression, title changes |
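
As an illustration of what "customer return patterns" could look like as input to scoring, the sketch below prints a per-customer aggregate over the TPC-DS store_returns table. The function name `returns_feature_sql` and the choice of features are assumptions, not the UBIOS scoring method; the column names sr_customer_sk and sr_return_amt follow the TPC-DS specification.

```shell
# Hypothetical feature query: return count and returned amount per customer.
returns_feature_sql() {
  cat <<'SQL'
SELECT sr_customer_sk,
       COUNT(*)           AS return_count,
       SUM(sr_return_amt) AS returned_amount
FROM store_returns
WHERE sr_customer_sk IS NOT NULL
GROUP BY sr_customer_sk
ORDER BY return_count DESC
LIMIT 100;
SQL
}
```

The output can be piped into psql in the same way as the setup commands in this guide.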

Proactive Agents

| Database | Why |
|---|---|
| TPC-DS (seasonal data) | Holiday spikes, promotion effectiveness |
| Stack Exchange | Reputation trends, answer quality drift |
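
To make "holiday spikes" concrete, the sketch below flags months whose total stands well above the average. It is only an illustration of the kind of signal a proactive agent might watch for: the function name `flag_spikes`, the `month,total` input format, and the default factor of 1.5 are all assumptions rather than UBIOS behavior.

```shell
# Hypothetical spike check. Input: lines of "month,total" on stdin.
# Prints each month whose total exceeds factor * mean (factor defaults to 1.5).
flag_spikes() {
  awk -F, -v factor="${1:-1.5}" '
    { month[NR] = $1; total[NR] = $2; sum += $2 }
    END {
      if (NR == 0) exit
      mean = sum / NR
      for (i = 1; i <= NR; i++)
        if (total[i] > factor * mean) print month[i]
    }'
}
```

For example, `printf '10,100\n11,120\n12,400\n' | flag_spikes` prints `12`, because 400 is the only total above 1.5 times the mean of the three values.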

Phase 1: Development (this week)

  1. TPC-DS SF 1 (~7M rows, 1GB) — primary test database
  2. Chinook (~15K rows) — quick media join testing
  3. 10 PDF invoices — extraction pipeline test

Phase 2: Customer Demo (before first demo)

  1. TPC-DS SF 10 (~70M rows, 10GB) — production-scale demo
  2. DocLayNet-small (804 pages) — extraction showcase
  3. Synthetic events — behavioral scoring demo

Phase 3: Production Validation

  1. TPC-DS SF 100 (~570M rows) — full performance benchmark
  2. Synthea — IntelliClinic vertical validation
  3. Stack Exchange — large-scale RAG stress test

Detailed setup instructions for each database are in docs/testing/:

  • 01-pagila-setup.md — IntelliTravel prototyping
  • 02-northwind-setup.md — IntelliSupply prototyping
  • 03-synthea-setup.md — IntelliClinic (healthcare)
  • 04-tpc-ds-setup.md — Primary production test DB
  • 07-stackexchange-setup.md — Knowledge/community analytics
  • 10-chinook-setup.md — IntelliMedia (music sales)
  • 11-netflix-movie-setup.md — IntelliMedia (streaming)
  • 12-employee-setup.md — IntelliHR (large dataset)