Core Concepts

Knowledge Pipeline

How documents flow from upload to searchable embeddings.

The knowledge pipeline converts uploaded documents into searchable embeddings that power RAG answers via the KnowledgeAgent.

Pipeline Stages

Upload document (PDF, DOCX, etc.)
  └─► Store file in MinIO
      └─► Dify picks up the document
            ├─ Docling (local, text-based PDFs) → text chunks
            └─ Mistral OCR 3 (API, scanned PDFs) → text chunks
                  └─► Embed chunks via configured embedding model
                        └─► Store in tenant_knowledge.document_chunks (pgvector)
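The parser choice at the branch above can be sketched as a routing function. The names here are illustrative, not Dify's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Document:
    name: str
    has_text_layer: bool  # True for born-digital PDFs/DOCX, False for scans


def pick_parser(doc: Document) -> str:
    """Route a document to a parser: Docling handles text-based files
    locally, while scanned files go out to the Mistral OCR API."""
    return "docling" if doc.has_text_layer else "mistral_ocr"
```

For example, `pick_parser(Document("scan.pdf", has_text_layer=False))` routes to the OCR path.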

Components

Dify

Manages the entire pipeline: document intake, chunking strategy, embedding generation, and storage. Each knowledge source maps to a Dify dataset (dify_dataset_id in knowledge_sources table).

Dify runs as three Docker services:

  • dify-api (port 5001) — API endpoint
  • dify-worker — Celery background job processor
  • dify-web (port 3080) — Admin UI for managing datasets

Docling

Local document parser for text-based PDFs and DOCX files. Runs as a service accessible at DOCLING_SERVICE_URL (typically http://host.docker.internal:8010).

Mistral OCR 3

API-based OCR for scanned documents and images. Requires MISTRAL_OCR_API_KEY. Without it, only text-based documents work.
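Because the OCR path silently depends on that key, a fail-fast guard makes the dependency explicit. This helper is hypothetical, not part of the actual codebase:

```python
import os


def ensure_ocr_available(doc_is_scanned, api_key=None):
    """Raise early if a scanned document arrives but no OCR key is configured.

    api_key falls back to the MISTRAL_OCR_API_KEY environment variable.
    """
    key = api_key if api_key is not None else os.environ.get("MISTRAL_OCR_API_KEY")
    if doc_is_scanned and not key:
        raise RuntimeError(
            "MISTRAL_OCR_API_KEY is not set; scanned documents cannot be indexed"
        )
```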

pgvector

Embeddings are stored in tenant_knowledge.document_chunks using PostgreSQL's pgvector extension. Each chunk includes:

  Column                 Purpose
  embedding              Vector (dimensions depend on model)
  embedding_model        Model name used (for versioning)
  embedding_dimensions   Vector size (e.g., 1536 for OpenAI)
  content                Raw text content
  metadata               Source document, page number, chunk index
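A similarity lookup over this table might look like the following sketch. It uses pgvector's `<=>` cosine-distance operator; the table and column names come from the schema above, but the query shape is an assumption, not the exact query Dify issues:

```python
def top_k_chunks_query(k: int = 5) -> str:
    """Build a parameterized pgvector similarity query over
    tenant_knowledge.document_chunks.

    The query vector is bound at execution time (e.g., via psycopg's
    %(query_vec)s placeholder); `<=>` is pgvector's cosine distance.
    """
    return (
        "SELECT content, metadata, embedding <=> %(query_vec)s AS distance "
        "FROM tenant_knowledge.document_chunks "
        "ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {int(k)}"
    )
```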

Embedding Model Versioning

The embedding_model and embedding_dimensions columns track which model produced each embedding. Changing the embedding model requires batch re-embedding of all existing chunks — old embeddings are incompatible with the new model's vector space.
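A re-embedding pass can be sketched as follows. The helper names and chunk shape are illustrative; a real migration would read and write tenant_knowledge.document_chunks in batches:

```python
from typing import Callable


def needs_reembedding(chunk_model: str, chunk_dims: int,
                      current_model: str, current_dims: int) -> bool:
    """A chunk is stale if either the model name or the vector size
    no longer matches the configured embedding model."""
    return chunk_model != current_model or chunk_dims != current_dims


def reembed(chunks: list, embed: Callable, model: str) -> list:
    """Re-embed every chunk: old vectors are incompatible with the
    new model's vector space, so all of them must be regenerated."""
    for chunk in chunks:
        vec = embed(chunk["content"])
        chunk.update(embedding=vec, embedding_model=model,
                     embedding_dimensions=len(vec))
    return chunks
```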

KnowledgeAgent (RAG)

When a user asks a knowledge question:

  1. Question is embedded using the same model as document chunks
  2. pgvector similarity search finds the top-k matching chunks
  3. Chunks + question are passed to the LLM as context
  4. LLM generates a grounded answer with source citations
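The four steps above can be sketched as one function with the embedding model, vector search, and LLM injected as callables. All names and the prompt format are illustrative:

```python
from typing import Callable


def answer_question(question: str,
                    embed: Callable,    # question text -> vector
                    search: Callable,   # (vector, k) -> top-k chunk dicts
                    llm: Callable,      # prompt -> answer text
                    k: int = 5) -> str:
    """RAG flow sketch: embed the question, retrieve similar chunks,
    then ask the LLM with those chunks as grounding context."""
    # 1. Embed the question with the SAME model used for the chunks.
    query_vec = embed(question)
    # 2. pgvector similarity search returns the top-k matching chunks.
    chunks = search(query_vec, k)
    # 3.+4. Pass chunks + question to the LLM; sources enable citations.
    context = "\n\n".join(
        f"[{c['metadata']['source']}] {c['content']}" for c in chunks
    )
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```

Injecting the three callables keeps the flow testable and backend-agnostic; swapping the embedding model only requires swapping `embed` (plus re-embedding the chunks, as noted above).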

Knowledge Sources

Each source tracks its status:

  Status     Meaning
  ready      All documents indexed and searchable
  indexing   Currently processing documents
  error      Indexing failed, needs attention

Databases

Dify uses its own PostgreSQL database (ubios_dify) for internal state. Document chunks are stored in the tenant's tenant_knowledge schema, as described above.