Core Concepts
Knowledge Pipeline
How documents flow from upload to searchable embeddings.
The knowledge pipeline converts uploaded documents into searchable embeddings that power RAG answers via the KnowledgeAgent.
Pipeline Stages
Upload document (PDF, DOCX, etc.)
└─► Store file in MinIO
└─► Dify picks up the document
├─ Docling (local, text-based PDFs) → text chunks
└─ Mistral OCR 3 (API, scanned PDFs) → text chunks
└─► Embed chunks via configured embedding model
└─► Store in tenant_knowledge.document_chunks (pgvector)
Components
Dify
Manages the entire pipeline: document intake, chunking strategy, embedding generation, and storage. Each knowledge source maps to a Dify dataset (dify_dataset_id in knowledge_sources table).
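The chunking step can be pictured as splitting the extracted text into overlapping windows. The sketch below is illustrative only; Dify's actual chunking strategy and sizes are configured per dataset, and the defaults here are assumptions.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly `size` characters.

    Overlap keeps context that straddles a chunk boundary retrievable
    from either neighboring chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Each resulting chunk is then embedded and stored as its own row, carrying metadata such as page number and chunk index.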
Dify runs as three Docker services:
- dify-api (port 5001) — API endpoint
- dify-worker — Celery background job processor
- dify-web (port 3080) — Admin UI for managing datasets
Docling
Local document parser for text-based PDFs and DOCX files. Runs as a service accessible at DOCLING_SERVICE_URL (typically http://host.docker.internal:8010).
Mistral OCR 3
API-based OCR for scanned documents and images. Requires MISTRAL_OCR_API_KEY. Without it, only text-based documents work.
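The choice between the two parsers comes down to whether the PDF has a text layer and whether an OCR key is configured. A minimal routing sketch, assuming a hypothetical `choose_parser` helper and that text-layer detection happens upstream:

```python
import os

def choose_parser(has_text_layer: bool) -> str:
    """Route a document to the right parser (illustrative; names are hypothetical).

    Text-based documents go to the local Docling service; scanned documents
    need Mistral OCR, which is only usable when MISTRAL_OCR_API_KEY is set.
    """
    if has_text_layer:
        return "docling"
    if os.environ.get("MISTRAL_OCR_API_KEY"):
        return "mistral-ocr"
    raise RuntimeError("scanned document, but MISTRAL_OCR_API_KEY is not set")
```

This mirrors the constraint above: without the API key, only text-based documents can be processed.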
pgvector
Embeddings are stored in tenant_knowledge.document_chunks using PostgreSQL's pgvector extension. Each chunk includes:
| Column | Purpose |
|---|---|
| embedding | Vector (dimensions depend on model) |
| embedding_model | Model name used (for versioning) |
| embedding_dimensions | Vector size (e.g., 1536 for OpenAI) |
| content | Raw text content |
| metadata | Source document, page number, chunk index |
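An in-memory mirror of a chunk row makes the invariant between the columns explicit: the stored dimension count must match the vector itself. This dataclass is a sketch for illustration, not the actual storage code.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentChunk:
    """Illustrative mirror of a tenant_knowledge.document_chunks row."""
    content: str
    embedding: list[float]
    embedding_model: str
    embedding_dimensions: int
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # embedding_dimensions exists for versioning; it must agree
        # with the actual vector length.
        if len(self.embedding) != self.embedding_dimensions:
            raise ValueError("embedding length does not match embedding_dimensions")
```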
Embedding Model Versioning
The embedding_model and embedding_dimensions columns track which model produced each embedding. Changing the embedding model requires batch re-embedding of all existing chunks — old embeddings are incompatible with the new model's vector space.
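A batch re-embedding pass can be sketched as below. The function and the `embed` callable are hypothetical; the point is that every chunk not yet on the new model gets a fresh vector, since vectors from different models live in incompatible spaces.

```python
def reembed_all(chunks: list[dict], new_model: str, new_dims: int, embed) -> list[dict]:
    """Re-embed every chunk with the new model (illustrative sketch).

    `embed` is a hypothetical callable mapping text -> vector for the
    new model. Old vectors are discarded outright: mixing vector spaces
    would silently corrupt similarity search results.
    """
    for chunk in chunks:
        if chunk["embedding_model"] == new_model:
            continue  # already migrated
        chunk["embedding"] = embed(chunk["content"])
        chunk["embedding_model"] = new_model
        chunk["embedding_dimensions"] = new_dims
    return chunks
```

Skipping chunks already on the new model makes the pass safe to resume after an interruption.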
KnowledgeAgent (RAG)
When a user asks a knowledge question:
- The question is embedded using the same model that embedded the document chunks
- A pgvector similarity search finds the top-k matching chunks
- The chunks and the question are passed to the LLM as context
- The LLM generates a grounded answer with source citations
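The retrieval step above can be shown in miniature with an in-memory cosine-similarity ranking. In production the ranking happens inside PostgreSQL via pgvector's distance operators; this sketch only illustrates the idea.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(question_vec: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the question vector."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine(question_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:k]
```

The returned chunks, together with the question, form the context passed to the LLM.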
Knowledge Sources
Each source tracks its status:
| Status | Meaning |
|---|---|
| ready | All documents indexed and searchable |
| indexing | Currently processing documents |
| error | Indexing failed, needs attention |
Databases
Dify uses its own PostgreSQL database (ubios_dify) for internal state. Document chunks are stored in the tenant's ubios_knowledge schema.