Core Concepts

Knowledge Pipeline

How documents flow from upload to searchable embeddings.

The knowledge pipeline converts uploaded documents into searchable embeddings that power RAG answers via the KnowledgeAgent.

Pipeline Stages

Upload document (PDF, DOCX, etc.)
  └─► Store file in MinIO
      └─► Dify picks up the document
            ├─ Docling (local, text-based PDFs) → text chunks
            └─ Mistral OCR 3 (API, scanned PDFs) → text chunks
                  └─► Embed chunks via configured embedding model
                        └─► Store in tenant_knowledge.document_chunks (pgvector)
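The parser choice at the branch above can be sketched as a routing function. The names here are illustrative, not Dify's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Document:
    name: str
    has_text_layer: bool  # True for born-digital PDFs/DOCX, False for scans


def pick_parser(doc: Document) -> str:
    """Route a document to a parser: Docling handles text-based files
    locally, while scanned files go out to the Mistral OCR API."""
    return "docling" if doc.has_text_layer else "mistral_ocr"
```

For example, `pick_parser(Document("scan.pdf", has_text_layer=False))` routes to the OCR path.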

Components

Dify

Manages the entire pipeline: document intake, chunking strategy, embedding generation, and storage. Each knowledge source maps to a Dify dataset (dify_dataset_id in knowledge_sources table).

Dify runs as three Docker services:

  • dify-api (port 5001) — API endpoint
  • dify-worker — Celery background job processor
  • dify-web (port 3080) — Admin UI for managing datasets

Docling

Local document parser for text-based PDFs and DOCX files. Runs as a service accessible at DOCLING_SERVICE_URL (typically http://host.docker.internal:8010).

Mistral OCR 3

API-based OCR for scanned documents and images. Requires MISTRAL_OCR_API_KEY. Without it, only text-based documents work.
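Because the OCR path silently depends on that key, a fail-fast guard makes the dependency explicit. This helper is hypothetical, not part of the actual codebase:

```python
import os


def ensure_ocr_available(doc_is_scanned, api_key=None):
    """Raise early if a scanned document arrives but no OCR key is configured.

    api_key falls back to the MISTRAL_OCR_API_KEY environment variable.
    """
    key = api_key if api_key is not None else os.environ.get("MISTRAL_OCR_API_KEY")
    if doc_is_scanned and not key:
        raise RuntimeError(
            "MISTRAL_OCR_API_KEY is not set; scanned documents cannot be indexed"
        )
```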

pgvector

Embeddings are stored in tenant_knowledge.document_chunks using PostgreSQL's pgvector extension. Each chunk includes:

  Column                 Purpose
  embedding              Vector (dimensions depend on model)
  embedding_model        Model name used (for versioning)
  embedding_dimensions   Vector size (e.g., 1536 for OpenAI)
  content                Raw text content
  metadata               Source document, page number, chunk index
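A similarity lookup over this table might look like the following sketch. It uses pgvector's `<=>` cosine-distance operator; the table and column names come from the schema above, but the query shape is an assumption, not the exact query Dify issues:

```python
def top_k_chunks_query(k: int = 5) -> str:
    """Build a parameterized pgvector similarity query over
    tenant_knowledge.document_chunks.

    The query vector is bound at execution time (e.g., via psycopg's
    %(query_vec)s placeholder); `<=>` is pgvector's cosine distance.
    """
    return (
        "SELECT content, metadata, embedding <=> %(query_vec)s AS distance "
        "FROM tenant_knowledge.document_chunks "
        "ORDER BY embedding <=> %(query_vec)s "
        f"LIMIT {int(k)}"
    )
```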

Embedding Model Versioning

The embedding_model and embedding_dimensions columns track which model produced each embedding. Changing the embedding model requires batch re-embedding of all existing chunks — old embeddings are incompatible with the new model's vector space.
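A re-embedding pass can be sketched as follows. The helper names and chunk shape are illustrative; a real migration would read and write tenant_knowledge.document_chunks in batches:

```python
from typing import Callable


def needs_reembedding(chunk_model: str, chunk_dims: int,
                      current_model: str, current_dims: int) -> bool:
    """A chunk is stale if either the model name or the vector size
    no longer matches the configured embedding model."""
    return chunk_model != current_model or chunk_dims != current_dims


def reembed(chunks: list, embed: Callable, model: str) -> list:
    """Re-embed every chunk: old vectors are incompatible with the
    new model's vector space, so all of them must be regenerated."""
    for chunk in chunks:
        vec = embed(chunk["content"])
        chunk.update(embedding=vec, embedding_model=model,
                     embedding_dimensions=len(vec))
    return chunks
```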

KnowledgeAgent (RAG)

When a user asks a knowledge question:

  1. Question is embedded using the same model as document chunks
  2. pgvector similarity search finds the top-k matching chunks
  3. Chunks + question are passed to the LLM as context
  4. LLM generates a grounded answer with source citations
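The four steps above can be sketched as one function with the embedding model, vector search, and LLM injected as callables. All names and the prompt format are illustrative:

```python
from typing import Callable


def answer_question(question: str,
                    embed: Callable,    # question text -> vector
                    search: Callable,   # (vector, k) -> top-k chunk dicts
                    llm: Callable,      # prompt -> answer text
                    k: int = 5) -> str:
    """RAG flow sketch: embed the question, retrieve similar chunks,
    then ask the LLM with those chunks as grounding context."""
    # 1. Embed the question with the SAME model used for the chunks.
    query_vec = embed(question)
    # 2. pgvector similarity search returns the top-k matching chunks.
    chunks = search(query_vec, k)
    # 3.+4. Pass chunks + question to the LLM; sources enable citations.
    context = "\n\n".join(
        f"[{c['metadata']['source']}] {c['content']}" for c in chunks
    )
    return llm(f"Context:\n{context}\n\nQuestion: {question}")
```

Injecting the three callables keeps the flow testable and backend-agnostic; swapping the embedding model only requires swapping `embed` (plus re-embedding the chunks, as noted above).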

Knowledge Sources

Each source tracks its status:

  Status     Meaning
  ready      All documents indexed and searchable
  indexing   Currently processing documents
  error      Indexing failed, needs attention

Databases

Dify uses its own PostgreSQL database (ubios_dify) for internal state. Document chunks are stored in the tenant's tenant_knowledge schema, as described above.