RAG Document Ingestion Tutorial

Documentation status

Validation: IN PROGRESS – Based on the rag_document_ingestion DAG and current RAG integration.

Last reviewed: 2025-11-21

Community: If you successfully ingest your own docs (or hit issues), please update this tutorial via Contributing to docs.

This tutorial shows how to ingest your own documentation into the RAG (Retrieval‑Augmented Generation) system used by the Qubinode Navigator AI Assistant.

You will:

Prepare a directory of documents.
Configure the RAG ingestion path.
Run the rag_document_ingestion DAG in Airflow.
Verify that new content is available to the AI Assistant.

1. Prerequisites

Before you begin, you should have:

Qubinode Navigator + Airflow deployed, for example via:
- Getting Started with Apache Airflow for Qubinode Navigator
Access to the Airflow UI (via nginx, as configured in deploy-qubinode-with-airflow.sh).
The RAG integration enabled and the AI Assistant running (see:
- Airflow ↔ RAG Bidirectional Learning
- Airflow Community Ecosystem

2. Prepare Your Documents

Decide which documents you want the AI Assistant to be able to search.

Typical formats:

Markdown (.md)
Text (.txt)
Possibly other formats, depending on your RAG ingestion pipeline configuration.

On the host where Airflow is running, create or use a directory for incoming docs, for example:

sudo mkdir -p /opt/documents/incoming
sudo chown $(whoami) /opt/documents/incoming

Copy or sync your docs into this directory, for example:

cp -r ~/my-docs/*.md /opt/documents/incoming/

The example rag_document_ingestion DAG in airflow-community-ecosystem.md assumes a directory like /opt/documents/incoming.

3. Review the `rag_document_ingestion` DAG

In the repository, the RAG ingestion DAG is described in:

Airflow Community Ecosystem – section “RAG Workflow Integration” and the rag_document_ingestion example.

A simplified version of the DAG flow is:

Scan for new documents in the incoming directory.
Chunk documents into smaller pieces.
Generate embeddings for each chunk.
Store embeddings and text in the vector database.
Notify the AI Assistant that new chunks are available.

Ensure the paths in the DAG match your actual document directory.

4. Run the Ingestion DAG from Airflow UI

Open the Airflow UI in your browser (via nginx):
- http://YOUR_HOST_IP/
In the DAGs list, locate:
- rag_document_ingestion (or your equivalent RAG ingestion DAG).
If the DAG is paused, unpause it.
Trigger a manual run:
- Click the Play button.
- Optionally supply configuration via the run dialog if the DAG expects parameters (e.g., custom document path).
Monitor the run:
- View the Graph or Tree for task statuses.
- Inspect task logs for:
  - Number of documents found.
  - Number of chunks generated.
  - Any errors in embedding generation or vector DB writes.

5. Verify RAG Has the New Documents

After a successful DAG run:

Use the AI Assistant (chat interface) and ask questions that should be answerable from your new docs.
If the RAG system exposes any diagnostic endpoints or CLI tools, use them to:
- List collections.
- Confirm new vectors have been added.

If answers dont reflect new content:

Re-check:
- The document directory path in the DAG.
- File permissions for the Airflow container.
- Any filtering logic in the ingestion code.

6. Keeping RAG Updated

You can keep RAG in sync by:

Scheduling the rag_document_ingestion DAG to run periodically.
Using a separate DAG for incremental updates (see examples in airflow-community-ecosystem.md).

For more advanced RAG workflows (incremental updates, quality monitoring, etc.), see:

7. Contribute Improvements

If you:

Add support for new file types,
Improve chunking/embedding performance, or
Create additional RAG workflows,

please consider:

Adding notes or examples to this tutorial.
Contributing new DAG examples to the repository.