Ingest knowledge

Upload documents to your OpenRAG OpenSearch knowledge base to populate it with your own content, such as company documents, research papers, or websites.

OpenRAG can ingest knowledge from direct file uploads, URLs, and cloud storage connectors.

Knowledge ingestion is powered by OpenRAG's built-in ingestion flows, which use Docling to process documents before storing them in your OpenSearch knowledge base.

During ingestion, documents are broken into smaller chunks of content. Embeddings are generated for each chunk so the documents can be retrieved through similarity search during chat, which is a standard retrieval augmented generation (RAG) pattern. The chunks, embeddings, and associated metadata (which connect chunks of the same document) are stored in your OpenSearch knowledge base.

To modify chunking behavior, embedding model, and other ingestion settings, see Configure ingestion.
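The chunk-embed-store pattern described above can be sketched as follows. This is a hypothetical illustration of the standard RAG ingestion pattern, not OpenRAG's actual implementation; the chunk size, overlap, and `embed()` placeholder are all assumptions.

```python
# Hypothetical sketch of the chunk-embed-store RAG ingestion pattern.
# Chunk size, overlap, and embed() are illustrative stand-ins, not
# OpenRAG's actual implementation.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size characters."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

def embed(chunk: str) -> list[float]:
    """Placeholder: a real system calls an embedding model here."""
    return [float(len(chunk))]  # stand-in vector

# Each chunk is stored with its embedding and metadata linking it back to
# the source document, enabling similarity search during chat.
document = "..."
records = [
    {"text": c, "embedding": embed(c), "metadata": {"source": "example.pdf"}}
    for c in chunk_text(document)
]
```

The overlap ensures that sentences spanning a chunk boundary remain retrievable from at least one chunk.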

Ingest local files and folders

You can upload files and folders from your local machine to your knowledge base:

  1. Click Knowledge to view your OpenSearch knowledge base.

  2. Click Add Knowledge to add your own documents to your OpenRAG knowledge base.

  3. To upload one file, click File. To upload all documents in a folder, click Folder.

    The default path is ~/.openrag/documents. To change this path, see Set the local documents path.

The selected files are processed in the background through the OpenSearch Ingestion flow.

About the OpenSearch Ingestion flow

When you upload documents locally or with cloud storage connectors, the OpenSearch Ingestion flow runs in the background. By default, this flow uses Docling Serve to import and process documents.

Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it if you want to change the knowledge ingestion settings.

The OpenSearch Ingestion flow consists of several components that work together to process and store documents in your knowledge base:

  • Docling Serve component: Ingests files and processes them by connecting to OpenRAG's local Docling Serve service. The output is DoclingDocument data that contains the extracted text and metadata from the documents.

  • Export DoclingDocument component: Exports processed DoclingDocument data to Markdown format with image placeholders. This conversion standardizes the document data in preparation for further processing.

  • DataFrame Operations component: Three of these components run sequentially to add metadata to the document data: filename, file_size, and mimetype.

  • Split Text component: Splits the processed text into chunks, based on the configured chunk size and overlap settings.

  • Secret Input component: If needed, four of these components securely fetch the OAuth authentication configuration variables: CONNECTOR_TYPE, OWNER, OWNER_EMAIL, and OWNER_NAME.

  • Create Data component: Combines the authentication credentials from the Secret Input components into a structured data object that is associated with the document embeddings.

  • Embedding Model component: Generates vector embeddings using your selected embedding model.

  • OpenSearch component: Stores the processed documents and their embeddings in your OpenRAG OpenSearch knowledge base.

    The default address for the OpenSearch knowledge base is https://localhost:9200. To change the port, edit the OPENSEARCH_PORT environment variable.

    The default authentication method is JSON Web Token (JWT) authentication. If you edit the flow, you can select basic auth mode, which uses the OPENSEARCH_USERNAME and OPENSEARCH_PASSWORD environment variables for authentication instead of JWT.
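The connection and authentication settings above could be assembled from their environment variables as in the following sketch. The variable names (OPENSEARCH_PORT, OPENSEARCH_USERNAME, OPENSEARCH_PASSWORD) come from this document; the helper function itself is hypothetical and not part of OpenRAG.

```python
import os

# Hypothetical helper assembling OpenSearch connection settings from the
# environment variables documented above. Defaults mirror the documented
# https://localhost:9200 address; the function is not part of OpenRAG.

def opensearch_settings() -> dict:
    port = os.environ.get("OPENSEARCH_PORT", "9200")
    settings = {"url": f"https://localhost:{port}", "auth_mode": "jwt"}
    # Basic auth mode, if selected in the flow, reads these variables
    # instead of using JWT.
    user = os.environ.get("OPENSEARCH_USERNAME")
    password = os.environ.get("OPENSEARCH_PASSWORD")
    if user and password:
        settings.update(auth_mode="basic", username=user, password=password)
    return settings
```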

You can monitor ingestion to see the progress of the uploads and check for failed uploads.

Supported file types

When ingesting from the local file system or cloud storage connectors, OpenRAG supports the following file types:

  • .adoc
  • .asciidoc
  • .bmp
  • .csv
  • .doc
  • .docx
  • .gif
  • .htm
  • .html
  • .jpeg
  • .jpg
  • .md
  • .odt
  • .pdf
  • .png
  • .ppt
  • .pptx
  • .rtf
  • .tiff
  • .txt
  • .webp
  • .xls
  • .xlsx
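The extension list above can serve as a quick pre-upload filter. The helper below is a hypothetical convenience, not part of OpenRAG's API; only the extension set itself comes from this document.

```python
from pathlib import Path

# Supported file extensions, copied from the list above. The helper is a
# hypothetical pre-upload check, not part of OpenRAG's API.
SUPPORTED_EXTENSIONS = {
    ".adoc", ".asciidoc", ".bmp", ".csv", ".doc", ".docx", ".gif", ".htm",
    ".html", ".jpeg", ".jpg", ".md", ".odt", ".pdf", ".png", ".ppt",
    ".pptx", ".rtf", ".tiff", ".txt", ".webp", ".xls", ".xlsx",
}

def is_supported(filename: str) -> bool:
    """Return True if the file's extension is in the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```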

Ingest local files temporarily

When using the OpenRAG Chat, click Add in the chat input field to upload a file to the current chat session. Files added this way are processed and made available to the agent for the current conversation only. These files aren't stored in the knowledge base permanently.

Ingest files from cloud storage

To ingest knowledge from cloud storage using an OpenRAG cloud storage connector, do the following:

  1. Configure a cloud storage connector.

  2. Click Knowledge to view your OpenSearch knowledge base.

  3. Click Add Knowledge, and then select a storage provider.

  4. On the Add Cloud Knowledge page, click Add Files, and then select the files and folders to ingest from the connected storage.

  5. Click Ingest Files.

The selected files are processed in the background through the OpenSearch Ingestion flow.


Ingest knowledge from URLs

When using the OpenRAG chat, you can enter URLs in the chat, and their content is ingested in real time during your conversation.

OpenRAG runs the OpenSearch URL Ingestion flow to ingest web content from URLs. This flow isn't directly accessible from the OpenRAG user interface, and it doesn't use Docling. Instead, this flow is called by the OpenRAG OpenSearch Agent flow as a Model Context Protocol (MCP) tool in Langflow. The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base.

tip

Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it.

To ingest URLs recursively, edit the Depth parameter in the OpenSearch URL Ingestion flow.

The OpenRAG chat cannot ingest URLs that end in static document file extensions like .pdf. To upload these types of files, see Ingest local files and folders and Ingest files from cloud storage.
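Depth-limited URL ingestion, as controlled by the Depth parameter mentioned above, can be sketched as a breadth-first traversal. This is a hypothetical illustration, not the flow's actual implementation; `fetch_page` is a stand-in for real HTTP fetching and link extraction.

```python
# Hypothetical sketch of depth-limited URL ingestion. fetch_page is a
# stand-in callable returning (content, links); this is not the
# OpenSearch URL Ingestion flow's actual implementation.

def ingest_urls(start_url: str, fetch_page, depth: int = 0) -> list[str]:
    """Ingest start_url, then linked pages, down to `depth` levels of links."""
    ingested, frontier, seen = [], [start_url], {start_url}
    for _ in range(depth + 1):
        next_frontier = []
        for url in frontier:
            content, links = fetch_page(url)
            ingested.append(url)  # a real flow would chunk and embed content here
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return ingested
```

With depth 0 only the entered URL is ingested; each additional level follows one more layer of links, while the `seen` set prevents re-ingesting the same page.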

Monitor ingestion

Depending on the amount of data to ingest, document ingestion can take a few seconds, minutes, or longer. For this reason, document ingestion tasks run in the background.

In the OpenRAG user interface, a badge is shown on Tasks when OpenRAG tasks are active. Click Tasks to inspect and cancel tasks. Tasks are separated into multiple sections:

  • The Active Tasks section includes all tasks that are Pending, Running, or Processing:

    • Pending: The task is queued and waiting to start.
    • Running: The task is actively processing files.
    • Processing: The task is performing ingestion operations.

    To stop an active task, click Cancel. Canceling a task stops processing immediately and marks the ingestion as failed.

  • The Recent Tasks section lists recently finished tasks.

    warning

    Completed doesn't mean success.

    A completed task can report successful ingestions, failed ingestions, or both, depending on the number of files processed.

    Check the Success and Failed counts for each completed task to determine the overall success rate.

    Failed means something went wrong during ingestion, or the task was manually canceled.

For each task, depending on its state, you can find the task ID, start time, duration, number of files processed successfully, number of files that failed, and the number of files enqueued for processing.
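The "Completed doesn't mean success" warning above can be made concrete with a small model of a task's per-file counts. The class and field names are illustrative assumptions, not OpenRAG's actual data model.

```python
from dataclasses import dataclass

# Hypothetical model of a completed task's counts, illustrating why
# "Completed doesn't mean success": the outcome depends on the per-file
# Success and Failed counts. Names are illustrative, not OpenRAG's API.

@dataclass
class TaskSummary:
    task_id: str
    success: int
    failed: int

    @property
    def total(self) -> int:
        return self.success + self.failed

    @property
    def success_rate(self) -> float:
        return self.success / self.total if self.total else 0.0
```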

Ingestion performance expectations

Ingestion performance depends on many factors, such as the number and size of files ingested, file types, file contents, embedding model, chunk size, and hardware resources.

Particularly when ingesting folders and very large files, such as PDFs with more than 300 pages, ingestion can take a long time or time out. For more information, see Troubleshoot document ingestion or similarity search issues.

Example: Ingestion performance test

The following performance test was conducted with Docling Serve.

On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes. This equates to approximately 2.4 seconds per document (about 0.43 documents per second).

You can generally expect equal or better performance on developer laptops, and significantly faster performance on servers. Throughput scales with CPU cores, memory, storage speed, and configuration choices, such as the embedding model, chunk size, overlap, and concurrency.

This test returned 12 errors, approximately 1.1 percent of the total files ingested. All errors were file-specific, and they didn't stop the pipeline.

  • Ingestion dataset:

    • Total files: 1,083 items mounted
    • Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)
  • Hardware specifications:

    • Machine: Apple M4 Pro

    • Podman VM:

      • Name: podman-machine-default
      • Type: applehv
      • vCPUs: 7
      • Memory: 8 GiB
      • Disk size: 100 GiB
  • Test results:

    2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
    2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
    ...
    2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
  • Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)

  • Throughput: Approximately 2.4 seconds per document (about 0.43 documents per second)
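The throughput figures can be derived directly from the dataset size, file count, and elapsed time reported above:

```python
# Reproducing the throughput arithmetic from the test results above.
total_files = 1082            # file_count from the ingestion log
elapsed_seconds = 2535        # about 42 minutes 15 seconds
total_bytes = 5_026_474_862   # about 5.03 GB

seconds_per_doc = elapsed_seconds / total_files
docs_per_second = total_files / elapsed_seconds
mb_per_second = total_bytes / elapsed_seconds / 1_000_000

print(f"{seconds_per_doc:.1f} s/doc, {docs_per_second:.2f} docs/s, {mb_per_second:.1f} MB/s")
```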

See also