
Ingest knowledge

Upload documents to your OpenRAG OpenSearch instance to populate your knowledge base with unique content, such as your own company documents, research papers, or websites. Documents are processed through OpenRAG's knowledge ingestion flows with Docling.

OpenRAG can ingest knowledge from direct file uploads, URLs, and cloud storage connectors.

Knowledge ingestion is powered by OpenRAG's built-in flows, which use Docling to process documents before storing them in your OpenSearch database. During ingestion, documents are broken into smaller chunks of content that are embedded using your selected embedding model. The chunks, embeddings, and associated metadata (which links chunks from the same document) are then stored in your OpenSearch database.

To modify chunking behavior and other ingestion settings, see Knowledge ingestion settings and Inspect and modify flows.
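The chunk-and-embed pipeline described above can be sketched in a few lines. This is a simplified illustration only; the function names, default sizes, and metadata fields are stand-ins, not OpenRAG's actual API:

```python
# Minimal sketch of ingestion: split text into overlapping chunks, then
# attach per-chunk metadata that links each chunk back to its source document.
# (Illustrative names and defaults, not OpenRAG's real implementation.)

def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size characters, where each
    chunk overlaps the previous one by `overlap` characters."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def build_records(filename: str, text: str) -> list[dict]:
    """Pair each chunk with metadata identifying its source document, ready
    to be embedded and stored."""
    return [
        {"text": chunk, "filename": filename, "chunk_index": i}
        for i, chunk in enumerate(split_text(text))
    ]
```

In the real flow, each record's text is passed to the embedding model, and the resulting vector is stored alongside the chunk and its metadata in OpenSearch.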

Ingest local files and folders

You can upload files and folders from your local machine to your knowledge base:

  1. Click Knowledge to view your OpenSearch knowledge base.

  2. Click Add Knowledge to add your own documents to your OpenRAG knowledge base.

  3. To upload one file, click File. To upload all documents in a folder, click Folder.

    The default path is ~/.openrag/documents. To change this path, see Set the local documents path.

The selected files are processed in the background through the OpenSearch Ingestion flow.

About the OpenSearch Ingestion flow

When you upload documents locally or with cloud storage connectors, the OpenSearch Ingestion flow runs in the background. By default, this flow uses Docling Serve to import and process documents.

Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it if you want to change the knowledge ingestion settings.

The OpenSearch Ingestion flow comprises several components that work together to process and store documents in your knowledge base:

  • Docling Serve component: Ingests files and processes them by connecting to OpenRAG's local Docling Serve service. The output is DoclingDocument data that contains the extracted text and metadata from the documents.

  • Export DoclingDocument component: Exports processed DoclingDocument data to Markdown format with image placeholders. This conversion standardizes the document data in preparation for further processing.

  • DataFrame Operations component: Three of these components run sequentially to add metadata to the document data: filename, file_size, and mimetype.

  • Split Text component: Splits the processed text into chunks, based on the configured chunk size and overlap settings.

  • Secret Input component: If needed, four of these components securely fetch the OAuth authentication configuration variables: CONNECTOR_TYPE, OWNER, OWNER_EMAIL, and OWNER_NAME.

  • Create Data component: Combines the authentication credentials from the Secret Input components into a structured data object that is associated with the document embeddings.

  • Embedding Model component: Generates vector embeddings using your selected embedding model.

  • OpenSearch component: Stores the processed documents and their embeddings in a documents index of your OpenRAG OpenSearch knowledge base.

    The default address for the OpenSearch instance is https://opensearch:9200. To change this address, edit the OPENSEARCH_PORT environment variable.

    The default authentication method is JSON Web Token (JWT) authentication. If you edit the flow, you can select basic auth mode, which uses the OPENSEARCH_USERNAME and OPENSEARCH_PASSWORD environment variables for authentication instead of JWT.
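The address and auth settings above can be pictured as a small configuration step. This sketch is hypothetical (the `build_opensearch_config` helper is not part of OpenRAG); only the environment variable names come from the documentation:

```python
# Illustrative sketch of assembling OpenSearch connection settings from
# environment variables. OPENSEARCH_PORT, OPENSEARCH_USERNAME, and
# OPENSEARCH_PASSWORD are the documented variables; the helper itself
# is a hypothetical example.

def build_opensearch_config(env: dict) -> dict:
    port = env.get("OPENSEARCH_PORT", "9200")
    config = {"host": f"https://opensearch:{port}"}
    if "OPENSEARCH_USERNAME" in env and "OPENSEARCH_PASSWORD" in env:
        # Basic auth mode, available if you edit the flow
        config["auth"] = (env["OPENSEARCH_USERNAME"], env["OPENSEARCH_PASSWORD"])
    else:
        # Default: JSON Web Token (JWT) authentication
        config["auth"] = "jwt"
    return config
```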

You can monitor ingestion to see the progress of the uploads and check for failed uploads.

Ingest local files temporarily

When using the OpenRAG Chat, click Add in the chat input field to upload a file to the current chat session. Files added this way are processed and made available to the agent for the current conversation only. These files aren't stored in the knowledge base permanently.

Ingest files with cloud storage connectors

OpenRAG can use cloud storage connectors to ingest documents from the following external services:

  • AWS S3
  • Google Drive
  • Microsoft OneDrive
  • Microsoft SharePoint

These connectors let you ingest files from cloud storage directly into your OpenRAG knowledge base.

OAuth credentials are used to authorize access to cloud storage services.

Configure cloud storage connectors

Before you can ingest documents from cloud storage, you must authorize OpenRAG's access to your cloud storage services.

Typically, this requires that you register OpenRAG as an OAuth application in your cloud provider, and then obtain the app's OAuth credentials, such as a client ID and secret key. To enable multiple connectors, you must register an app and generate credentials for each provider.

Then, add the OAuth credentials to your OpenRAG configuration:

If you use the Terminal User Interface (TUI) to manage your OpenRAG services, enter OAuth credentials on the Advanced Setup page. You can do this during installation, or you can add the credentials afterwards:

  1. If OpenRAG is running, click Stop All Services in the TUI.

  2. Open the Advanced Setup page, and then, under API Keys, add the OAuth credentials for the cloud storage providers that you want to use.

  3. For each connector you configured, register the redirect URIs shown in the TUI in your OAuth apps.

    The redirect URIs are used for the cloud storage connector webhooks. For Google, the redirect URIs are also used to redirect users back to OpenRAG after they sign in.

  4. Optional: Under Others, set the Webhook Base URL to the base address for your OAuth connector endpoints. If set, the OAuth connector webhook URLs are constructed as WEBHOOK_BASE_URL/connectors/${provider}/webhook. This option is required to enable automatic ingestion from cloud storage.

  5. Click Save Configuration to add the OAuth credentials to your OpenRAG .env file.

  6. Click Start Services to restart the OpenRAG containers with the new configuration.

  7. Launch the OpenRAG app.

    If you provided Google OAuth credentials, you must sign in with Google before you are redirected to your OpenRAG instance.
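The webhook URL pattern from step 4 can be sketched as a small helper. The provider identifier values below are assumptions for illustration; only the `WEBHOOK_BASE_URL/connectors/${provider}/webhook` pattern comes from the documentation:

```python
# Sketch of the documented webhook URL pattern:
# WEBHOOK_BASE_URL/connectors/${provider}/webhook
# The helper name and provider strings are illustrative assumptions.

def webhook_url(base_url: str, provider: str) -> str:
    """Construct a cloud storage connector webhook URL."""
    return f"{base_url.rstrip('/')}/connectors/{provider}/webhook"
```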

Ingest files from cloud storage

To ingest knowledge with a cloud storage connector, do the following:

  1. Click Knowledge to view your OpenSearch knowledge base.

  2. Click Add Knowledge, and then select a storage provider.

  3. On the Add Cloud Knowledge page, click Add Files, and then select the files and folders to ingest from the connected storage.

  4. Click Ingest Files.

The selected files are processed in the background through the OpenSearch Ingestion flow.


Ingest knowledge from URLs

When using the OpenRAG chat, you can enter URLs into the chat to ingest web content in real time during your conversation.

info

The chat cannot ingest URLs that end in static document file extensions like .pdf. To upload these types of files, see Ingest local files and folders and Ingest files with cloud storage connectors.

OpenRAG runs the OpenSearch URL Ingestion flow to ingest web content from URLs. This flow isn't directly accessible from the OpenRAG user interface. Instead, this flow is called by the OpenRAG OpenSearch Agent flow as a Model Context Protocol (MCP) tool. The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base. Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it. For more information about MCP in Langflow, see the Langflow documentation on MCP clients and MCP servers.
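The extension restriction noted above can be sketched as a simple pre-check. This is an illustrative assumption: the exact extension list and any filtering logic OpenRAG uses internally are not documented here:

```python
from urllib.parse import urlparse

# Sketch of the restriction described in the note above: URLs ending in
# static document file extensions can't be ingested through chat.
# The extension list and function name are illustrative assumptions.

STATIC_DOC_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}

def chat_can_ingest(url: str) -> bool:
    """Return True if the URL doesn't end in a static document extension."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in STATIC_DOC_EXTENSIONS)
```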

Monitor ingestion

Depending on the amount of data to ingest, document ingestion can take a few seconds, minutes, or longer. For this reason, document ingestion tasks run in the background.

In the OpenRAG user interface, a badge is shown on Tasks when OpenRAG tasks are active. Click Tasks to inspect and cancel tasks. Tasks are separated into multiple sections:

  • The Active Tasks section includes all tasks that are Pending, Running, or Processing:

    • Pending: The task is queued and waiting to start.
    • Running: The task is actively processing files.
    • Processing: The task is performing ingestion operations.

    To stop an active task, click Cancel. Canceling a task stops processing immediately and marks the ingestion as failed.

  • The Recent Tasks section lists recently finished tasks.

    warning

    Completed doesn't mean success.

    A completed task can report successful ingestions, failed ingestions, or both, depending on the number of files processed.

    Check the Success and Failed counts for each completed task to determine the overall success rate.

    Failed means something went wrong during ingestion, or the task was manually canceled.

For each task, depending on its state, you can find the task ID, start time, duration, number of files processed successfully, number of files that failed, and the number of files enqueued for processing.
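Interpreting a completed task's counts comes down to simple arithmetic, since a "Completed" task can still include failures. A minimal sketch (the helper is hypothetical, not an OpenRAG function):

```python
# Sketch of computing a completed task's overall success rate from its
# Success and Failed counts. Hypothetical helper for illustration.

def success_rate(success_count: int, failed_count: int) -> float:
    """Fraction of processed files that were ingested successfully."""
    total = success_count + failed_count
    return success_count / total if total else 0.0
```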

Ingestion performance expectations

The following performance test was conducted with Docling Serve.

On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes. This equates to approximately 0.43 documents per second, or about 2.3 seconds per document.

You can generally expect equal or better performance on developer laptops, and significantly faster performance on servers. Throughput scales with CPU cores, memory, storage speed, and configuration choices, such as the embedding model, chunk size, overlap, and concurrency.

This test returned 12 errors, approximately 1.1 percent of the total files ingested. All errors were file-specific, and they didn't stop the pipeline.
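The throughput figure follows directly from the test numbers: 1,082 files processed in roughly 42 minutes 15 seconds.

```python
# Deriving the throughput from the performance test results.

elapsed_seconds = 42 * 60 + 15          # 42 min 15 s = 2,535 s
total_files = 1082                      # files processed per the test log

seconds_per_doc = elapsed_seconds / total_files   # ~2.3 s per document
docs_per_second = total_files / elapsed_seconds   # ~0.43 documents per second
```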

Ingestion performance test details
  • Ingestion dataset:

    • Total files: 1,083 items mounted
    • Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)
  • Hardware specifications:

    • Machine: Apple M4 Pro

    • Podman VM:

      • Name: podman-machine-default
      • Type: applehv
      • vCPUs: 7
      • Memory: 8 GiB
      • Disk size: 100 GiB
  • Test results:

    2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
    2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
    ...
    2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
  • Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)

  • Throughput: Approximately 0.43 documents per second (about 2.3 seconds per document)

See also