# Ingest knowledge
Upload documents to your OpenRAG OpenSearch instance to populate your knowledge base with your own content, such as company documents, research papers, or websites.
OpenRAG can ingest knowledge from direct file uploads, URLs, and OAuth-authenticated connectors.
Knowledge ingestion is powered by OpenRAG's built-in knowledge ingestion flows, which use Docling to process documents before storing them in your OpenSearch database. During ingestion, documents are broken into smaller chunks of content that are then embedded using your selected embedding model. The chunks, embeddings, and associated metadata (which connects chunks of the same document) are stored in your OpenSearch database.
To modify chunking behavior and other ingestion settings, see Knowledge ingestion settings and Inspect and modify flows.
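To make the chunk, embed, and store sequence concrete, the following minimal Python sketch mirrors the pattern. It is illustrative only: the `chunk` and `embed` functions and the index fields are simplified stand-ins, not OpenRAG's actual flow components.

```python
# Minimal sketch of the chunk -> embed -> store pattern that the ingestion
# flows follow. Illustrative only; not OpenRAG's actual implementation.
from opensearchpy import OpenSearch  # pip install opensearch-py

def chunk(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

def embed(chunks: list[str]) -> list[list[float]]:
    """Toy stand-in for your selected embedding model."""
    return [[float(len(c))] * 4 for c in chunks]

client = OpenSearch(
    hosts=["https://localhost:9200"],
    http_auth=("admin", "admin"),  # adjust credentials and TLS for your setup
    verify_certs=False,
)

text = open("report.txt").read()
chunks = chunk(text)
for i, (piece, vector) in enumerate(zip(chunks, embed(chunks))):
    client.index(
        index="documents",
        body={
            "text": piece,
            "embedding": vector,
            # Metadata connects chunks of the same document:
            "filename": "report.txt",
            "chunk_index": i,
        },
    )
```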
## Ingest local files and folders

You can upload files and folders from your local machine to your knowledge base:

1. Click Knowledge to view your OpenSearch knowledge base.
2. Click Add Knowledge to add your own documents to your OpenRAG knowledge base.
3. To upload one file, click File. To upload all documents in a folder, click Folder.

   The default path is `~/.openrag/documents`. To change this path, see Set the local documents path.

The selected files are processed in the background through the OpenSearch Ingestion flow.
### About the OpenSearch Ingestion flow
When you upload documents locally or with OAuth connectors, the OpenSearch Ingestion flow runs in the background. By default, this flow uses Docling Serve to import and process documents.
Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it if you want to change the knowledge ingestion settings.
The OpenSearch Ingestion flow consists of several components that work together to process and store documents in your knowledge base:

- Docling Serve component: Ingests files and processes them by connecting to OpenRAG's local Docling Serve service. The output is `DoclingDocument` data that contains the extracted text and metadata from the documents.
- Export DoclingDocument component: Exports processed `DoclingDocument` data to Markdown format with image placeholders. This conversion standardizes the document data in preparation for further processing.
- DataFrame Operations component: Three of these components run sequentially to add metadata to the document data: `filename`, `file_size`, and `mimetype`.
- Split Text component: Splits the processed text into chunks, based on the configured chunk size and overlap settings.
- Secret Input component: If needed, four of these components securely fetch the OAuth authentication configuration variables: `CONNECTOR_TYPE`, `OWNER`, `OWNER_EMAIL`, and `OWNER_NAME`.
- Create Data component: Combines the authentication credentials from the Secret Input components into a structured data object that is associated with the document embeddings.
- Embedding Model component: Generates vector embeddings using your selected embedding model.
- OpenSearch component: Stores the processed documents and their embeddings in a `documents` index of your OpenRAG OpenSearch knowledge base.

  The default address for the OpenSearch instance is `https://opensearch:9200`. To change this address, edit the `OPENSEARCH_PORT` environment variable.

  The default authentication method is JSON Web Token (JWT) authentication. If you edit the flow, you can select `basic` auth mode, which uses the `OPENSEARCH_USERNAME` and `OPENSEARCH_PASSWORD` environment variables for authentication instead of JWT.
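Putting the pieces together, the record stored for each chunk looks roughly like the following Python sketch. The field names follow the component descriptions above, but the exact schema is an assumption; inspect the flow in Langflow for the authoritative mapping. The client shown uses the optional `basic` auth mode.

```python
# Approximate shape of one stored chunk, assembled from the components above.
# Field names and casing are assumptions based on the flow description.
import os
from opensearchpy import OpenSearch

record = {
    "text": "...one chunk of the source document...",  # Split Text output
    "embedding": [0.12, -0.07, 0.33],                  # Embedding Model output
    "filename": "quarterly-report.pdf",                # DataFrame Operations
    "file_size": 482133,
    "mimetype": "application/pdf",
    "CONNECTOR_TYPE": "google_drive",                  # Secret Input values,
    "OWNER": "user-123",                               # combined by Create Data
    "OWNER_EMAIL": "user@example.com",
    "OWNER_NAME": "Example User",
}

# With basic auth mode enabled, the OpenSearch component reads these variables:
client = OpenSearch(
    hosts=["https://opensearch:9200"],
    http_auth=(os.environ["OPENSEARCH_USERNAME"], os.environ["OPENSEARCH_PASSWORD"]),
    verify_certs=False,  # adjust to match your TLS setup
)
client.index(index="documents", body=record)
```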
You can monitor ingestion to see the progress of the uploads and check for failed uploads.
## Ingest local files temporarily

When using the OpenRAG Chat, you can upload a file from the chat input field to attach it to the current chat session. Files added this way are processed and made available to the agent for the current conversation only; they aren't stored permanently in the knowledge base.
## Ingest files with OAuth connectors

OpenRAG can use OAuth-authenticated connectors to ingest documents from the following external services:

- AWS S3
- Google Drive
- Microsoft OneDrive
- Microsoft SharePoint

These connectors let users ingest files from their cloud storage directly into your OpenRAG knowledge base.

Individual users can connect their personal cloud storage accounts to OpenRAG. Each user must separately authorize OpenRAG to access their own cloud storage. When a user connects a cloud storage service, they are redirected to authenticate with that service provider and grant OpenRAG permission to sync documents from their personal cloud storage.
### Enable OAuth connectors
Before users can connect their own cloud storage accounts, you must configure the provider's OAuth credentials in OpenRAG. Typically, this requires that you register OpenRAG as an OAuth application in your cloud provider, and then obtain the app's OAuth credentials, such as a client ID and secret key. To enable multiple connectors, you must register an app and generate credentials for each provider.
#### TUI-managed services

If you use the Terminal User Interface (TUI) to manage your OpenRAG services, enter OAuth credentials on the Advanced Setup page. You can do this during installation, or you can add the credentials afterwards:

1. If OpenRAG is running, click Stop All Services in the TUI.

2. Open the Advanced Setup page, and then add the OAuth credentials for the cloud storage providers that you want to use under API Keys:

   - Google: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the Google Cloud Console. For more information, see the Google OAuth client documentation.
   - Microsoft: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide Azure application registration credentials for SharePoint and OneDrive. For more information, see the Microsoft Graph OAuth client documentation.
   - Amazon: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on Configuring access to AWS applications.

3. Register the redirect URIs shown in the TUI with your OAuth provider. These are the URLs that the provider uses to redirect users back to OpenRAG after they sign in.

4. Click Save Configuration to add the OAuth credentials to your OpenRAG `.env` file.

5. Click Start Services to restart the OpenRAG containers with OAuth enabled.

6. Launch the OpenRAG app. You should be prompted to sign in to your OAuth provider before being redirected to your OpenRAG instance.
#### Self-managed services

If you installed OpenRAG with self-managed services, set OAuth credentials in your OpenRAG `.env` file. You can do this during initial setup, or you can add the credentials afterwards:

1. Stop all OpenRAG containers:

   Docker:

   ```bash
   docker stop $(docker ps -q)
   ```

   Podman:

   ```bash
   podman stop --all
   ```

2. Edit your OpenRAG `.env` file to add the OAuth credentials for the cloud storage providers that you want to use:

   - Google: Provide your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the Google Cloud Console. For more information, see the Google OAuth client documentation.

     ```bash
     GOOGLE_OAUTH_CLIENT_ID=
     GOOGLE_OAUTH_CLIENT_SECRET=
     ```

   - Microsoft: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, provide Azure application registration credentials for SharePoint and OneDrive. For more information, see the Microsoft Graph OAuth client documentation.

     ```bash
     MICROSOFT_GRAPH_OAUTH_CLIENT_ID=
     MICROSOFT_GRAPH_OAUTH_CLIENT_SECRET=
     ```

   - Amazon: Provide your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on Configuring access to AWS applications.

     ```bash
     AWS_ACCESS_KEY_ID=
     AWS_SECRET_ACCESS_KEY=
     ```

3. Save the `.env` file.

4. Restart your OpenRAG containers:

   Docker:

   ```bash
   docker compose up -d
   ```

   Podman:

   ```bash
   podman compose up -d
   ```

5. Access the OpenRAG frontend at `http://localhost:3000`. You should be prompted to sign in to your OAuth provider before being redirected to your OpenRAG instance.
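If you want to sanity-check the credentials before restarting, a short script like the following (a hypothetical helper, not part of OpenRAG) reports any variables that are unset. Run it in a shell where your `.env` values have been exported:

```python
# check_oauth_env.py: a hypothetical convenience script, not part of OpenRAG.
# It only checks variables already exported into the environment; running
# `python check_oauth_env.py` won't read the .env file by itself.
import os

# Variable names from the .env examples above; trim to the providers you use.
REQUIRED = [
    "GOOGLE_OAUTH_CLIENT_ID",
    "GOOGLE_OAUTH_CLIENT_SECRET",
    "MICROSOFT_GRAPH_OAUTH_CLIENT_ID",
    "MICROSOFT_GRAPH_OAUTH_CLIENT_SECRET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All OAuth variables are set.")
```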
### Authenticate and ingest files from cloud storage

After you start OpenRAG with OAuth connectors enabled, each user is prompted to authenticate with the OAuth provider upon accessing your OpenRAG instance. Individual authentication is required to access a user's cloud storage from your OpenRAG instance.

For example, if a user navigates to the default OpenRAG URL at `http://localhost:3000`, they are redirected to the OAuth provider's sign-in page. After authenticating and granting the required permissions for OpenRAG, the user is redirected back to OpenRAG.
To ingest knowledge with an OAuth connector, do the following:

1. Click Knowledge to view your OpenSearch knowledge base.
2. Click Add Knowledge, and then select a storage provider.
3. On the Add Cloud Knowledge page, click Add Files, and then select the files and folders to ingest from the connected storage.
4. Click Ingest Files.

The selected files are processed in the background through the OpenSearch Ingestion flow.
These documents go through the same processing as local uploads. For details, see About the OpenSearch Ingestion flow. You can monitor ingestion to see the progress of the uploads and check for failed uploads.
## Ingest knowledge from URLs

When using the OpenRAG chat, you can enter URLs into the chat to be ingested in real time during your conversation.

The chat can't ingest URLs that end in static document file extensions, such as `.pdf`. To upload these types of files, see Ingest local files and folders and Ingest files with OAuth connectors.

OpenRAG runs the OpenSearch URL Ingestion flow to ingest web content from URLs. This flow isn't directly accessible from the OpenRAG user interface. Instead, it is called by the OpenRAG OpenSearch Agent flow as a Model Context Protocol (MCP) tool: the agent calls the tool to fetch web content from a given URL, and then ingests that content into your OpenSearch knowledge base. Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it. For more information about MCP in Langflow, see the Langflow documentation on MCP clients and MCP servers.
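In essence, the tool's job is fetch-then-ingest. The following rough sketch shows that pattern with `requests` and `BeautifulSoup` as stand-ins; the actual flow uses Langflow components rather than these libraries.

```python
# Rough sketch of the fetch-then-ingest pattern behind URL ingestion.
# requests and BeautifulSoup are stand-ins, not the flow's actual components.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def fetch_text(url: str) -> str:
    """Download a page and reduce it to plain text."""
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

text = fetch_text("https://example.com/article")
# ...then chunk, embed, and index the text, as in the ingestion sketch
# earlier on this page.
```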
## Monitor ingestion

Document ingestion tasks run in the background.

In the OpenRAG user interface, a badge is shown on Tasks when OpenRAG tasks are active. Click Tasks to inspect and cancel tasks:

- Active Tasks: All tasks that are Pending, Running, or Processing. For each active task, depending on its state, you can find the task ID, start time, duration, number of files processed, and the total files enqueued for processing.

  - Pending: The task is queued and waiting to start.
  - Running: The task is actively processing files.
  - Processing: The task is performing ingestion operations.

- Failed: Something went wrong during ingestion, or the task was manually canceled. For troubleshooting advice, see Troubleshoot ingestion.
To stop an active task, click Cancel. Canceling a task stops processing immediately and marks the task as Failed.
## Ingestion performance expectations

The following performance test was conducted with Docling Serve.

On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes. This equates to approximately 2.4 seconds per document (roughly 0.43 documents per second).

You can generally expect equal or better performance on developer laptops, and significantly faster performance on servers. Throughput scales with CPU cores, memory, storage speed, and configuration choices, such as the embedding model, chunk size, overlap, and concurrency.

This test returned 12 errors, approximately 1.1 percent of the total files ingested. All errors were file-specific, and they didn't stop the pipeline.
### Ingestion performance test details

- Ingestion dataset:

  - Total files: 1,083 items mounted
  - Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)

- Hardware specifications:

  - Machine: Apple M4 Pro
  - Podman VM:
    - Name: podman-machine-default
    - Type: applehv
    - vCPUs: 7
    - Memory: 8 GiB
    - Disk size: 100 GiB

- Test results:

  ```
  2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
  2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
  ...
  2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
  ```

- Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)

- Throughput: Approximately 2.4 seconds per document (roughly 0.43 documents per second)
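As a quick check, the throughput figures follow directly from the reported totals:

```python
# Derive the throughput figures from the reported totals.
files = 1082                # total_files from the log output
elapsed = 42 * 60 + 15      # 2,535 seconds

print(elapsed / files)      # ~2.34 seconds per document
print(files / elapsed)      # ~0.43 documents per second
```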
## Troubleshoot ingestion

The following issues can occur during document ingestion.

### Failed or slow ingestion

If an ingestion task fails, do the following:

- Make sure you are uploading supported file types.
- Split excessively large files into smaller files before uploading.
- Remove unusual embedded content, such as videos or animations, before uploading. Although Docling can replace some non-text content with placeholders during ingestion, some embedded content might cause errors.
- Make sure your Podman or Docker VM has sufficient memory for the ingestion tasks. The minimum recommendation is 8 GB of RAM, and more is recommended if you regularly upload large files. For more information, see Memory issue with Podman on macOS and Container out of memory errors.
- If OCR ingestion fails because OCR support is missing, see OCR ingestion fails (easyocr not installed).
### Problems when referencing documents in chat

If the OpenRAG Chat doesn't seem to use your documents correctly, browse your knowledge base to confirm that the documents are uploaded in full and the chunks are correct.

If the documents are present and well-formed, check your knowledge filters. If a global filter is applied, make sure the expected documents are included in it. If the global filter excludes any documents, the agent cannot access those documents unless you apply a chat-level filter or change the global filter.

If text is missing or incorrectly processed, reupload the documents after modifying the ingestion parameters or the documents themselves. For example:

- Break combined documents into separate files for better metadata context.
- Make sure scanned documents are legible enough for extraction, and enable the OCR option. Poorly scanned documents might require additional preparation or rescanning before ingestion.
- Adjust the Chunk Size and Chunk Overlap settings to better suit your documents. Larger chunks provide more context but can include irrelevant information, while smaller chunks yield more precise semantic search but can lack context, as the sketch after this list illustrates.
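The following toy splitter (not OpenRAG's actual Split Text component) illustrates how size and overlap change the number of chunks and the context they share:

```python
# Toy character-based splitter that shows the chunk size/overlap trade-off.
# OpenRAG's Split Text component is more sophisticated than this.
def split(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]

text = "word " * 400  # ~2,000 characters of filler text

for size, overlap in [(1000, 200), (250, 50)]:
    chunks = split(text, size, overlap)
    print(f"size={size}, overlap={overlap} -> {len(chunks)} chunks")

# Larger chunks: fewer, context-rich pieces that may include irrelevant text.
# Smaller chunks: more, precise pieces that may lose surrounding context.
```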
For more information about modifying ingestion parameters and flows, see Knowledge ingestion settings.