Ingest knowledge
Upload documents to your OpenRAG OpenSearch instance to populate your knowledge base with unique content, such as your own company documents, research papers, or websites. Documents are processed through OpenRAG's knowledge ingestion flows with Docling.
OpenRAG can ingest knowledge from direct file uploads, URLs, and cloud storage connectors.
Knowledge ingestion is powered by OpenRAG's built-in knowledge ingestion flows that use Docling to process documents before storing the documents in your OpenSearch database. During ingestion, documents are broken into smaller chunks of content that are then embedded using your selected embedding model. Then, the chunks, embeddings, and associated metadata (which connects chunks of the same document) are stored in your OpenSearch database.
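The chunk-embed-store sequence described above can be sketched as follows. This is an illustrative sketch only: the function names, default chunk size and overlap, and record shape are assumptions for demonstration, not OpenRAG's actual internals.

```python
# Hypothetical sketch of the ingestion steps: chunk, embed, attach metadata.
# Names and defaults are illustrative, not OpenRAG's actual API.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlap, so content that spans a
    chunk boundary is preserved in the neighboring chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def ingest(text: str, filename: str, embed) -> list[dict]:
    """Return records ready for indexing: chunk text, embedding vector, and
    shared metadata that links chunks back to the same source document."""
    return [
        {
            "chunk": chunk,
            "embedding": embed(chunk),
            "metadata": {"filename": filename, "chunk_index": i},
        }
        for i, chunk in enumerate(chunk_text(text))
    ]

# Example with a dummy embedding function standing in for the embedding model:
records = ingest("word " * 600, "example.txt", embed=lambda c: [0.0] * 8)
```

Because consecutive chunks share the overlap region, a sentence cut at a chunk boundary still appears whole in one of the two neighboring chunks.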
To modify chunking behavior and other ingestion settings, see Knowledge ingestion settings and Inspect and modify flows.
Ingest local files and folders
You can upload files and folders from your local machine to your knowledge base:
1. Click Knowledge to view your OpenSearch knowledge base.

2. Click Add Knowledge to add your own documents to your OpenRAG knowledge base.

3. To upload one file, click File. To upload all documents in a folder, click Folder.

   The default path is `~/.openrag/documents`. To change this path, see Set the local documents path.
The selected files are processed in the background through the OpenSearch Ingestion flow.
About the OpenSearch Ingestion flow
When you upload documents locally or with cloud storage connectors, the OpenSearch Ingestion flow runs in the background. By default, this flow uses Docling Serve to import and process documents.
Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it if you want to change the knowledge ingestion settings.
The OpenSearch Ingestion flow consists of several components that work together to process and store documents in your knowledge base:
- Docling Serve component: Ingests files and processes them by connecting to OpenRAG's local Docling Serve service. The output is `DoclingDocument` data that contains the extracted text and metadata from the documents.
- Export DoclingDocument component: Exports processed `DoclingDocument` data to Markdown format with image placeholders. This conversion standardizes the document data in preparation for further processing.
- DataFrame Operations component: Three of these components run sequentially to add metadata to the document data: `filename`, `file_size`, and `mimetype`.
- Split Text component: Splits the processed text into chunks, based on the configured chunk size and overlap settings.
- Secret Input component: If needed, four of these components securely fetch the OAuth authentication configuration variables: `CONNECTOR_TYPE`, `OWNER`, `OWNER_EMAIL`, and `OWNER_NAME`.
- Create Data component: Combines the authentication credentials from the Secret Input components into a structured data object that is associated with the document embeddings.
- Embedding Model component: Generates vector embeddings using your selected embedding model.
- OpenSearch component: Stores the processed documents and their embeddings in a `documents` index of your OpenRAG OpenSearch knowledge base.

  The default address for the OpenSearch instance is `https://opensearch:9200`. To change this address, edit the `OPENSEARCH_PORT` environment variable.

  The default authentication method is JSON Web Token (JWT) authentication. If you edit the flow, you can select `basic` auth mode, which uses the `OPENSEARCH_USERNAME` and `OPENSEARCH_PASSWORD` environment variables for authentication instead of JWT.
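As a sketch of how a flow customized for `basic` auth might read the connection settings named above, the snippet below assembles them from the environment. The fallback defaults shown here are illustrative assumptions, not OpenRAG's actual values.

```python
# Hypothetical sketch: building OpenSearch connection settings from the
# environment variables described in the docs. Defaults are illustrative.
import os

def opensearch_config() -> dict:
    port = os.environ.get("OPENSEARCH_PORT", "9200")
    return {
        "hosts": [f"https://opensearch:{port}"],
        # Basic auth mode reads these two variables instead of using JWT:
        "http_auth": (
            os.environ.get("OPENSEARCH_USERNAME", "admin"),  # assumed default
            os.environ.get("OPENSEARCH_PASSWORD", ""),
        ),
    }
```

A dictionary like this could then be passed to an OpenSearch client constructor when the flow is configured for basic authentication.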
You can monitor ingestion to see the progress of the uploads and check for failed uploads.
Ingest local files temporarily
When using the OpenRAG Chat, click Add in the chat input field to upload a file to the current chat session. Files added this way are processed and made available to the agent for the current conversation only. These files aren't stored in the knowledge base permanently.
Ingest files with cloud storage connectors
OpenRAG can use cloud storage connectors to ingest documents from the following external services:
- AWS S3
- Google Drive
- Microsoft OneDrive
- Microsoft SharePoint
These connectors ingest files from cloud storage into your OpenRAG knowledge base, using OAuth credentials to authorize access to each service.
Configure cloud storage connectors
Before you can ingest documents from cloud storage, you must authorize OpenRAG's access to your cloud storage services.
Typically, this requires that you register OpenRAG as an OAuth application in your cloud provider, and then obtain the app's OAuth credentials, such as a client ID and secret key. To enable multiple connectors, you must register an app and generate credentials for each provider.
Then, add the OAuth credentials to your OpenRAG configuration:
- TUI-managed services
- Self-managed services
If you use the Terminal User Interface (TUI) to manage your OpenRAG services, enter OAuth credentials on the Advanced Setup page. You can do this during installation, or you can add the credentials afterwards:
1. If OpenRAG is running, click Stop All Services in the TUI.

2. Open the Advanced Setup page, and then add the OAuth credentials for the cloud storage providers that you want to use under API Keys:

   - Google: Enter your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the Google Cloud Console. For more information, see the Google OAuth client documentation.

     Providing these Google credentials enables OAuth mode and the Google Drive cloud storage connector.

     Warning: Google is the only supported OAuth provider for OpenRAG. You must enter Google credentials if you want to enable OAuth mode. The Microsoft and Amazon credentials are used only to authorize the cloud storage connectors; OpenRAG doesn't offer OAuth provider integrations for Microsoft or Amazon.

   - Microsoft: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, enter Azure application registration credentials for SharePoint and OneDrive. For more information, see the Microsoft Graph OAuth client documentation.

   - Amazon: Enter your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on Configuring access to AWS applications.

3. For each connector you configured, register the redirect URIs shown in the TUI in your OAuth apps.

   The redirect URIs are used for the cloud storage connector webhooks. For Google, the redirect URIs are also used to redirect users back to OpenRAG after they sign in.

4. Optional: Under Others, set the Webhook Base URL to the base address for your OAuth connector endpoints. If set, the OAuth connector webhook URLs are constructed as `WEBHOOK_BASE_URL/connectors/${provider}/webhook`. This option is required to enable automatic ingestion from cloud storage.

5. Click Save Configuration to add the OAuth credentials to your OpenRAG `.env` file.

6. Click Start Services to restart the OpenRAG containers with the new configuration.

7. Launch the OpenRAG app.

   If you provided Google OAuth credentials, you must sign in with Google before you are redirected to your OpenRAG instance.
If you installed OpenRAG with self-managed services, set the OAuth credentials in your OpenRAG `.env` file. You can do this during initial setup, or you can add the credentials afterwards:
1. Stop all OpenRAG containers:

   Docker:

   ```shell
   docker stop $(docker ps -q)
   ```

   Podman:

   ```shell
   podman stop --all
   ```

2. Edit your OpenRAG `.env` file, and then add the OAuth and cloud storage environment variables for the providers that you want to use:

   ```
   GOOGLE_OAUTH_CLIENT_ID=
   GOOGLE_OAUTH_CLIENT_SECRET=
   MICROSOFT_GRAPH_OAUTH_CLIENT_ID=
   MICROSOFT_GRAPH_OAUTH_CLIENT_SECRET=
   AWS_ACCESS_KEY_ID=
   AWS_SECRET_ACCESS_KEY=
   ```

   - Google: Enter your Google OAuth Client ID and Google OAuth Client Secret. You can generate these in the Google Cloud Console. For more information, see the Google OAuth client documentation.

     Providing these Google credentials enables OAuth mode and the Google Drive cloud storage connector.

     Warning: Google is the only supported OAuth provider for OpenRAG. You must enter Google credentials if you want to enable OAuth mode. The Microsoft and Amazon credentials are used only to authorize the cloud storage connectors; OpenRAG doesn't offer OAuth provider integrations for Microsoft or Amazon.

   - Microsoft: For the Microsoft OAuth Client ID and Microsoft OAuth Client Secret, enter Azure application registration credentials for SharePoint and OneDrive. For more information, see the Microsoft Graph OAuth client documentation.

   - Amazon: Enter your AWS Access Key ID and AWS Secret Access Key with access to your S3 instance. For more information, see the AWS documentation on Configuring access to AWS applications.

3. Optional: Set the `WEBHOOK_BASE_URL` to the base address for your OAuth connector endpoints. If set, the OAuth connector webhook URLs are constructed as `WEBHOOK_BASE_URL/connectors/${provider}/webhook`. This option is required to enable automatic ingestion from cloud storage.

4. Save the `.env` file.

5. For each connector, register the OpenRAG redirect URIs in your OAuth apps:

   - Local deployments: `http://localhost:3000/auth/callback`
   - Production deployments: `https://your-domain.com/auth/callback`

   The redirect URIs are used for the cloud storage connector webhooks. For Google, the redirect URIs are also used to redirect users back to OpenRAG after they sign in.

6. Restart your OpenRAG containers:

   Docker:

   ```shell
   docker compose up -d
   ```

   Podman:

   ```shell
   podman compose up -d
   ```

7. Access the OpenRAG frontend at `http://localhost:3000`.

   If you provided Google OAuth credentials, you must sign in with Google before you are redirected to your OpenRAG instance.
Ingest files from cloud storage
To ingest knowledge with a cloud storage connector, do the following:
1. Click Knowledge to view your OpenSearch knowledge base.

2. Click Add Knowledge, and then select a storage provider.

3. On the Add Cloud Knowledge page, click Add Files, and then select the files and folders to ingest from the connected storage.

4. Click Ingest Files.
The selected files are processed in the background through the OpenSearch Ingestion flow.
This is the same OpenSearch Ingestion flow described in About the OpenSearch Ingestion flow, and you can monitor ingestion in the same way.
Ingest knowledge from URLs
When using the OpenRAG chat, you can enter URLs into the chat, and their content is ingested in real time during your conversation.
The chat cannot ingest URLs that end in static document file extensions like .pdf.
To upload these types of files, see Ingest local files and folders and Ingest files with cloud storage connectors.
OpenRAG runs the OpenSearch URL Ingestion flow to ingest web content from URLs. This flow isn't directly accessible from the OpenRAG user interface. Instead, this flow is called by the OpenRAG OpenSearch Agent flow as a Model Context Protocol (MCP) tool. The agent can call this component to fetch web content from a given URL, and then ingest that content into your OpenSearch knowledge base. Like all OpenRAG flows, you can inspect the flow in Langflow, and you can customize it. For more information about MCP in Langflow, see the Langflow documentation on MCP clients and MCP servers.
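The restriction on static document URLs amounts to an extension check on the URL path. The sketch below shows one way such a check could work; the exact extension list OpenRAG uses isn't documented here, so this set is an assumption.

```python
# Hypothetical sketch: reject URLs ending in static document extensions,
# which the chat cannot ingest. The extension set is illustrative.
from urllib.parse import urlparse

STATIC_DOC_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}

def is_ingestible_url(url: str) -> bool:
    """Return True if the URL looks like web content rather than a static
    document file. Query strings are ignored by checking only the path."""
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in STATIC_DOC_EXTENSIONS)
```

URLs that fail a check like this should be uploaded through the local file or cloud storage paths instead.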
Monitor ingestion
Depending on the amount of data to ingest, document ingestion can take a few seconds, minutes, or longer. For this reason, document ingestion tasks run in the background.
In the OpenRAG user interface, a badge is shown on Tasks when OpenRAG tasks are active. Click Tasks to inspect and cancel tasks. Tasks are separated into multiple sections:
- The Active Tasks section includes all tasks that are Pending, Running, or Processing:

  - Pending: The task is queued and waiting to start.
  - Running: The task is actively processing files.
  - Processing: The task is performing ingestion operations.

  To stop an active task, click Cancel. Canceling a task stops processing immediately and marks the ingestion as failed.

- The Recent Tasks section lists recently finished tasks.

  Warning: Completed doesn't mean success. A completed task can report successful ingestions, failed ingestions, or both, depending on the number of files processed. Check the Success and Failed counts for each completed task to determine the overall success rate. Failed means something went wrong during ingestion, or the task was manually canceled.
For each task, depending on its state, you can find the task ID, start time, duration, number of files processed successfully, number of files that failed, and the number of files enqueued for processing.
Ingestion performance expectations
The following performance test was conducted with Docling Serve.
On a local VM with 7 vCPUs and 8 GiB RAM, OpenRAG ingested approximately 5.03 GB across 1,083 files in about 42 minutes. This equates to approximately 2.4 seconds per document, or about 0.43 documents per second.
You can generally expect equal or better performance on developer laptops, and significantly faster performance on servers. Throughput scales with CPU cores, memory, storage speed, and configuration choices, such as the embedding model, chunk size, overlap, and concurrency.
This test returned 12 errors, approximately 1.1 percent of the total files ingested. All errors were file-specific, and they didn't stop the pipeline.
Ingestion performance test details
- Ingestion dataset:

  - Total files: 1,083 items mounted
  - Total size on disk: 5,026,474,862 bytes (approximately 5.03 GB)

- Hardware specifications:

  - Machine: Apple M4 Pro
  - Podman VM:
    - Name: podman-machine-default
    - Type: applehv
    - vCPUs: 7
    - Memory: 8 GiB
    - Disk size: 100 GiB

- Test results:

  ```
  2025-09-24T22:40:45.542190Z /app/src/main.py:231 Ingesting default documents when ready disable_langflow_ingest=False
  2025-09-24T22:40:45.546385Z /app/src/main.py:270 Using Langflow ingestion pipeline for default documents file_count=1082
  ...
  2025-09-24T23:19:44.866365Z /app/src/main.py:351 Langflow ingestion completed success_count=1070 error_count=12 total_files=1082
  ```

- Elapsed time: Approximately 42 minutes 15 seconds (2,535 seconds)

- Throughput: Approximately 2.4 seconds per document (about 0.43 documents per second)
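The throughput and error-rate figures follow directly from the logged counts and the elapsed time:

```python
# Checking the performance arithmetic against the log output above.
total_files = 1082      # from the "Langflow ingestion completed" log line
elapsed_seconds = 2535  # approximately 42 minutes 15 seconds
failed_files = 12       # error_count from the same log line

seconds_per_doc = elapsed_seconds / total_files  # ~2.34 seconds per document
docs_per_second = total_files / elapsed_seconds  # ~0.43 documents per second
error_rate = failed_files / total_files          # ~1.1 percent

print(f"{seconds_per_doc:.2f} s/doc, {docs_per_second:.2f} docs/s, {error_rate:.1%} errors")
```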