Ingest Arbitrary Documents
A set of backend APIs is provided to ingest and process arbitrary document data sent directly to the backend API server. This is generally used for:
This example creates a new Document in Danswer of the “Web” type. This document will now show up in Danswer’s search flows like any other webpage pulled in by a Web connector.
Note: The Bearer auth token is generated on server startup in Danswer MIT. There is better API Key support as part of Danswer EE.
See below for a breakdown of the different fields provided:
id: The unique ID of the document. If a document with this ID already exists, it will be updated/replaced. If not provided, a document ID is generated from the semantic_identifier field instead and returned in the response.
sections: A list of sections, each containing textual content and an optional link. Document chunking tries to avoid splitting sections internally and favors splitting at section borders. The link of the document at query time is the link of the best-matched section.
source: The source type. The full list can be checked by searching for DocumentSource here.
semantic_identifier: The “Title” of the document as shown in the UI (see image below).
metadata: Used for the “Tags” feature, which is displayed in the UI. The values can be either strings or lists of strings.
doc_updated_at: The time that the document was last considered updated. By default, there is a time-based score decay around this value when the document is considered during search.
cc_pair_id: The “Connector” ID seen on the Connector Status pages. For example, if running locally, it might be the one shown at http://localhost:3000/admin/connector/2. This allows attaching the ingestion doc to existing connectors so they can be assigned to groups or deleted together with the connector. If not provided or set to 1 explicitly, it is considered part of the default catch-all connector.
For even more details, the code for the relevant object is found here, called “DocumentBase”.
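As a rough illustration of how the fields above fit together, the following Python sketch assembles them into a plain dictionary. The helper name build_ingestion_fields is hypothetical, and several details are assumptions not confirmed by this page: the "text" and "link" keys inside each section, the "web" string for the source value, the ISO 8601 string for doc_updated_at, and which fields may be omitted. The endpoint URL and the exact shape of the request body are not covered here, so the sketch stops at building the fields.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_ingestion_fields(
    semantic_identifier: str,
    sections: list[dict[str, str]],
    source: str = "web",  # assumed string value for the "Web" source type
    doc_id: Optional[str] = None,
    metadata: Optional[dict[str, Any]] = None,
    doc_updated_at: Optional[datetime] = None,
    cc_pair_id: Optional[int] = None,
) -> dict[str, Any]:
    """Hypothetical helper: collect the documented fields into one dict."""
    for section in sections:
        # Assumed key name: each section carries its content under "text";
        # the "link" key is optional.
        if "text" not in section:
            raise ValueError("each section needs textual content under 'text'")

    fields: dict[str, Any] = {
        "sections": sections,
        "source": source,
        "semantic_identifier": semantic_identifier,
    }
    # Optional fields are left out entirely when not supplied, so that
    # server-side defaults (generated ID, default connector) can apply.
    if doc_id is not None:
        fields["id"] = doc_id
    if metadata is not None:
        fields["metadata"] = metadata
    if doc_updated_at is not None:
        # Assumed wire format: ISO 8601 timestamp string.
        fields["doc_updated_at"] = doc_updated_at.isoformat()
    if cc_pair_id is not None:
        fields["cc_pair_id"] = cc_pair_id
    return fields


example = build_ingestion_fields(
    semantic_identifier="Example page title",
    sections=[{"text": "Body text of the page.", "link": "https://example.com"}],
    doc_id="ingestion_example_1",
    metadata={"category": "demo", "tags": ["a", "b"]},
    doc_updated_at=datetime(2024, 1, 2, tzinfo=timezone.utc),
)
```

Leaving out doc_id and cc_pair_id mirrors the defaults described above: the ID would then be derived from semantic_identifier, and the document would fall under the default catch-all connector.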
An API is also provided to fetch all of the documents that have been indexed via the Ingestion API.
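A minimal sketch of preparing such a fetch request in Python is shown below. The helper name build_list_request is hypothetical, the use of an HTTP GET is an assumption, and the URL is left to the caller because this page does not state the endpoint path. Only the Bearer token scheme comes from the note above.

```python
import urllib.request


def build_list_request(url: str, token: str) -> urllib.request.Request:
    """Hypothetical helper: prepare (but do not send) an authenticated request.

    `url` must be the real listing endpoint of your deployment; it is not
    specified in this document. GET is assumed as the HTTP method.
    """
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values for illustration only.
request = build_list_request("http://localhost:8080/<endpoint-path>", "<token>")
```

The prepared request could then be sent with urllib.request.urlopen(request) once the placeholder URL and token are replaced with real values.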