Understanding the Problem
Data Streams in Agentforce Data Cloud are designed to ingest and index data from various sources, including web pages and files. PDFs, however, are unstructured data, and a standard Data Stream is not equipped to handle them, so their content often fails to ingest or index.
Teams also run into the limit of 10 search indexes per data source. This limit becomes restrictive when building retrievers across multiple websites or PDF sources, and working within it requires a different approach to ingesting and indexing data.
Solution Overview
To ingest and index PDFs in Agentforce Data Cloud, teams can use either the Agentforce Data Library feature or an External Blob Store with an Unstructured Data Stream. The Agentforce Data Library is a no-code option for uploading files directly. An External Blob Store such as Azure Blob Storage or Amazon S3 takes more setup but offers more automation and flexibility.
For teams that need to ingest and index PDFs on a regular basis, using an External Blob Store with an Unstructured Data Stream is the recommended approach. This involves moving the PDFs to a cloud storage bucket, creating a Data Stream using the Cloud Storage Connector, and mapping it to the Unstructured Data Lake Object (UDLO).
Step-by-Step Solution
Here are the steps to ingest and index PDFs in Agentforce Data Cloud using an External Blob Store with an Unstructured Data Stream:
- Move the PDFs to a cloud storage bucket like Azure or AWS S3
- Create a Data Stream using the Cloud Storage Connector
- Map the Data Stream to the Unstructured Data Lake Object (UDLO)
- Configure the Data Stream to ingest and index the PDFs
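The first step above, moving PDFs into a bucket, mostly comes down to picking which files to send and which key each one gets. Here is a minimal sketch of that planning step, assuming a flat list of local file names and a single folder prefix. The helper name planPdfUploads, the file names, and the "pdf-prefix" value are all hypothetical, and the upload call itself is left out because it depends on the storage provider's SDK.

```javascript
// Hypothetical helper: decide which local files belong in the bucket and under which key.
// The bucket layout and the "pdf-prefix" value are assumptions for illustration.
function planPdfUploads(fileNames, prefix) {
  // Trim leading and trailing slashes so keys never contain a double slash.
  const cleanPrefix = prefix.replace(/^\/+|\/+$/g, "");
  return fileNames
    // Keep only PDFs, matching the extension case-insensitively.
    .filter((name) => name.toLowerCase().endsWith(".pdf"))
    .map((name) => ({
      source: name,
      key: cleanPrefix ? `${cleanPrefix}/${name}` : name,
      contentType: "application/pdf",
    }));
}

const plan = planPdfUploads(
  ["q1-report.pdf", "notes.txt", "Q2-Report.PDF"],
  "pdf-prefix/"
);
console.log(plan.map((entry) => entry.key));
```

Keeping every PDF under one prefix means the Data Stream only needs to point at a single location in the bucket.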
Data Stream Configuration
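// Illustrative Data Stream settings. The field names are descriptive, not an official Data Cloud schema.
const dataStream = {
  name: "PDF Ingestion Data Stream",
  connector: "Cloud Storage Connector",
  bucket: "pdf-bucket", // storage bucket that holds the PDFs
  prefix: "pdf-prefix"  // folder or key prefix to scan within the bucket
};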
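Before entering these settings in the Data Stream setup, it can help to check that none of them is blank. The sketch below is a hypothetical pre-flight check, not part of any Salesforce API. The function name findConfigProblems and the list of required fields are assumptions that simply mirror the illustrative object above.

```javascript
// Hypothetical pre-flight check. The required field names mirror the illustrative
// configuration object and are not an official Data Cloud schema.
const REQUIRED_FIELDS = ["name", "connector", "bucket", "prefix"];

function findConfigProblems(config) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = config[field];
    // Treat anything that is not a non-blank string as a problem.
    if (typeof value !== "string" || value.trim() === "") {
      problems.push(`Missing or empty field: ${field}`);
    }
  }
  return problems;
}

// Example: the bucket name was left out.
const candidate = {
  name: "PDF Ingestion Data Stream",
  connector: "Cloud Storage Connector",
  prefix: "pdf-prefix",
};
console.log(findConfigProblems(candidate));
```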
Overcoming the Search Index Limit
To overcome the 10 search index limit per data source, teams can adopt a consolidated data strategy. This involves ingesting multiple PDF sources into a single Unstructured DMO and using filter logic to separate the data.
The limit counts search indexes, not documents, so a separate index for every website or PDF collection uses up the allowance quickly. Consolidating the data into a single Unstructured DMO with one index lets teams keep adding sources without hitting the limit.
Here are the steps to overcome the search index limit:
- Ingest multiple PDF sources into a single Unstructured DMO
- Add a field to the DMO to separate the data (e.g., Source_Type or Category)
- Create a single Search Index on the consolidated DMO
- Use filter logic in the Retriever to separate the data (e.g., only look at records where Category = ‘Quarterly Reports’)
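The consolidation steps above can be sketched in a few lines. This is only a model of the idea: in Data Cloud the filter is configured on the Retriever, not written as application code. The record shape, the file names, and the category values are assumptions, and the sketch assumes the unconsolidated approach would use one search index per category.

```javascript
// Illustrative only: record shape, file names, and categories are assumptions.
const consolidatedRecords = [
  { fileName: "q1-report.pdf", Category: "Quarterly Reports" },
  { fileName: "q2-report.pdf", Category: "Quarterly Reports" },
  { fileName: "install-guide.pdf", Category: "Product Manuals" },
  { fileName: "pricing-faq.pdf", Category: "Sales FAQs" },
];

// One index covers every record; the category filter narrows what a retriever sees.
function retrieveByCategory(records, category) {
  return records.filter((record) => record.Category === category);
}

// Count the search indexes each strategy would need, assuming one index
// per category when the sources are kept separate.
function indexesNeeded(records, consolidated) {
  if (consolidated) return 1;
  return new Set(records.map((record) => record.Category)).size;
}

console.log(retrieveByCategory(consolidatedRecords, "Quarterly Reports").length);
console.log(indexesNeeded(consolidatedRecords, false));
console.log(indexesNeeded(consolidatedRecords, true));
```

In the separate-source approach the index count grows with every new category, while the consolidated approach stays at one index however many categories are added.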
Checklist for Ingesting and Indexing PDFs
- Use the Agentforce Data Library or an External Blob Store with an Unstructured Data Stream
- Move PDFs to a cloud storage bucket like Azure or AWS S3
- Create a Data Stream using the Cloud Storage Connector
- Map the Data Stream to the Unstructured Data Lake Object (UDLO)
- Configure the Data Stream to ingest and index the PDFs
- Consolidate data into a single Unstructured DMO
- Use filter logic to separate the data
What is the recommended approach for ingesting and indexing PDFs in Agentforce Data Cloud?
The recommended approach is to use an External Blob Store with an Unstructured Data Stream.
How can teams overcome the 10 search index limit per data source?
Teams can overcome the limit by consolidating data into a single Unstructured DMO and using filter logic to separate the data.
What is the Agentforce Data Library?
The Agentforce Data Library is a no-code solution that allows teams to upload files directly.
Can teams use the Agentforce Data Library for regularly updated PDFs?
While the Agentforce Data Library can be used for regularly updated PDFs, it may require manual re-uploading or a custom flow.