How to Augment ChatGPT with Custom Data

Ken Huang
May 2, 2023


Large language models (LLMs) like ChatGPT and GPT-4 are powerful tools for generating high-quality text for various applications. However, these models are limited to the information contained within their training datasets, which can be problematic for applications that require domain-specific or technical knowledge.

To solve this problem, we can provide context to LLMs by augmenting them with our own documents and data. This can be achieved with document embeddings, which are numerical vectors that capture the semantic content of text. By creating embeddings for our documents, finding the documents most relevant to a user’s prompt, and passing them to the model as context, we can get context-aware responses from the LLM.
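As a rough illustration, the retrieval step can be as simple as comparing the prompt’s embedding against each document’s embedding with cosine similarity. In the sketch below, embed() is a hypothetical helper standing in for whichever embedding service you choose:

```python
# Minimal sketch of embedding-based retrieval. embed() is a placeholder,
# not a real API: assume it returns a list of floats for a piece of text.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 via chat and email.",
]

# Embed every document once, up front (embed() is an assumed helper).
doc_embeddings = [embed(doc) for doc in documents]

prompt = "How long do customers have to return an item?"
prompt_embedding = embed(prompt)

# Pick the document most similar to the prompt ...
scores = [cosine_similarity(prompt_embedding, e) for e in doc_embeddings]
best_doc = documents[int(np.argmax(scores))]

# ... and prepend it as context for the LLM.
augmented_prompt = f"Context:\n{best_doc}\n\nQuestion: {prompt}"
```

In practice you would precompute and cache the document embeddings rather than recomputing them on every request.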

To implement this framework, we need to create embeddings for our documents and use a vector database to store and query them. We can create the embeddings with online services like OpenAI’s Embeddings API, with models hosted on Hugging Face, or with custom transformer models. To store and query the embeddings, we can use vector search tools such as Faiss (Meta’s open-source library) or a managed vector database like Pinecone.
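Here is a minimal sketch of that approach, assuming the openai (v1-style client) and faiss-cpu Python packages with an API key set in the environment; the model name and the flat inner-product index are illustrative choices, not requirements:

```python
# Sketch: build a Faiss index over OpenAI embeddings and query it.
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts, model="text-embedding-ada-002"):
    """Call the OpenAI Embeddings API and return a float32 matrix."""
    response = client.embeddings.create(input=texts, model=model)
    return np.array([item.embedding for item in response.data], dtype="float32")

documents = ["doc one ...", "doc two ...", "doc three ..."]
vectors = embed_texts(documents)

# Normalize so inner product behaves like cosine similarity, then index.
faiss.normalize_L2(vectors)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# Query: embed the prompt and fetch the two nearest documents.
query = embed_texts(["What does the refund policy say?"])
faiss.normalize_L2(query)
scores, ids = index.search(query, k=2)
top_documents = [documents[i] for i in ids[0]]
```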

When implementing this framework, we need to respect the model’s token limit and use the same embedding model consistently across the entire application, since vectors produced by different models are not comparable. We can also retrieve several documents whose embeddings are similar to the prompt and pass them all as context to improve the response.
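One simple way to stay within the token limit is to add retrieved documents, most relevant first, until a token budget is exhausted. The sketch below assumes the tiktoken package for counting tokens and a retrieve() helper like the Faiss search above; the 3,000-token budget is an arbitrary example:

```python
# Sketch: fit several retrieved documents into a token budget before prompting.
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
MAX_CONTEXT_TOKENS = 3000  # illustrative budget, not a fixed rule

def build_context(retrieved_docs, budget=MAX_CONTEXT_TOKENS):
    """Concatenate retrieved documents, most relevant first, until the budget is spent."""
    context, used = [], 0
    for doc in retrieved_docs:  # assumed to be sorted by similarity
        tokens = len(encoder.encode(doc))
        if used + tokens > budget:
            break
        context.append(doc)
        used += tokens
    return "\n\n".join(context)

# retrieve() is a placeholder for the vector search shown earlier.
context = build_context(retrieve("What does the refund policy say?", k=5))
```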

While fine-tuning LLMs is a good option, it can be costly and complicated. Using context embeddings is an easy and cost-effective option to augment LLMs with our own data and documents. Eventually, with a good data-collection pipeline, we can improve our system by fine-tuning a model for our specific purposes.

A good data-collection pipeline for augmenting LLMs with custom data and documents involves several steps (a minimal code sketch follows the list):

  1. Identify the domain-specific or technical knowledge that the LLM needs to acquire.
  2. Collect relevant documents and data from various sources such as web pages, databases, or APIs.
  3. Preprocess the data by cleaning and filtering out irrelevant information. You may also need to extract specific information such as keywords, entities, or topics.
  4. Create embeddings for the preprocessed data using a suitable embedding model, such as word2vec or BERT.
  5. Store the embeddings in a vector database for efficient querying and retrieval.
  6. Use the embeddings to match user prompts with the most relevant documents and provide context-aware responses from the LLM.
  7. Monitor the performance of the system and continually update and refine the data collection pipeline to improve the LLM’s accuracy and effectiveness.
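Tying these steps together, the pipeline might look roughly like the sketch below. Every name here (fetch_documents, clean, embed_texts, vector_store, llm) is a placeholder for your own data sources, preprocessing, embedding model, vector store, and LLM call:

```python
# Minimal end-to-end sketch of the pipeline above; all helpers are placeholders.
def build_pipeline(sources, vector_store):
    # Steps 2-3: collect raw documents and strip irrelevant content.
    raw_docs = [doc for source in sources for doc in fetch_documents(source)]
    cleaned = [clean(doc) for doc in raw_docs]

    # Steps 4-5: embed the cleaned text and store the vectors.
    vectors = embed_texts(cleaned)
    vector_store.add(texts=cleaned, vectors=vectors)
    return vector_store

def answer(prompt, vector_store, llm):
    # Step 6: retrieve matching documents and prompt the LLM with them as context.
    matches = vector_store.search(embed_texts([prompt])[0], k=3)
    context = "\n\n".join(matches)
    return llm(f"Context:\n{context}\n\nQuestion: {prompt}")
```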

Examples of data collection pipelines include:

  1. Collecting and preprocessing financial data such as stock prices, earnings reports, and news articles to train an LLM to provide investment advice.
  2. Collecting and preprocessing medical data such as patient records, research papers, and drug information to train an LLM to assist doctors in diagnosing and treating patients.
  3. Collecting and preprocessing legal data such as case law, statutes, and regulations to train an LLM to assist lawyers in legal research and document drafting.
  4. Collecting and preprocessing customer support data such as FAQs, support tickets, and chat logs to train an LLM to provide personalized and accurate customer support.

Written by Ken Huang

Research VP of Cloud Security Alliance Great China Region and honored IEEE Speaker on AI and Web3. My book on Amazon: https://www.amazon.com/author/kenhuang
