Creating a Vector Database for RAG

Praveen NG
4 min read · Jan 4, 2025


With the advancement of LLMs, it is now possible to generate human-like language. They can perform a wide range of tasks, including text generation, translation, summarization, and question answering.

However, LLMs are trained on a vast but fixed corpus, and they have limitations when it comes to answering specialized queries about your own data. For example, your company may have documentation for its products, and you may want answers based on that documentation. Since the LLM was not trained on those documents, it cannot answer queries specific to them. To make matters worse, it may hallucinate and provide inaccurate answers. A similar problem occurs when the data is updated frequently: it is not always practical to retrain an LLM every time the information changes.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that helps generate more accurate, relevant, and up-to-date text in these scenarios. In RAG, the LLM is connected to the data in question, so the most relevant, current information can be retrieved and supplied to the model at query time.
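Conceptually, a RAG pipeline has two stages: first retrieve the documents most relevant to the query, then pass them to the LLM as context. Here is a minimal sketch of that flow; the retriever and llm objects are placeholders standing in for whatever retriever and chat model you wire up later:

def answer_with_rag(query, retriever, llm):
    # Stage 1: fetch the chunks most relevant to the query
    # from the vector database.
    docs = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in docs)

    # Stage 2: ask the LLM to answer using only the retrieved context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.invoke(prompt)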

Text Embedding

LLMs cannot connect to your data in its raw format (such as plain text). For retrieval to work efficiently, the data needs to be embedded and stored in a vector database. Embedding is the process of converting text into a vector representation: each piece of text is mapped to a point in a high-dimensional space, and similar or related texts are represented by vectors that lie close to each other. This makes it possible to search the database for the data most relevant to a user query.
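To make "close to each other" concrete, here is a toy example using cosine similarity, a common measure of how aligned two vectors are. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

import numpy as np

# Toy "embeddings": cat and kitten point in similar directions, car does not.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.85, 0.15, 0.05])
car = np.array([0.1, 0.9, 0.2])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(cat, kitten))  # near 1.0: related concepts
print(cosine_similarity(cat, car))     # much smaller: unrelated concepts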

The nice thing is that there are many libraries for embedding text, and it is easy to create a vector database. In this demo, we will use the OpenAIEmbeddings class provided by the langchain_openai package.

Creating a Vector Database

In this demo, I will show a simple way to create a vector database. We will use the popular novel Crime and Punishment by Fyodor Dostoyevsky, which is available for free download from Project Gutenberg. You can use any other text file instead. Follow the steps below to create a vector database.

Step 1: Get an OpenAI API key and set it as an environment variable.

$ # On Linux/macOS
$ export OPENAI_API_KEY="your-api-key-here"

$ # On Windows (Command Prompt; no quotes around the value)
$ set OPENAI_API_KEY=your-api-key-here
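If you want, you can verify from Python that the key is visible before proceeding; this is just an optional sanity check:

import os

# Fails early with a clear message if the key was not exported.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"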

Step 2: Download the file from Project Gutenberg and store it inside a directory called data. We save it with a .md extension so it matches the loader's glob pattern in Step 5. As mentioned above, you can use your own text document instead.

$ mkdir data
$ wget https://www.gutenberg.org/cache/epub/2554/pg2554.txt -O data/crime_and_punishment.md

Step 3: Create a virtual environment and install all necessary packages. I use uv to create a virtual environment. You can see how to get started with uv in a previous article.

$ uv venv venv
$ source venv/bin/activate

Alternatively, you can use the virtualenv package.

$ python3 -m virtualenv venv
$ source venv/bin/activate

Next, create a requirements.txt file.

# requirements.txt
langchain==0.3.13
langchain-community==0.3.13
langchain-chroma==0.1.4
unstructured==0.16.11
markdown==3.7
langchain_openai==0.2.14
chromadb==0.5.23
pysqlite3-binary==0.5.4

Then, install the packages using uv or pip.

$ # run this if you are using uv
$ uv pip install -r requirements.txt
$ # run this if you are using virtualenv
$ pip install -r requirements.txt

Step 4: Use pysqlite3 instead of sqlite3

On some systems, there can be a version conflict between Chroma DB and the default sqlite3 module. To avoid this, we swap sqlite3 for pysqlite3. Note that this swap must run before chromadb (or the Chroma vector store) is imported.

import sys

# Import pysqlite3 and register it under the name "sqlite3",
# so that chromadb picks it up instead of the system sqlite3.
__import__("pysqlite3")
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

Step 5: Load documents into memory. We use a utility class called DirectoryLoader, which allows loading multiple files at once.

from langchain_community.document_loaders import DirectoryLoader

# Load every Markdown file in the data/ directory.
loader = DirectoryLoader('data', glob='*.md')
docs = loader.load()
print(len(docs))

The above code should print the number of documents loaded. In my case, I used only one document (crime_and_punishment.md), so it prints “1” to the screen.

Note: If your files have a different extension like .txt, change the glob parameter accordingly.
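To sanity-check the load, you can peek at the first document; page_content and metadata are standard attributes of LangChain Document objects:

# Preview the beginning of the text and the source metadata.
print(docs[0].page_content[:200])
print(docs[0].metadata)  # e.g. {'source': 'data/crime_and_punishment.md'}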

Step 6: Split data into smaller chunks.

Since a document can be very long, we split it into smaller chunks, and each chunk is then embedded separately. You can choose the chunk size, chunk overlap, and other parameters; below, I chose some reasonable values.

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # maximum characters per chunk
    chunk_overlap=300,     # characters shared between consecutive chunks
    length_function=len,
    add_start_index=True,  # record each chunk's position in the source
)
chunks = splitter.split_documents(docs)
print(len(chunks))
print(len(chunks))

The whole text is split into 1719 chunks, so “1719” is printed on the screen. If you use a different input file or different parameter values, you may get a different number.
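Each chunk is itself a Document. Because we set add_start_index=True, its metadata records where the chunk begins in the source file, which is handy for tracing answers back to the original text:

# Inspect one chunk and its provenance (index 10 chosen arbitrarily).
chunk = chunks[10]
print(chunk.page_content[:200])
print(chunk.metadata)  # includes 'source' and 'start_index'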

Step 7: Embedding and storing data

The final step is to convert each chunk into a vector representation and store it in a vector database. As mentioned earlier, we use OpenAIEmbeddings for embedding, and we use Chroma DB to store data.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Embed each chunk with OpenAI and persist the vectors to disk.
embedding_function = OpenAIEmbeddings()
db = Chroma.from_documents(
    chunks,
    embedding_function,
    persist_directory='chroma',
)

If everything works, you will now see a new directory called chroma. It contains the vector database artifacts, and you will be able to connect it to an LLM and run queries against it. I will demonstrate that in another post.
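Even before wiring up an LLM, you can query the store directly as a quick sanity check; similarity_search embeds the query and returns the closest chunks (the query string here is just an example):

# Return the three chunks nearest to the query embedding.
results = db.similarity_search("Why did Raskolnikov confess?", k=3)
for doc in results:
    print(doc.metadata.get("start_index"), doc.page_content[:120])

# In a later session, re-open the persisted store like this:
# db = Chroma(persist_directory='chroma', embedding_function=embedding_function)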

Summary

In this article, I demonstrated how easy it is to create a vector database from your own documents. It can then be connected to an LLM to set up a RAG system. This way, you will be able to query your own data and generate accurate, up-to-date responses.

If you find this article useful, consider giving it a clap, and following me.
