Working With LLMs: Embeddings, Vector Databases, And Search
In this technical guide, we focus on the essential steps of text processing: text vectorization, vector search, and question answering.
This article is part 2 of 5 in our series.
You can download the code from the link in the description.
The process begins with converting text into embedding vectors and storing them in a vector index or library, such as FAISS, for efficient retrieval. This enables fast document search based on a query. The relevant documents are then fed into a language model as additional context, grounding and improving its responses. While this workflow employs FAISS (a vector library), ChromaDB (a vector database), and a Hugging Face model, it is designed to be modular, so you can swap in alternative tools or models as you prefer.
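For orientation, here is a minimal sketch of that workflow, assuming the sentence-transformers library and an arbitrary model choice (all-MiniLM-L6-v2); the documents and query below are placeholders, not the dataset used later in this guide.

import faiss
from sentence_transformers import SentenceTransformer

# Hypothetical corpus and query, for illustration only
documents = ["GPUs accelerate model training.", "Central banks raised interest rates."]
query = "hardware for deep learning"

# 1. Convert text into embedding vectors
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, any encoder works
doc_vectors = model.encode(documents)  # array of shape (num_docs, dim)

# 2. Store the vectors in a FAISS index for efficient retrieval
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# 3. Search: embed the query and retrieve the closest document
distances, ids = index.search(model.encode([query]), 1)
context = documents[ids[0][0]]

# 4. The retrieved text would then be passed to a language model as extra context
print(context)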
Imports
%pip install faiss-cpu==1.7.4 chromadb==0.3.21
This code uses the pip command to install two packages: faiss-cpu==1.7.4 and chromadb==0.3.21. These packages provide the embedding storage and search capabilities used throughout this guide on large language models (LLMs). Specifically, faiss-cpu is a library for efficient similarity search and clustering of dense vectors, while chromadb is an open-source vector database for storing and querying embeddings along with their documents and metadata. Installing these packages sets up the tools and dependencies needed for efficient, accurate vector search.
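Note that the install names differ from the import names: the faiss-cpu package is imported as plain faiss. A quick sanity check:

import faiss      # installed above as faiss-cpu
import chromadb

print(faiss.__version__)     # should print 1.7.4
print(chromadb.__version__)  # should print 0.3.21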
Reading Data
import pandas as pd
pdf = pd.read_csv(f"{DA.paths.datasets}/news/labelled_newscatcher_dataset.csv", sep=";")
pdf["id"] = pdf.index
display(pdf)
This code snippet uses the pandas library to read a CSV file containing the labelled newscatcher dataset, located in the news folder under the directory given by the DA.paths.datasets variable. The values are separated by semicolons, hence sep=";". An id column is then added to the DataFrame, using the row index to uniquely identify each row. Finally, the DataFrame is displayed as a formatted table.
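Before moving on, a quick inspection can confirm the load; the column names come from the dataset file itself, and the title column is the one embedded later:

print(pdf.shape)             # (number of rows, number of columns)
print(pdf.columns.tolist())  # includes "title", which is embedded below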
Vector Library: FAISS
Vector libraries are typically suited to managing small, static datasets. Unlike full-fledged database solutions, they lack CRUD (Create, Read, Update, Delete) capabilities: if documents need to be added, removed, or edited, the vector index must be rebuilt from scratch.
However, vector libraries have their advantages: they are easy to use, lightweight, and fast. Some notable examples include FAISS, ScaNN, ANNOY, and HNSW.
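To make the rebuild constraint concrete, here is a minimal sketch; build_index is a hypothetical helper, and the random vectors stand in for real embeddings. Any change to the corpus means re-embedding the documents and reconstructing the index.

import faiss
import numpy as np

def build_index(vectors: np.ndarray) -> faiss.IndexFlatL2:
    """Rebuild a FAISS index from scratch for the current set of vectors."""
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index

vectors = np.random.rand(100, 384).astype("float32")  # stand-in embeddings
index = build_index(vectors)

# Adding, removing, or editing documents means rebuilding, not updating in place:
vectors = np.vstack([vectors, np.random.rand(5, 384).astype("float32")])
index = build_index(vectors)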
FAISS, in particular, provides various methods for similarity search, such as L2 (Euclidean distance) and cosine similarity. Its documentation includes a detailed guide on implementation, best practices, and how FAISS compares with other vector libraries and databases, making it a valuable resource for anyone interested in efficient similarity search.
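As a brief sketch of those two metrics (using random stand-in vectors): L2 distance uses IndexFlatL2, while cosine similarity is typically obtained by L2-normalizing the vectors and searching an inner-product index, IndexFlatIP.

import faiss
import numpy as np

vectors = np.random.rand(10, 64).astype("float32")
query = np.random.rand(1, 64).astype("float32")

# Euclidean (L2) distance: smaller scores mean more similar
l2_index = faiss.IndexFlatL2(vectors.shape[1])
l2_index.add(vectors)
l2_distances, l2_ids = l2_index.search(query, 3)

# Cosine similarity: normalize in place, then use inner product (larger is more similar)
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)
ip_index = faiss.IndexFlatIP(vectors.shape[1])
ip_index.add(vectors)
cosine_scores, cosine_ids = ip_index.search(query, 3)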
from sentence_transformers import InputExample

pdf_subset = pdf.head(1000)

def example_create_fn(doc1: str) -> InputExample:
    """
    Helper function that wraps a piece of text in a sentence_transformers InputExample
    """
    return InputExample(texts=[doc1])

faiss_train_examples = pdf_subset.apply(
    lambda x: example_create_fn(x["title"]), axis=1
).tolist()
This code imports the InputExample class from the sentence_transformers library and defines a helper function, example_create_fn, that takes a piece of text and wraps it in an InputExample object. It then builds faiss_train_examples by applying example_create_fn to the title column of the first 1,000 rows of the DataFrame (pdf_subset) and converting the result to a list. The outcome is a list of InputExample objects, each containing one news headline, ready to be embedded and indexed in the steps that follow.
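To verify the result, you can inspect the first element; each InputExample simply holds one headline in its texts field:

print(faiss_train_examples[0].texts)  # a one-element list containing the first headline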