Source

Source refers to the origin of the RAG pipeline's data, and loading it is the initial step in building the pipeline. The beyondllm.source module offers a range of loaders to ingest and process data from diverse sources, making it easy to bring your own data into the pipeline for question answering and information retrieval.

The central function for loading data is fit, which provides a unified interface for data loading and processing regardless of the source type. The fit function includes a dtype parameter that supports various data types such as PDFs (pdf), comma-separated values (csv), Word documents (docx), Markdown files (md), EPUB books (epub), PowerPoint presentations (ppt, pptx, pptm), web pages (url), YouTube videos (youtube), the LlamaParse Cloud API (llama-parse), and Notion pages (notion). The keyword in parentheses is the value to pass as dtype for each source type. In addition, fit takes the file location in the "path" parameter and accepts "chunk_size" and "chunk_overlap" to control how the text is split. Under the hood, dedicated loaders (like simpleLoader, UrlLoader, etc.) handle the different data formats, such as PPT files and web pages, separately.
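
Here is a minimal sketch of loading a document with fit. The file path and chunking values are placeholders; the keyword arguments mirror the parameters described above.

```python
from beyondllm import source

# Load and chunk a local PDF. "path" points to a placeholder file,
# and chunk_size/chunk_overlap control how the text is split.
data = source.fit(
    path="docs/annual_report.pdf",  # placeholder path
    dtype="pdf",                    # keyword for the source type
    chunk_size=512,
    chunk_overlap=50,
)

# The same call works for other sources, e.g. a YouTube video:
# data = source.fit(path="https://www.youtube.com/watch?v=VIDEO_ID", dtype="youtube")
```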

Embeddings

Similar to other RAG pipelines, we require embeddings for the loaded data. BeyondLLM provides several embedding models with different characteristics and performance levels. The default embedding model is the Gemini Embedding model. Here's an overview of the available options:

  • Hugging Face Embeddings - This option uses models from the Hugging Face Hub directly, loading the model and processing your data locally. The parameters are "model_name" and "api_key" (the Hugging Face access token).
  • OpenAI Embeddings - This option uses OpenAI's embedding models, known for their quality and performance. The parameters are "api_key" and "model_name" (which can be text-embedding-ada-002, for instance).
  • Qdrant Fast Embeddings - This option utilizes fast and efficient embedding models optimized for Qdrant, a vector similarity search engine. It takes only one parameter, "model_name".
  • Gemini Embeddings - The default embedding model for the Auto Retriever. It leverages Google's Gemini text embedding model and offers a robust way to generate text representations within BeyondLLM. The parameters are "api_key" and "model_name" (the default is models/embedding-001). A short instantiation sketch follows this list.
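
As a rough illustration, any of these models can be created the same way and passed on to the rest of the pipeline. This is a minimal sketch; the class and parameter names follow the options listed above, but treat them as assumptions and check the beyondllm.embeddings module for the exact spelling.

```python
from beyondllm import embeddings

# Default: Google's Gemini text embeddings (requires a Google API key).
embed_model = embeddings.GeminiEmbeddings(
    api_key="YOUR_GOOGLE_API_KEY",
    model_name="models/embedding-001",
)

# Alternatives, depending on your requirements (class names assumed):
# embed_model = embeddings.HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
# embed_model = embeddings.OpenAIEmbeddings(api_key="YOUR_OPENAI_API_KEY", model_name="text-embedding-ada-002")
# embed_model = embeddings.FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5")
```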

Selecting the best embedding model depends on your specific requirements, such as desired accuracy, performance, cost, and integration preferences. BeyondLLM also enables you to fine-tune embedding models with your own data for improved accuracy. You can fine-tune any model from Hugging Face by preparing a list of files and using the train function, specifying the model's output path.
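
The snippet below sketches this fine-tuning flow. It assumes a FineTuneEmbeddings helper whose train method takes the file list, the Hugging Face model name, an LLM for generating training pairs, and an output path; treat the class, method, and argument names as illustrative rather than exact.

```python
from beyondllm.embeddings import FineTuneEmbeddings  # assumed import path
from beyondllm import llms

# Hypothetical fine-tuning run: file names, model name, and output path are placeholders.
llm = llms.GeminiModel(model_name="gemini-pro", google_api_key="YOUR_GOOGLE_API_KEY")

fine_tuner = FineTuneEmbeddings()
embed_model = fine_tuner.train(
    ["docs/handbook.pdf", "docs/faq.md"],  # your training files
    "BAAI/bge-small-en-v1.5",              # any Hugging Face embedding model
    llm,                                   # LLM used to build training pairs
    "finetuned-embeddings",                # output path for the tuned model
)
```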

Vector Databases

Vector databases in BeyondLLM (which store the vector embeddings) are optimized for similarity search, making them essential for effective RAG applications. Available Vector Databases are:

  • Chroma - It is a powerful and purpose-built vector database designed for high-performance similarity search and accessibility. Parameters for beyondllm.vectordb.ChromaVectorDb are: collection_name (name of the collection within Chroma to store your embeddings) and persist_directory (optional directory path to persist the Chroma database on disk).
  • Pinecone - Pinecone is a fully managed vector database service designed to provide high performance and scalability for similarity search applications. Parameters for beyondllm.vectordb.PineconeVectorDb are:

    api_key - Pinecone API key for accessing the service.
    index_name - The name of the index within Pinecone where embeddings will be stored.
    create - Whether a new index should be created or not.
    embedding_dim - The dimensionality of the embedding vectors. Required if create=True.
    metric - The distance metric used for similarity search, e.g., cosine or euclidean. Required if create=True.
    spec - The deployment specification. Options are "serverless" (default) or "pod-based".
    cloud - The cloud provider for your serverless Pinecone index.
    region - The region for your serverless Pinecone index.
    pod_type - The pod type for your dedicated Pinecone index.
    replicas - The number of replicas for your dedicated Pinecone index.
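
As a rough setup sketch, either store can be constructed and later handed to the retriever. The collection/index names and key below are placeholders, and the keyword arguments mirror the parameters just listed.

```python
from beyondllm import vectordb

# Local, persistent Chroma store.
vector_store = vectordb.ChromaVectorDb(
    collection_name="my_documents",
    persist_directory="./chroma_store",  # optional; omit for an in-memory store
)

# Or a managed Pinecone index, created on the fly (values are placeholders):
# vector_store = vectordb.PineconeVectorDb(
#     api_key="YOUR_PINECONE_API_KEY",
#     index_name="my-documents",
#     create=True,
#     embedding_dim=768,    # must match your embedding model
#     metric="cosine",
#     spec="serverless",
#     cloud="aws",
#     region="us-east-1",
# )
```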

Choosing the right vector database depends on several factors, including scale and performance, persistence, and features. Consider the expected size of the embedding data and whether it needs to persist across sessions, and evaluate advanced features like filtering, indexing, and scalability.

Auto Retriever

Retrievers in BeyondLLM use the generated embeddings to perform similarity search and identify the most pertinent documents or passages. We call this the auto retriever, since it abstracts away the complexity and lets you define your retrieval type and rerankers in one line. The auto_retriever function from beyondllm.retrieve lets you set your retriever model and integrates seamlessly with vector databases, streamlining the retrieval process. BeyondLLM provides several retriever types, each offering a distinct approach to information retrieval:

  • Normal Retriever: Suitable for straightforward retrieval tasks where basic semantic similarity is sufficient. Parameters are "data" (The dataset containing the text data), "embed_model" (The embedding model used to generate embeddings for the data), and "top_k" (The number of top results to retrieve).
  • Flag Embedding Reranker Retriever: This retriever enhances the normal retrieval process by incorporating a "flag embedding" reranker. The reranker further refines the initial results by considering the relevance of each retrieved document to the specific query, potentially improving retrieval accuracy. Parameters are "data", "embed_model", "top_k", and "reranker" (The name of the flag embedding reranker model).
  • Cross Encoder Reranker Retriever: This retriever uses a cross-encoder model to rerank the initial retrieval results. Cross-encoders directly compare the query and document embeddings, often leading to more accurate relevance assessments. Parameters are the same as for the Flag Embedding Reranker Retriever. Remember that these types of retrievers are useful when higher accuracy is required and computational resources allow for reranking.
  • Hybrid Retriever: This retriever combines the strengths of both vector similarity search and keyword-based search. It is beneficial when dealing with diverse queries or when keyword relevance is important alongside semantic similarity. Parameters are "data", "embed_model", "top_k", and "mode" (which determines how results are combined: AND for the intersection of results, OR for the union). A usage sketch follows this list.
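
In rough terms, the retriever type is selected through a single call. The type strings and reranker name below follow common usage in the BeyondLLM documentation but should be treated as illustrative; `data` and `embed_model` come from the earlier sketches.

```python
from beyondllm import retrieve

# Basic semantic retrieval over the loaded data.
retriever = retrieve.auto_retriever(
    data=data,                 # output of source.fit
    embed_model=embed_model,   # any of the embedding models above
    type="normal",
    top_k=4,
)

# Reranked or hybrid variants differ only in the type (and reranker/mode) arguments:
# retriever = retrieve.auto_retriever(data=data, embed_model=embed_model,
#                                     type="cross-rerank", top_k=4,
#                                     reranker="cross-encoder/ms-marco-MiniLM-L-6-v2")
# retriever = retrieve.auto_retriever(data=data, embed_model=embed_model,
#                                     type="hybrid", top_k=4, mode="OR")
```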

Evaluating Retrievers

Evaluating retrievers helps in measuring retrieval quality, comparing the performance of different retrievers, and optimizing their performance. BeyondLLM offers two key metrics for retriever evaluation:

  • Hit Rate - This metric represents the percentage of queries where the retriever successfully retrieves at least one relevant document from the knowledge base.
  • Mean Reciprocal Rank (MRR) - This metric considers the ranking of relevant documents within the retrieved results. It calculates the reciprocal of the rank of the first relevant document for each query and averages these values across all queries.
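
To make the two metrics concrete, here is a small self-contained sketch (not part of BeyondLLM) that computes hit rate and MRR from the rank of the first relevant document for each query, where 0 means no relevant document was retrieved.

```python
# Rank of the first relevant document for each query (0 = none retrieved).
first_relevant_ranks = [1, 3, 0, 2]

hits = sum(1 for rank in first_relevant_ranks if rank > 0)
hit_rate = hits / len(first_relevant_ranks)  # 3/4 = 0.75

mrr = sum(1 / rank for rank in first_relevant_ranks if rank > 0) / len(first_relevant_ranks)
# (1/1 + 1/3 + 1/2) / 4 ≈ 0.458

print(f"Hit rate: {hit_rate:.2f}, MRR: {mrr:.3f}")
```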

The retriever.evaluate(llm) function facilitates the evaluation process by automatically generating question-answer pairs from your data using the provided Large Language Model (LLM). These QA pairs are then used to assess the retriever's performance based on the hit rate and MRR metrics. An important consideration: generating QA pairs involves multiple calls to the LLM, which can be time-consuming and resource-intensive, especially for a large knowledge base or a high number of questions per text segment. It is therefore essential to use an LLM with strong question-generation capabilities that is well-suited to the domain and content of your knowledge base.
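
A minimal sketch of running the evaluation, reusing the `retriever` built in the earlier sketch and Gemini as the question-generating LLM (the model and key are placeholders, and the GeminiModel parameter names are assumptions):

```python
from beyondllm import llms

llm = llms.GeminiModel(model_name="gemini-pro", google_api_key="YOUR_GOOGLE_API_KEY")

# Generates QA pairs from your data with the LLM,
# then reports hit rate and MRR for the retriever.
results = retriever.evaluate(llm)
print(results)
```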

LLM

A Large Language Model (LLM) is a critical component of a RAG pipeline (as discussed in the prior modules). BeyondLLM uses Google's Gemini as the default LLM, requiring only a Google API key, with Gemini-Pro as the default model. Other LLMs that can be used with BeyondLLM are:

  • ChatOpenAI - This allows you to use state-of-the-art (closed-source) models provided by OpenAI. This functionality requires an API key, which is available through a paid subscription. The "model_name" parameter specifies the model, such as GPT-3.5 or GPT-4.
  • HuggingFaceHubModel - The Hugging Face Hub is a platform offering over 350K overall models and 96K LLMs. The only parameters needed to access these models in BeyondLLM are an access token and the model name.
  • Ollama - Ollama allows you to run models locally and integrate them into your application. To get started, download Ollama and pull the model you need. Before running the OllamaModel, ensure the model is running locally on your terminal, and then specify the model name in the function.
  • AzureOpenAIModel - The Azure OpenAI service provides REST API access to OpenAI's powerful language models, including the GPT-4, GPT-3.5-Turbo, and Embeddings model series. The parameters are: the AzureChatOpenAI API key (for the AzureChatOpenAI service), the deployment name (as created under Model deployments on Azure), the endpoint URL, the model name (e.g., GPT-4), max tokens (the maximum sequence length for the model response), and temperature (which controls the randomness or creativity of responses).

Given that each LLM offers unique functionalities, it's important to consider what you need. For instance, if you want to experiment with various models, the Hugging Face Hub is a good choice. If you prefer running models locally, Ollama is the suitable option.
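
As a closing sketch, here is how a few of these LLMs might be instantiated. The class and parameter names follow the options above but are assumptions to verify against the beyondllm.llms module; keys and model names are placeholders. The chosen model is then passed wherever BeyondLLM expects an LLM, for example retriever.evaluate(llm).

```python
from beyondllm import llms

# Default: Google Gemini (gemini-pro), needing only a Google API key.
llm = llms.GeminiModel(model_name="gemini-pro", google_api_key="YOUR_GOOGLE_API_KEY")

# OpenAI's hosted models (paid API key required).
# llm = llms.ChatOpenAIModel(api_key="YOUR_OPENAI_API_KEY", model="gpt-3.5-turbo")

# Any model from the Hugging Face Hub, via an access token.
# llm = llms.HuggingFaceHubModel(token="YOUR_HF_TOKEN", model="mistralai/Mistral-7B-Instruct-v0.2")

# A locally running Ollama model (make sure the model is already running, e.g. `ollama run llama2`).
# llm = llms.OllamaModel(model="llama2")
```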