Want to build an LLM application without writing a single line of code?
We've got you covered. In this article, we will build an end-to-end RAG application using AI Planet’s GenAI Stack and Google’s Gemma.
GenAI Stack
We’re super excited to introduce the GenAI Stack Studio, our latest effort to make the development of LLM apps and agents accessible to everyone. As a comprehensive end-to-end GenAI platform, we simplify creating production-grade LLM applications through an intuitive drag-and-drop interface.
We believe in empowering communities and making LLM development accessible to everyone, so anyone can rapidly bring their ideas to life, from prototype to production, seamlessly.
Documentation: http://docs.aiplanet.com/
Gemma
Google has recently reentered the open-source arena after a prolonged absence. While the tech giant has previously contributed notable projects such as TensorFlow, Keras, and JAX, as well as groundbreaking models like BERT and T5, the development of an open-source large language model (LLM) was conspicuously absent from Google’s repertoire — until now.
Last week, Google unveiled Gemma, a collection of lightweight, state-of-the-art open models developed using the same research and technology employed in the creation of the Gemini models. Gemma is offered in two variants, a 2B parameter model and a 7B parameter model, each accompanied by an instruction fine-tuned counterpart, 2B-it and 7B-it.
Join the Kaggle competition to explore Gemma in more depth: https://www.kaggle.com/competitions/data-assistants-with-gemma/
Build End-to-End RAG application using GenAI Stack
Let’s build our stack.
Data Loading and Chunking
The initial step is selecting the loader that suits your specific use case. Options include PDF, document, URL, YouTube, subtitles, and more, depending on your requirements. In our case, we choose the PDF loader.
After loading the content, we divide the document into smaller segments. This segmentation is necessary because it lets us surface the relevant chunk when a user submits a query.
Data used in this stack: the Half-Life Regression paper by Duolingo.
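GenAI Stack handles splitting visually, but under the hood a character-based splitter behaves roughly like this sketch (the function name, chunk size, and overlap are illustrative, not the platform's actual defaults):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so information
    spanning a chunk boundary is not lost at retrieval time."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# Toy stand-in for the loaded PDF content
document = "Half-life regression models the probability that a learner " * 20
chunks = chunk_text(document)
```

The overlap is the key design choice: consecutive chunks share their boundary text, which is why a query landing near a chunk edge can still be answered from either neighbor.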
Indexing
Document indexing requires generating embeddings for each chunk and storing them in a vector database. It’s important to use a unique collection name when dragging in any vector store component, and enabling persistence eliminates the need to recreate indexes for existing content.
Vector storage is employed to store document embeddings, with similarity search being utilized for retrieval in this scenario.
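Conceptually, indexing and similarity search look like the sketch below. A real stack uses a neural embedding model and a vector database; here a bag-of-words counter stands in for the embedding so the example stays self-contained (all names are illustrative):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A production stack would call a
    # neural embedding model here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# The index maps a unique collection name to (chunk, vector) pairs
index: dict[str, list[tuple[str, Counter]]] = {}

def add_to_index(collection: str, chunks: list[str]) -> None:
    index[collection] = [(c, embed(c)) for c in chunks]

def similarity_search(collection: str, query: str, k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(index[collection], key=lambda p: cosine(qv, p[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Persistence, as mentioned above, simply means this index survives restarts instead of being rebuilt from the documents each time.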
Ensemble Retriever
RAG comprises two main components: the Retriever and the Generator. The Retriever fetches relevant information based on the user query. In this context, the vector store functions as a Retriever, conducting semantic search and returning the most similar documents.
We will employ an ensemble retriever, specifically a hybrid search retriever that combines the keyword retriever (BM25) with the vector store’s semantic retriever. This combination is a crucial step: keyword search catches exact terms that semantic search can miss, and vice versa.
Large Language Model: Gemma
Once the Retriever identifies relevant information from the index according to the user’s query, this pertinent data, along with the prompt, is passed to the Large Language Model (LLM), which functions as the generator within the RAG pipeline. This is where Gemma comes into play.
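The "pertinent data along with the prompt" amounts to assembling an augmented prompt for the generator. A minimal sketch of that assembly, with an illustrative prompt template (not the platform's actual one):

```python
def build_prompt(query: str, retrieved_chunks: list[str]) -> str:
    """Assemble the prompt the generator LLM (here, Gemma) receives:
    the retrieved context first, then the user's question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}"
                          for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Grounding the model in retrieved context this way is what lets a general-purpose LLM like Gemma answer questions about a document it was never trained on.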
Chain
The chain component combines the retriever and generator and completes the stack.
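In code form, the chain is just the glue between the two components; the sketch below uses stand-in functions for the retriever and for Gemma (everything here is illustrative, since the stack wires these visually):

```python
def rag_chain(query: str, retriever, generator) -> str:
    """Glue the Retriever and Generator together: fetch relevant chunks,
    build the augmented prompt, and let the LLM produce the answer."""
    docs = retriever(query)
    context = "\n".join(docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generator(prompt)

# Stand-ins for the real components; in the stack, the ensemble
# retriever and Gemma fill these roles.
fake_retriever = lambda q: ["Half-life regression predicts forgetting."]
fake_generator = lambda prompt: "It predicts when a learner will forget."

answer = rag_chain("What does HLR predict?", fake_retriever, fake_generator)
```

Because the chain only depends on the two callables, swapping Gemma 2B-it for 7B-it, or one vector store for another, changes nothing about how the pieces connect.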
Execute your Stack
After ensuring the proper connection of all components, initiate the stack by clicking on the run icon (⚡) to compile the connections. Upon a successful build, navigate to the chat icon to commence interaction.
Chat Interface
The chat interface of GenAI Stack offers a user-friendly experience and functionality for interacting with the model and customizing the prompt. The sidebar provides options that allow users to view and edit pre-defined prompt variables (queries). This feature facilitates quick experimentation by enabling the modification of variable values directly within the chat interface.
Feel free to connect with me on Twitter: https://twitter.com/TRJ_0751