In this article, we will build an Indian Tourist Places recommendation system: a Multimodal RAG application using LlamaIndex, Gemini Pro Vision, and Qdrant as the knowledge base.
What is Multimodal?
Multimodal refers to the integration of multiple modes of communication or information processing, typically involving two or more sensory channels such as text, images, audio, or video. In the context of technology and artificial intelligence, multimodal systems aim to understand and generate content across these various modalities, allowing for richer and more versatile interactions with users.
With the growing popularity of Large Language Models, we have seen many applications built using Retrieval Augmented Generation (RAG). RAG combines two components, a Retriever and a Generator, to handle information retrieval and text generation tasks.
The process includes the following steps (a minimal pseudocode sketch follows this list):
The user asks a question or provides an instruction.
The system queries a vector database, which performs a similarity search and retrieves the relevant documents (R - Retrieval).
The user's prompt and the relevant information from the vector database are supplied to the language model as in-context learning (A - Augmentation).
Based on this augmented prompt, the LLM generates a response for the user (G - Generation).
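Here is a minimal pseudocode-style sketch of that loop in Python. The names rag_answer, vector_db.similarity_search, doc.text, and llm.complete are illustrative placeholders, not a specific library's API.
def rag_answer(question, vector_db, llm):
    # R - Retrieval: similarity search over the knowledge base (hypothetical API)
    relevant_docs = vector_db.similarity_search(question, top_k=3)
    # A - Augmentation: supply the retrieved context to the LLM as in-context learning
    context = "\n\n".join(doc.text for doc in relevant_docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # G - Generation: the LLM answers from the augmented prompt
    return llm.complete(prompt)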
However, the approach to building a Multimodal RAG is quite different. Let's dig deeper into it.
Pydantic Output Parser in LlamaIndex
There are typically several approaches to building a Multimodal RAG:
Use a Multimodal LLM such as Gemini-Pro-Vision, GPT-4V, LLaVA, or FUYU-8b to produce text summaries from images. Once the image summaries are generated, embed them and retrieve them with a reference to the raw image. In this approach, both images and text can be retrieved.
Use a multimodal embedding model such as CLIP to embed images and text in the vector database, and then use a multi-vector retriever to generate the response.
Use a Multimodal LLM such as Gemini-Pro-Vision, GPT-4V, LLaVA, or FUYU-8b to produce text summaries from images, and embed and retrieve text only. The approach used in this article is closest to this one: we retrieve only the text generated from the image summaries.
In our approach, we use the Pydantic output parser from LlamaIndex to extract information from an image with the Gemini Pro Vision multimodal model. The information can be extracted as a summary or in any other form, depending on the prompt provided. You define the attributes you need in a Pydantic class, which guides the Gemini model in reasoning over the image and generating output in a structured format.
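As a minimal sketch of that idea, the Pydantic class defines the JSON the model should produce, and PydanticOutputParser turns the raw string back into a typed object. The Landmark class and the hard-coded JSON string below are illustrative, not the article's data, and the sketch assumes the parser's parse method accepts a raw JSON string.
from pydantic import BaseModel
from llama_index.core.output_parsers import PydanticOutputParser

class Landmark(BaseModel):
    # Illustrative schema, not the one used later in the article
    city_name: str
    famous_food: str

parser = PydanticOutputParser(Landmark)

# Assume the multimodal LLM returned this raw JSON string for an image
raw_output = '{"city_name": "Agra", "famous_food": "Petha"}'
landmark = parser.parse(raw_output)
print(landmark.city_name)  # Agra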
Building the Indian Tourist Places Recommendation
Let's get our hands dirty now.
Installation
We need to install LlamaIndex, Google Generative AI (Gemini) and Qdrant for storing embeddings.
!pip install llama-index
!pip install 'google-generativeai>=0.3.0' qdrant_client
!pip install llama-index-multi-modal-llms-gemini
!pip install llama-index-vector-stores-qdrant
!pip install llama-index-embeddings-gemini
Setup Gemini API
Get your API key from here: https://ai.google.dev/
import os
from getpass import getpass
from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.core.schema import TextNode
from llama_index.core import SimpleDirectoryReader
GOOGLE_API_KEY = getpass()
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
List the available Gemini models that support text generation tasks.
import google.generativeai as genai

for m in genai.list_models():
    if 'generateContent' in m.supported_generation_methods:
        print(m.name)
Output:
Data Loading and extraction
Download a few PNG images of famous Indian places to build the knowledge base. Create a folder, save the images inside it, and use SimpleDirectoryReader to load them.
documents = SimpleDirectoryReader("./indian_places")
documents = documents.load_data()
Create the Pydantic output class
The Pydantic class acts as the extraction schema: the output parser uses it to return the image information as well-formed JSON.
from pydantic import BaseModel

class Indian_Places(BaseModel):
    city_name: str
    state_name: str
    famous_food: str
    history: str
    review: str
    description: str
    nearby_tourist_places: str
Extract information using Gemini Pro Vision
Initialize the Gemini Pro Vision model and write a prompt to extract the summary or information defined in the Pydantic output class. Pass each image and set verbose to True so that you can see the generated response. This produces the raw output that we later parse into nodes, as supported by LlamaIndex indexing.
prompt_template_str = """\
You are an AI assistant. Your job is to summarize images, tables, and text CONTEXT for retrieval. \
You MUST do this job coherently and honestly. \
You MUST return the answer in JSON format. \
"""
def pydantic_gemini(
    model_name, output_class, image_documents, prompt_template_str
):
    """Run Gemini Pro Vision over the image documents and parse the result into the Pydantic class."""
    gemini_llm = GeminiMultiModal(model_name=model_name)
    llm_program = MultiModalLLMCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_class),
        image_documents=image_documents,
        prompt_template_str=prompt_template_str,
        multi_modal_llm=gemini_llm,
        verbose=True,
    )
    response = llm_program()
    return response
results = []
for img_doc in documents:
    pydantic_response = pydantic_gemini(
        "models/gemini-pro-vision",
        Indian_Places,
        [img_doc],
        prompt_template_str,
    )
    results.append(pydantic_response)
Verbose result:
As you can see below, the responses follow the required output format.
> Raw output: {
"city_name": "Coimbatore",
"state_name": "Tamil Nadu",
"famous_food": "South Indian",
"history": "Coimbatore is a city in the Indian state of Tamil Nadu. It is the second largest city in the state after Chennai. Coimbatore is known for its textile industry and is often referred to as the \"Manchester of South India\". The city is also home to several educational institutions and research centers.",
"review": "Coimbatore is a beautiful city with a rich history and culture. The city is home to several temples, mosques, and churches. The climate is tropical and the city experiences hot summers and mild winters. The city is well-connected by air, rail, and road. Coimbatore is a major industrial and commercial center and is home to several large corporations. The city is also a major educational center and is home to several universities and colleges.",
"description": "Coimbatore is a city in the Indian state of Tamil Nadu. It is the second largest city in the state after Chennai. Coimbatore is known for its textile industry and is often referred to as the \"Manchester of South India\". The city is also home to several educational institutions and research centers.",
"nearby_tourist_places": "Coimbatore is home to several tourist attractions, including the Adiyogi Shiva statue, the Perur Pateeswarar Temple, and the VOC Park and Zoo."
> Raw output: {
"city_name": "Agra",
"state_name": "Uttar Pradesh",
"famous_food": "Petha",
"history": "Agra was the capital of the Mughal Empire from 1526 to 1658. It was also the capital of the Sur Empire from 1540 to 1556.",
"review": "Agra is a beautiful city with a rich history. The Taj Mahal is one of the most iconic buildings in the world and is a must-see for any visitor to India.",
"description": "Agra is a city on the banks of the Yamuna River in the state of Uttar Pradesh, India. It is the fourth-most populous city in Uttar Pradesh and the 23rd-most populous city in India.",
"nearby_tourist_places": "Fatehpur Sikri, Akbar's Tomb, Agra Fort"
}
> Raw output: {
"city_name": "New Delhi",
"state_name": "NCT",
"famous_food": "Chole Bhature",
"history": "New Delhi is the capital of India and a major tourist destination. It is home to many historical monuments, including the Red Fort, the Jama Masjid, and the Qutub Minar.",
"review": "New Delhi is a vibrant and exciting city with a lot to offer visitors. There are many things to see and do, and the food is delicious. I would highly recommend visiting New Delhi to anyone who is interested in learning more about India.",
"description": "New Delhi is a city that is full of history and culture. There are many things to see and do, and the food is delicious. I would highly recommend visiting New Delhi to anyone who is interested in learning more about India.",
"nearby_tourist_places": "The Red Fort, the Jama Masjid, and the Qutub Minar."
}
> Raw output: Here's a JSON object based on the image:
{
"city_name": "New Delhi",
"state_name": "NCT",
"famous_food": "Chole Bhature",
"history": "New Delhi is the capital of India and is known for its rich history and culture. It is home to many historical monuments, including the Red Fort, the Jama Masjid, and the Qutub Minar.",
"review": "New Delhi is a great city to visit, with something to offer everyone. There are many historical monuments to explore, as well as a variety of museums, art galleries, and shopping malls. The city is also home to a number of parks and gardens, which are perfect for relaxing and enjoying the outdoors.",
"description": "New Delhi is a vibrant and exciting city that is full of life. There is always something going on, and there is always something new to see or do. The city is also home to a diverse population, which makes it a great place to learn about different cultures.",
"nearby_tourist_places": "There are many tourist places to visit in New Delhi, including the Red Fort, the Jama Masjid, the Qutub Minar, the Lotus Temple, the India Gate, and the Rashtrapati Bhavan."
}
> Raw output: {
"city_name": "Jaipur",
"state_name": "Rajasthan",
"famous_food": "Dal Baati Churma",
"history": "Jaipur was founded in 1727 by Maharaja Sawai Jai Singh II, the ruler of Amber.",
"review": "Jaipur is a beautiful city with a rich history and culture.",
"description": "Jaipur is the capital of Rajasthan and is known as the Pink City due to the color of its buildings.",
"nearby_tourist_places": "Amber Fort, Nahargarh Fort, Jaigarh Fort, Hawa Mahal, City Palace, Jantar Mantar"
}
Node parsing
To create a vector index, we pass the raw image-summary output into TextNode objects, which handle the parsing of each document.
nodes = []
for res in results:
    text_node = TextNode()
    metadata = {}
    # Each Pydantic response iterates as (field_name, value) pairs
    for r in res:
        if r[0] == "description":
            # The description becomes the node text that gets embedded
            text_node.text = r[1]
        else:
            # Every other field is stored as node metadata
            metadata[r[0]] = r[1]
    text_node.metadata = metadata
    nodes.append(text_node)
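Optionally, sanity-check one parsed node before indexing; the description lives in text and the remaining fields sit in metadata:
# Optional: inspect the first parsed node
print(nodes[0].text)      # the "description" field
print(nodes[0].metadata)  # city_name, state_name, famous_food, history, review, nearby_tourist_places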
Store embeddings into Qdrant
LlamaIndex provides a StorageContext that falls back to a default vector store. To use Qdrant as the knowledge base, we override it with the Qdrant client and QdrantVectorStore.
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
import qdrant_client
client = qdrant_client.QdrantClient(path="qdrant_gemini_3")
vector_store = QdrantVectorStore(client=client, collection_name="collection")
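The path argument runs Qdrant in local, file-based mode. If you have a running Qdrant server or a Qdrant Cloud cluster instead, you can point the client at it; the URL and API key below are placeholders.
# Alternative: connect to a Qdrant server or Qdrant Cloud cluster
client = qdrant_client.QdrantClient(
    url="https://YOUR-CLUSTER-URL:6333",  # placeholder
    api_key="YOUR_QDRANT_API_KEY",        # placeholder
)
vector_store = QdrantVectorStore(client=client, collection_name="collection")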
LLM and Embedding Model
By default, LlamaIndex uses the OpenAI LLM and embedding model. To override the default embed_model and LLM, we use Settings from LlamaIndex core.
from llama_index.core import Settings
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
Settings.llm = Gemini(api_key=GOOGLE_API_KEY)
Build Query Engine to retrieve the relevant document
Combine the vector store and the parsed nodes to create the index. The vector store index then acts as the query engine, performing similarity search and retrieving the top `k` documents.
from llama_index.core import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex(
    nodes=nodes,
    storage_context=storage_context,
)

query_engine = index.as_query_engine(
    similarity_top_k=1,
)
Run User Query
In the final step, we pass the user query and run the query engine.
response = query_engine.query(
    "which place belongs to Coimbatore from the given context, and tell about that given place history. Also tell whats the best food one can eat there?"
)
print(response)
Output
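To see which stored node backed the answer, you can also inspect the retrieved nodes that LlamaIndex attaches to the response object:
# Inspect the node(s) retrieved for this answer
for node_with_score in response.source_nodes:
    print(node_with_score.score)
    print(node_with_score.node.metadata.get("city_name"))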
Reference: LlamaIndex documentation: https://docs.llamaindex.ai/en/stable/examples/multi_modal/gemini/
LinkedIn: https://www.linkedin.com/in/jaintarun75/
GitHub: https://github.com/lucifertrj/
Twitter: https://twitter.com/TRJ_0751