The emergence of models such as DALL-E and GPT-4 has transformed human-machine interaction through multimodal AI. Multimodal AI represents a paradigm shift: unlike traditional models that are limited to a single modality (or data type), multimodal models interpret data from several modalities at once. They build contextual understanding with the help of numerical embeddings, which helps them make informed decisions and generate richer answers.
In this article, we will implement text-to-speech and text-to-image using our GenAI Stack.
Multimodal
In generative AI, multimodal refers to models capable of understanding and generating content across different data types or ‘modalities’. These models can process and integrate information from diverse sources like text, images, audio, and video, leading to more comprehensive and diverse results. They use multi-dimensional embeddings to represent and generate data across these modalities.
For example, OpenAI’s GPT-4 is a multimodal model capable of understanding both text and images. This opens up numerous use cases, as multimodal models can perform tasks that text-only or image-only models cannot. For instance, by identifying the contents of an image, GPT-4 can provide instructions for debugging code within a web design framework, making such complex tasks easier to perform.
Learn more about: Building Multimodal RAG
How do these models work?
The architecture of these models can be divided into three major modules (a minimal code sketch of this layout follows the descriptions below):
Input module — The input module comprises neural networks that handle and preprocess various types of data individually. Depending on the specific modality, this module employs techniques like natural language processing or computer vision. For instance, when dealing with text data, it may utilize methods such as tokenization, stemming, and part-of-speech tagging to extract relevant information. Conversely, when processing image data, convolution and pooling layers are employed to extract information.
Fusion module — Following extraction, the fusion module is engaged to merge information from various modalities like text, images, audio, and video. This module can vary in form, ranging from basic operations like concatenation to more intricate methods such as attention mechanisms or graph convolutional networks. The fusion module aims to capture pertinent information from each modality and integrate it in a manner that exploits the strengths of each.
Output module — The output module receives the fused information and produces a conclusive output or prediction in a format relevant to the given task. For instance, in a scenario where the task involves classifying an image using both its content and a textual description, the output module could generate a label or a list of ranked labels representing the most probable classes to which the image belongs.
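To make these three modules concrete, here is a minimal, illustrative sketch in PyTorch of a toy model that classifies an image using both its pixels and a tokenized text description. The class name, layer sizes, and the concatenation-based fusion are assumptions chosen for brevity, not the architecture of any real multimodal model.

```python
import torch
import torch.nn as nn

class TinyMultimodalClassifier(nn.Module):
    """Illustrative three-module layout: input encoders, fusion, output head."""

    def __init__(self, vocab_size=10000, text_dim=128, image_dim=128, num_classes=10):
        super().__init__()
        # Input module: one encoder per modality.
        self.text_embedding = nn.Embedding(vocab_size, text_dim)  # token ids -> embeddings
        self.image_encoder = nn.Sequential(                       # 3x64x64 image -> feature vector
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(16, image_dim),
        )
        # Fusion module: simple concatenation followed by a linear projection.
        self.fusion = nn.Linear(text_dim + image_dim, 256)
        # Output module: task-specific head mapping the fused vector to class scores.
        self.head = nn.Linear(256, num_classes)

    def forward(self, token_ids, image):
        text_feat = self.text_embedding(token_ids).mean(dim=1)  # average over tokens
        image_feat = self.image_encoder(image)
        fused = torch.relu(self.fusion(torch.cat([text_feat, image_feat], dim=-1)))
        return self.head(fused)

model = TinyMultimodalClassifier()
logits = model(torch.randint(0, 10000, (2, 12)), torch.randn(2, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```

In real systems the fusion step is usually far more sophisticated (cross-attention, for example), but the division into input, fusion, and output modules stays the same.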
Multimodals in GenAI stack
Here, we will be using AI Planet’s GenAI stack, a developer platform simplifying the creation of production-grade Large Language Model (LLM) applications with an intuitive drag-and-drop interface.
First, open “app.aiplanet.com” in your browser and log in.
Then click the “New Stack” button and choose the “Text Generation” stack type to create a new stack.
The Multi-modal component in the GenAI Stack is a specialized custom component that uses models such as diffusion models or speech generators to create image or voice outputs from a simple text prompt.
Currently, we have two Multi-Modals available: OpenAITextToImage and OpenAITextToSpeech.
OpenAITextToImage — This feature utilizes the DALL-E-3 model from OpenAI. It requires an OpenAI API key (available from https://platform.openai.com/) for authentication and access to the OpenAI API. Additionally, image generation requires a prompt and the desired image quality (standard or HD).
OpenAITextToSpeech — This feature employs the TTS-1 model from OpenAI as its foundation. Users can generate spoken audio from text using this component. It requires an OpenAI API key, along with a text input for speech conversion. Additionally, there is an option to select the desired voice for the generated speech.
Text-to-Image
To build the Text-to-Image stack, follow the steps mentioned below:
Drag the OpenAITextToImage component from the Multimodals section on the left pane, and the TextGeneration component from the “Outputs” section.
We use the “TextGenerationOutput” component to store and manage the output generated by the Multimodal component. This component ensures that the generated output is easily accessible to users. Connect both of the components by linking the output of the Multimodal component to the input labeled “Multimodal” of the TextGenerationOutput component.
Next, drag and drop the PromptTemplate from the Prompts section. This defines the prompt used by the Multimodal component. We can also declare variables in the prompt by enclosing them in {} and supplying their values through an external Input component; for example, a prompt such as “A watercolor painting of {subject} at sunset” would expect an Input with the key “subject”.
Define the Input value (in the input component) with a text type, ensuring that the variable name used in the prompt matches the Input key.
After connecting all the components, the stack layout should resemble the provided diagram:
Generate the desired image by clicking the build button and then the text-generation button. Here’s our output.
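Under the hood, the OpenAITextToImage component relies on DALL-E-3, so the stack’s image generation corresponds roughly to the direct OpenAI SDK call sketched below. The prompt, quality, and size values are illustrative assumptions; the component issues this kind of request for you.

```python
# Rough sketch of the DALL-E-3 request that the OpenAITextToImage component wraps.
# The prompt, quality, and size values here are illustrative.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at sunset",
    quality="standard",  # "standard" or "hd"
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```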
Text-to-Speech
To build the Text-to-Speech stack, follow the steps mentioned below:
Begin by dragging the OpenAITextToSpeech component from the Multimodals section and the TextGeneration component from the Outputs section, then connect them.
Next, configure the Text Input within the multimodal component with your desired text, noting that only this sentence will be converted to speech. You can also specify the preferred voice, which defaults to “alloy”.
Once all components are connected, the stack configuration should resemble the provided example:
Note — We can also use an Input component to define the “text input” separately.
Subsequently, generate the desired speech by clicking the build button and then selecting the text-generation button. The process will automatically generate the speech, which you can either play or download using the dedicated buttons provided.
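Similarly, the OpenAITextToSpeech component relies on TTS-1, so the generated audio corresponds roughly to the direct SDK call sketched below. The input text, voice, and output filename are illustrative assumptions.

```python
# Rough sketch of the TTS-1 request that the OpenAITextToSpeech component wraps.
# The input text, voice, and output filename here are illustrative.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # other voices include echo, fable, onyx, nova, and shimmer
    input="Hello from the GenAI Stack!",
)
with open("speech.mp3", "wb") as f:
    f.write(speech.read())  # the response is binary audio; save it as an mp3
```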
Thus, we have developed multimodal stacks simply by providing a basic text prompt. If you encounter any challenges while building either of the stacks, you can refer to the YouTube video linked below.
Try out and experiment with this use case in our GenAI stack —
Explore other use cases and stacks —
Official Documentation of GenAI stack —