

Transfer learning

Transfer learning is a machine learning technique where a model trained on one task or dataset is adapted and used as a starting point for a different but related task or dataset. Instead of training a new model from scratch, transfer learning leverages the knowledge gained during the training of the source model to accelerate the learning process on the target task.

[Image: transfer learning overview. Image credits: neptune.ai]


In the previous module we learned about fine-tuning LLMs, which also involved adapting an already-trained model rather than building one from scratch. So how is transfer learning different?

Transfer learning vs fine-tuning

Imagine you are an experienced Italian chef who has mastered the art of making various Italian pasta dishes. You've honed your skills in making pasta dough, crafting exquisite sauces, and cooking al dente pasta to perfection. One day, you decide to explore a new culinary venture: making fresh Japanese sushi.

In this scenario, transfer learning would be like taking your expertise in Italian cuisine and applying it to the world of sushi making. While Italian and Japanese cuisines are distinct, you can leverage your knowledge of handling ingredients, knife skills, and presentation aesthetics. For example, you might use your pasta-making expertise to roll sushi rolls tightly and create visually appealing presentations, even though the specific ingredients and techniques are different.

Now, let's say you've been running a successful Italian restaurant, and you have a signature pasta dish on your menu that customers adore. However, you want to introduce a seasonal variation of this dish, incorporating fresh, locally sourced ingredients. Instead of creating an entirely new dish from scratch, you decide to fine-tune your existing signature pasta recipe.

In this context, fine-tuning is like taking your trusted pasta recipe and making targeted adjustments. You might swap out certain ingredients, change the cooking time for the seasonal vegetables, or modify the sauce to complement the new flavors. The goal is to maintain the essence of the original dish while tailoring it to the specific ingredients and preferences of the season.

Transfer Learning:

In transfer learning for text data, you start with a pre-trained language model like BERT that has been trained on a massive corpus of text from the internet. This model has already learned grammar, language structure, and a broad understanding of many topics; you keep those pre-trained layers fixed and train only a new output layer for your target task.
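As a rough sketch of what this can look like in code (assuming the Hugging Face transformers library and PyTorch; the checkpoint name and the two-class head are illustrative choices), you load pre-trained BERT, freeze its weights, and train only a new head on top:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained BERT encoder and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Transfer learning: freeze all pre-trained parameters...
for param in bert.parameters():
    param.requires_grad = False

# ...and attach a new task-specific head (e.g. 2 sentiment classes)
classifier = nn.Linear(bert.config.hidden_size, 2)

inputs = tokenizer("This course is great!", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)

# Use the [CLS] token embedding as a sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
logits = classifier(cls_embedding)  # only `classifier` is trained
```

During training, only the classifier's parameters receive gradient updates; the knowledge in BERT's frozen layers is reused as-is.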

Fine-Tuning:

Fine-tuning in text data involves taking a pre-trained language model and updating not only the final output layer but also some of the preceding layers in the model.
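A minimal sketch of the difference, under the same assumptions (transformers + PyTorch; unfreezing the last two encoder layers is an illustrative choice, not a rule):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load BERT with a fresh classification head attached
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Freeze everything first...
for param in model.parameters():
    param.requires_grad = False

# ...then unfreeze the new head AND the last two encoder layers,
# so those pre-trained layers also adapt to the target task
for param in model.classifier.parameters():
    param.requires_grad = True
for layer in model.bert.encoder.layer[-2:]:
    for param in layer.parameters():
        param.requires_grad = True

# The optimizer only updates the unfrozen parameters
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```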

For an excellent discussion of fine-tuning vs transfer learning vs learning from scratch, see this thread on StackExchange. Joel's answer there is a particularly good summary of the distinctions.

Multimodal models

Multi-modal AI models are designed to process and generate content that involves multiple modalities, such as text and images. Image captioning is a prime example of a multi-modal task where a model generates textual descriptions (captions) for images.

In practice, the most popular combinations mix the three most common modalities:

  • Image + Text
  • Image + Audio
  • Image + Text + Audio
  • Text + Audio

[Image: multimodal model overview. Image credits: leewayhertz]

Imagine you're trying to understand a story someone is telling you. If they're using both words and gestures, you can get a better grasp of the story's meaning by considering both the words they say and the gestures they make. Multimodal models work similarly, combining different types of data to gain a deeper understanding of the information they are processing.

Three application use cases:


Image Captioning:

Use Case: Automatically generating descriptive captions for images.

How it Works: Multimodal models combine the visual information from an image with the textual information from natural language descriptions. They analyze the content of the image and the context provided by the text to generate coherent and relevant captions.

Example: Given an image of a cat playing with a ball, the model can generate a caption like "A playful cat chasing a ball."
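A minimal captioning sketch, assuming the transformers library; the BLIP checkpoint and the image filename are illustrative assumptions, and any image-to-text model could be swapped in:

```python
from transformers import pipeline

# Image-to-text pipeline backed by a pre-trained captioning model
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Accepts a local path or URL; "cat_with_ball.jpg" is a hypothetical file
result = captioner("cat_with_ball.jpg")
print(result[0]["generated_text"])  # e.g. "a cat playing with a ball"
```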

Emotion Recognition in Videos:

Use Case: Analyzing emotional content in videos for applications in sentiment analysis, content moderation, and mental health monitoring.

How it Works: Multimodal models process both the visual (facial expressions, body language) and auditory (tone of voice, speech content) data in a video to recognize and classify emotions, such as happiness, sadness, anger, or surprise.

Example: Detecting a person's sadness by analyzing their facial expressions and listening to the melancholic tone in their voice in a video clip.
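As an illustrative PyTorch sketch (the encoders, dimensions, and emotion classes here are placeholder assumptions, not a specific published architecture), a common pattern is late fusion: embed each modality separately, concatenate the embeddings, and classify:

```python
import torch
import torch.nn as nn

class LateFusionEmotionClassifier(nn.Module):
    """Fuses visual and audio embeddings to predict an emotion class."""

    def __init__(self, visual_dim=512, audio_dim=128, num_emotions=4):
        super().__init__()
        # In practice the inputs would come from pre-trained encoders
        # (e.g. a CNN over video frames, a spectrogram model over audio);
        # the dimensions here are placeholder assumptions.
        self.fusion = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_emotions),  # e.g. happy, sad, angry, surprised
        )

    def forward(self, visual_emb, audio_emb):
        # Late fusion: concatenate per-modality embeddings, then classify
        fused = torch.cat([visual_emb, audio_emb], dim=-1)
        return self.fusion(fused)

model = LateFusionEmotionClassifier()
visual = torch.randn(1, 512)  # stand-in for facial-expression features
audio = torch.randn(1, 128)   # stand-in for tone-of-voice features
logits = model(visual, audio)  # scores over the four emotions
```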

Healthcare Diagnosis with Electronic Health Records (EHRs):

Use Case: Assisting doctors in making accurate diagnoses and treatment recommendations based on patients' medical records, which include text notes, lab results, and medical images.

How it Works: Multimodal models process the textual data from patient records together with the visual data from medical images like X-rays or MRIs. By combining these modalities, they can help identify patterns and correlations that might be missed by examining each modality separately.

Example: A multimodal model can assist a radiologist in diagnosing lung cancer by analyzing both the patient's medical history and the X-ray images, increasing diagnostic accuracy.
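A hedged sketch of the same fusion idea for EHR data, assuming transformers and PyTorch; the general-purpose BERT and ViT checkpoints stand in for the domain-specific clinical and radiology encoders a real system would use, and the file name is hypothetical:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer, ViTModel

# Separate encoders for each modality (general-purpose stand-ins here)
text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

note = "Patient reports persistent cough and unexplained weight loss."
xray = Image.open("chest_xray.png")  # hypothetical image file

text_inputs = text_tokenizer(note, return_tensors="pt")
image_inputs = image_processor(images=xray, return_tensors="pt")

with torch.no_grad():
    # [CLS] embeddings summarize each input (768-d for both models)
    text_emb = text_encoder(**text_inputs).last_hidden_state[:, 0, :]
    image_emb = image_encoder(**image_inputs).last_hidden_state[:, 0, :]

# Fuse the modalities and score with a (here untrained) diagnostic head
fused = torch.cat([text_emb, image_emb], dim=-1)
diagnosis_head = torch.nn.Linear(fused.shape[-1], 2)  # finding / no finding
logits = diagnosis_head(fused)
```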