Summarizing viral videos with local AI models

a search engine for video memes


📅 july 2024


This post describes a side project leveraging VILA, Llama 3, and Whisper to generate local (i.e. on my home computer) summaries of social media videos. The resulting metadata was then indexed into a local instance of meilisearch.

Here's a video of the project:

background

It's mid-2024, and generative AI remains in fashion - GPTs are being integrated into applications at a dizzying rate. Last year, someone even tried a ChatGPT-led church service.

I like to experiment with local/downloadable models - these can be freely run on consumer machines and the data stays on-device.

Earlier this year, Meta released Llama 3. NVIDIA and MIT also released VILA, a visual language model. I wanted to try them - so like many software endeavors, I picked the tech first and looked for an adjacent problem afterwards. 🙂 I tried local summarization of my favorite social media videos, then ingested the results into a meilisearch instance for proof-of-concept search.

Most of this took place in early May - I remember GPT-4o was released the week after.

example summary

Here's what the full pipeline produced on this adorable video.

Transcription (via Whisper)
Fam listen, it's Quentin vs. Beto. We playing different
I can mention that you poop without warning so keep an arm in the hand distance that baby face won't help you
Even if you had a grand entrance since I'm giving you bars
That means your new name is San Quentin
You messing with me?
Yo, get your mans, bro.
Get your...
Get your mans.
Get your mans, yo.
Wow, get your mans.
Yo.
What's your life like, Quentin?
Nah, nah, nah.
What's your life like, Quentin?
What's your life like, Quentin? Nah, nah, nah. What's your life like, Quentin? What's your life like, Quentin?
Every day you get pocket checked.
You eat like a king.
Two pair of households.
And you get lots of rest.
Yo, you're not a threat.
Get your mans, bro.
Get your mans, bro.
You can't come at me with gangster talk.
You can't come at me with gangsta talk.
You can't come at me with gangsta talk.
Claiming you poppin' tech.
I bag your mom.
I tell you when to go to sleep.
I don't have any losses yet.
Bro, what's so funny, bro?
This is strictly tech.
You thought you would take me on?
...
Analysis (via Whisper / Llama 3 / VILA)
The video shows an adult engaging in a playful rap battle with a baby. The adult, dressed in a gray shirt, stands facing the baby, who is seated in a blue high chair wearing white clothing. The room has a light color scheme with white walls and a couch featuring a white cover and green pillow.

The context is a rap battle between the adult (Quentin) and the baby (Beto). The adult uses baby talk and slang to engage in a playful battle, while the baby responds with coos and laughter. Subtitles suggest that the adult is poking fun at the idea of a baby being a rival in a rap battle.

Key moments include references by the adult to the baby's diaper changes, sleep patterns, and perceived lack of threats. The baby responds with coos and laughter, while the adult continues to engage in the rap battle.

The overall tone of the video is playful and lighthearted, with the adult using humor and wordplay to engage in a fun battle with the baby.

model overview

This section describes the three models.

VILA (encodes visual info for LLM)

VILA is among a growing family of visual language models released by NVIDIA and MIT. Some VILA/TinyChat demos are reproduced below:

a demo showing a terminal where VILA explains the condition of a building based on its image

VILA also supports input frame sequences (and therefore video). Their demo below shows three frames in sequence:

a demo showing a terminal where VILA explains someone's daily routine based on three pictures of their activities at different times

NVIDIA enabled AWQ 4-bit quantization. Coupled with TinyChat inference, this means VILA can run on consumer RTX GPUs - engineers looking to experiment with deep learning or AI at home often get NVIDIA GeForce RTX 3090s or 4090s. I used Efficient-Large-Model/Llama-3-VILA1.5-8B.

Llama 3 (large language model)

This is Meta's large language model - you can find the model card here. As of this week, Meta has just released Llama 3.1. Ollama is an easy way to experiment with LLMs; it also offers a Python library for scripting. I used the 8B variant.
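
For reference, here's a minimal sketch of calling Llama 3 through the Ollama Python library. It assumes the Ollama server is running locally and the model has already been pulled (ollama pull llama3:8b); the prompt is just illustrative:

```python
# Minimal sketch: one-shot chat with Llama 3 8B via the Ollama Python library.
# Assumes `ollama pull llama3:8b` has been run and the Ollama server is up.
import ollama

response = ollama.chat(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Summarize this text in one sentence: ..."}],
)
print(response["message"]["content"])
```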

Whisper (speech-to-text model)

This is OpenAI's automatic speech recognition model. I used whisper-large-v3 and followed the snippet from insanely-fast-whisper to leverage FlashAttention.
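
Roughly, that setup looks like the following (based on the insanely-fast-whisper snippet - the exact chunk and batch sizes I used may have differed):

```python
# Sketch: Whisper large-v3 via the Transformers ASR pipeline, with FlashAttention 2.
# Based on the insanely-fast-whisper snippet; chunk/batch sizes are illustrative.
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

outputs = pipe(
    "video_audio.wav",   # audio track extracted from the video
    chunk_length_s=30,
    batch_size=24,
    return_timestamps=True,
)
print(outputs["text"])
```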

generating summaries

setting a baseline

I tried VILA 'out of the box' against some example videos. It took some time to get set up locally; I ended up switching to WSL. Then I followed these steps from the tinychat repository. I modified the driver code here. The suggested prompt is "Elaborate on the visual and narrative elements of the video in detail" (reference gradio code). On example videos, I learned the following:

The baseline was reasonable, so I explored two updates:

long videos

The summaries were effective with up to eight frames, so I sampled one frame per second and split each video into eight-second segments. I ran each segment through VILA separately to get an individual summary. For many videos these summaries were similar, but for fast-changing videos the extra granularity provided useful context. I combined adjacent summaries in a later step. Here's a diagram of the process:

diagram showing how videos are split into multiple segments and each is summarized separately using VILA before consolidating later

Here's a real-time video of summary generation. You can see adjacent segment summaries are concatenated into arrays for later processing.
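
Here's a rough sketch of the segmentation step, assuming OpenCV for frame extraction (my actual driver code differed, but the idea is the same):

```python
# Sketch: sample roughly one frame per second and group frames into eight-frame
# segments, so each segment covers about eight seconds of video.
import cv2

def segment_video(path, frames_per_segment=8):
    cap = cv2.VideoCapture(path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 30
    frames, segments = [], []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % fps == 0:              # keep roughly one frame per second
            frames.append(frame)
            if len(frames) == frames_per_segment:
                segments.append(frames)   # one ~8-second segment
                frames = []
        index += 1
    cap.release()
    if frames:
        segments.append(frames)           # trailing partial segment
    return segments

# each segment (a list of up to 8 frames) then gets its own VILA summary
segments = segment_video("example_video.mp4")
```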

using audio

For commentary-heavy videos, the audio was critical. I decided to transcribe it using OpenAI's Whisper and place the resulting text alongside the VILA visual summaries for post-processing. My code looked something like this. Whisper worked fairly well overall, with the following notes:

Consolidating adjacent summaries

At this point, I had the following data for each video:

Adjacent summaries were highly redundant for homogeneous videos, and the transcription spanned multiple segments. So I used Llama 3 on its own to create a holistic summary reflecting the important parts of the video. I used the Ollama Python library.

First, I tried the naive approach of giving Llama 3 a single "do-everything" prompt - e.g. remove redundant information from adjacent segment summaries, leverage the transcription, and create an overall description of the video. Understandably, this was too much information at once and didn't work very well; the transcription's varying relevance would sometimes overshadow the visual summaries. I tried a few different approaches and landed on the following prompt orchestration:

Role Prompt or response
USER I am watching a video in 8-second increments and summarizing it. You have two jobs:
  • Can you please take in the new information for each segment, and combine it with the old information, to provide an overall description of the video that is detailed but has no redundant information?
  • If I tell you about any text overlays / subtitles, return those altogether. Your responses will be indexed for searching.
ASSISTANT response
USER summary 1 of n
ASSISTANT initial summary
USER summary 2 of n
ASSISTANT updated summary
and so on...
USER summary n of n
ASSISTANT updated summary
USER That was the last one. Please consolidate your findings.
ASSISTANT consolidated summary
USER I also tried transcribing this video, but there might be mistakes. I will give you the transcription in case it helps to contextualize your earlier analysis, but do not change too much. Please re-state or slightly update your findings.

provided transcription
ASSISTANT final summary
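
Sketched in code, that orchestration looks roughly like the loop below (prompts are abridged versions of the table above, and the helper names are illustrative):

```python
# Sketch of the incremental consolidation loop via the Ollama Python library.
# Prompts are abridged; the full wording is in the table above.
import ollama

def consolidate(segment_summaries, transcription, model="llama3:8b"):
    messages = [{
        "role": "user",
        "content": "I am watching a video in 8-second increments and summarizing it. "
                   "Combine each new segment summary with the old information into an "
                   "overall description with no redundant information, and return any "
                   "text overlays / subtitles you are told about.",
    }]

    def ask(content):
        # add the next user turn, get the assistant reply, and keep both in history
        messages.append({"role": "user", "content": content})
        reply = ollama.chat(model=model, messages=messages)["message"]["content"]
        messages.append({"role": "assistant", "content": reply})
        return reply

    for summary in segment_summaries:   # summary 1 of n ... summary n of n
        ask(summary)
    ask("That was the last one. Please consolidate your findings.")
    return ask("I also tried transcribing this video, but there might be mistakes. "
               "Please re-state or slightly update your findings.\n\n" + transcription)
```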

The above orchestration generally produces a reasonable consolidated summary that takes the transcription into account. I noticed the style varied; sometimes the LLM produced first-person text. So I again asked Llama 3 to massage the text, but this time with fresh history (i.e. a new agent with no memory of the previous conversation):

Role Prompt or response
USER The following video summary was generated by AI. Remove any language where the AI is speaking in first-person or meta-describing its analysis process. Preface your writing with 'ANALYSIS:'

provided summary
ASSISTANT response

Function calling or more stringent formatting would be better, but this worked well enough.

indexing

Now I had a consolidated summary and a full transcription for each video. To make the metadata searchable, I used meilisearch to locally index and visualize the summaries. I followed the quick start docs and modeled my metadata after their movies.json. I slightly customized their built-in search preview and got things running on localhost. The full-text search is not as sophisticated as some alternatives, but it worked well for a proof of concept.
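
As a sketch of the indexing step with the meilisearch Python client (a local instance on the default port; the document fields here are illustrative, loosely modeled after movies.json):

```python
# Sketch: pushing video metadata into a local Meilisearch instance.
# Assumes Meilisearch is running on localhost:7700; fields are illustrative.
import meilisearch

client = meilisearch.Client("http://localhost:7700", "aMasterKey")
index = client.index("videos")

index.add_documents([
    {
        "id": 1,
        "title": "rap battle with a baby",
        "summary": "An adult engages in a playful rap battle with a baby...",
        "transcription": "Fam listen, it's Quentin vs. Beto...",
    },
])

# full-text search over summaries and transcriptions
print(index.search("rap battle baby")["hits"])
```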

odds and ends

If you liked this, I also recommend I accidentally built a meme search engine by Harper Reed. It focuses on static images and is a great read. I tried running some thumbnails through CLIP and saw similar results.

If I had more time, I might try these other ideas:

There are likely better ways to solve this problem from first principles - but this project was a nice way for me to locally experiment with these models.

Thanks for reading! You can find other posts and contact info here.