a search engine for video memes
📅 july 2024
This post describes a side project leveraging VILA, Llama 3, and Whisper to generate local (i.e. on my home computer) summaries of social media videos. The resulting metadata was then indexed into a local instance of meilisearch.
Here's a video of the project:
It's mid-2024, and generative AI remains in fashion - GPTs are being integrated into applications at a dizzying rate. Last year, someone even tried a ChatGPT-led church service.
I like to experiment with local/downloadable models - these can be freely run on consumer machines and the data stays on-device.
Earlier this year, Meta released Llama 3. NVIDIA and MIT also released VILA, a visual language model. I wanted to try them - so like many software endeavors, I picked the tech first and looked for an adjacent problem afterwards. 🙂 I tried local summarization of my favorite social media videos, then ingested the results into a meilisearch instance for proof-of-concept search.
Most of this took place in early May - I remember GPT-4o was released the week after.
Here's what the full pipeline produced on this adorable video.
This section describes the three models.
VILA, released by NVIDIA and MIT, is part of a growing family of visual language models. Some VILA/TinyChat demos are reproduced below:
VILA also supports input frame sequences (and therefore video). Their demo below shows three frames in sequence:
NVIDIA enabled AWQ 4-bit quantization; coupled with TinyChat inference, this means VILA can run on consumer RTX GPUs - engineers looking to experiment with deep learning or AI at home often reach for an NVIDIA GeForce RTX 3090 or 4090. I used Efficient-Large-Model/Llama-3-VILA1.5-8B.
This is Meta's large language model - you can find the model card here. As of this week, Meta has just released Llama 3.1. Ollama is an easy way to experiment with LLMs; it also offers a Python library for scripting. I used the 8B variant.
This is OpenAI's automatic speech recognition model. I used whisper-large-v3 and followed the snippet from insanely-fast-whisper to leverage FlashAttention.
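For reference, that snippet boils down to something like the following - a sketch, not my exact code; the audio filename is a placeholder and the chunking/batch settings are knobs to tune for your GPU:

```python
import torch
from transformers import pipeline

# Roughly the insanely-fast-whisper recipe: whisper-large-v3 through the
# transformers ASR pipeline, fp16 on GPU, with FlashAttention 2 enabled.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# "video_audio.wav" is a placeholder for audio extracted from a video.
result = asr(
    "video_audio.wav",
    chunk_length_s=30,       # process long audio in 30-second chunks
    batch_size=24,           # tune to fit available GPU memory
    return_timestamps=True,  # keep timestamps to align with video segments
)
print(result["text"])
```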
I tried VILA 'out of the box' against some example videos. It took some time to get set up locally, and I ended up switching to WSL. Then I followed these steps from the tinychat repository. I modified the driver code here. The suggested prompt is "Elaborate on the visual and narrative elements of the video in detail" (reference gradio code). On example videos, I learned the following:
The baseline was reasonable, so I explored two updates:
The summaries were effective with up to eight frames, so I decided to sample one frame per second and split each video into eight-second segments. I ran each segment through VILA separately to get an individual summary. For many videos these summaries were similar, but for fast-changing videos the per-segment granularity provided useful context. I combined adjacent summaries in a later step. Here's a diagram of the process:
Here's a real-time video of summary generation. You can see adjacent segment summaries are concatenated into arrays for later processing.
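Here's a minimal sketch of that segmentation step using OpenCV - my actual driver code differed, and the filename is just a placeholder:

```python
import cv2

def segment_frames(video_path, fps_sample=1, frames_per_segment=8):
    """Sample one frame per second and group the frames into 8-frame segments."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if fps is unknown
    step = max(1, int(round(native_fps / fps_sample)))  # source frames per sample

    segments, current, index = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            current.append(frame)
            if len(current) == frames_per_segment:  # 8 frames ~= 8 seconds
                segments.append(current)
                current = []
        index += 1
    cap.release()
    if current:  # keep any trailing partial segment
        segments.append(current)
    return segments

# Each segment (a list of up to 8 frames) then gets its own VILA summary.
segments = segment_frames("example_video.mp4")
```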
For commentary-heavy videos, the audio was critical. I decided to transcribe using OpenAI's Whisper and place the resulting text alongside the VILA visual summaries for post-processing. My code looked something like this. Whisper worked fairly well overall, with the following notes:
At this point, I had the following data for each video:
Adjacent summaries were highly redundant for homogeneous videos, and the transcription spanned multiple segments. So I used Llama 3 on its own to create a holistic summary reflecting the important parts of each video, via the Ollama Python library.
First, I tried the naive approach of providing Llama 3 with a single "do-everything" prompt - e.g. remove redundant information from adjacent segment summaries, leverage the transcription, and create an overall description of the video. Understandably, this was too much at once and didn't work very well: the transcription's relevance varied from video to video, and it would sometimes overshadow the visual summaries. I tried a few different approaches and landed on the following prompt orchestration:
Role | Prompt or response |
---|---|
USER | I am watching a video in 8-second increments and summarizing it. You have two jobs: |
ASSISTANT | response |
USER | summary 1 of n |
ASSISTANT | initial summary |
USER | summary 2 of n |
ASSISTANT | updated summary |
and so on... | |
USER | summary n of n |
ASSISTANT | updated summary |
USER | That was the last one. Please consolidate your findings. |
ASSISTANT | consolidated summary |
USER | I also tried transcribing this video, but there might be mistakes. I will give you the transcription in case it helps to contextualize your earlier analysis, but do not change too much. Please re-state or slightly update your findings. provided transcription |
ASSISTANT | final summary |
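With the Ollama Python library, the orchestration above can be sketched roughly like this - the opening prompt is abbreviated, and `segment_summaries` / `transcription` are assumed to come from the earlier steps:

```python
import ollama

def consolidate(segment_summaries, transcription, model="llama3"):
    """Feed segment summaries one at a time, then fold in the transcription."""
    # Opening instructions (abbreviated - see the table above) plus the model's acknowledgement.
    messages = [{
        "role": "user",
        "content": "I am watching a video in 8-second increments and summarizing it. "
                   "You have two jobs: ...",
    }]
    messages.append(ollama.chat(model=model, messages=messages)["message"])

    # One turn per segment summary; the model keeps updating its running summary.
    n = len(segment_summaries)
    for i, summary in enumerate(segment_summaries, start=1):
        messages.append({"role": "user", "content": f"Summary {i} of {n}: {summary}"})
        messages.append(ollama.chat(model=model, messages=messages)["message"])

    # Ask for a consolidated summary.
    messages.append({"role": "user",
                     "content": "That was the last one. Please consolidate your findings."})
    messages.append(ollama.chat(model=model, messages=messages)["message"])

    # Finally, provide the (possibly imperfect) transcription for light updates only.
    messages.append({
        "role": "user",
        "content": "I also tried transcribing this video, but there might be mistakes. "
                   "I will give you the transcription in case it helps to contextualize "
                   "your earlier analysis, but do not change too much. Please re-state "
                   "or slightly update your findings. " + transcription,
    })
    return ollama.chat(model=model, messages=messages)["message"]["content"]
```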
The above orchestration generally produces a reasonable consolidated summary that takes the transcription into account. The style varied, though; sometimes the LLM produced first-person text. So I again asked Llama 3 to massage the text, this time with fresh history (i.e. a new agent with no memory of the previous conversation):
Role | Prompt or response |
---|---|
USER | The following video summary was generated by AI. Remove any language where the AI is speaking in first-person or meta-describing its analysis process. Preface your writing with 'ANALYSIS:' provided summary |
ASSISTANT | response |
Function calling or more stringent formatting would be better, but this worked well enough.
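For completeness, a minimal sketch of that cleanup pass (again via the Ollama library, with no prior chat history; the 'ANALYSIS:' prefix is just an ad-hoc formatting hook):

```python
import ollama

def strip_first_person(summary, model="llama3"):
    """Second pass with fresh history: rewrite the summary without first-person language."""
    prompt = ("The following video summary was generated by AI. Remove any language "
              "where the AI is speaking in first-person or meta-describing its analysis "
              "process. Preface your writing with 'ANALYSIS:' " + summary)
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    text = response["message"]["content"]
    # Keep only what follows the 'ANALYSIS:' marker, if the model included it.
    return text.split("ANALYSIS:", 1)[-1].strip()
```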
Now I had a consolidated summary and full transcription for each video. To make the metadata searchable, I used meilisearch to locally index and visualize the summaries. I followed the quick start docs and modeled my metadata after their movies.json. I slightly customized their built-in search preview and got things running on localhost. The full-text search is not as sophisticated as some alternatives, but it worked well for a proof of concept.
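A sketch of the indexing step with the meilisearch Python client - the key and field names here are illustrative placeholders, modeled loosely after movies.json:

```python
import meilisearch

# Assumes a local Meilisearch instance started per the quick start docs.
client = meilisearch.Client("http://localhost:7700", "aSampleMasterKey")
index = client.index("videos")

# One document per video; field names are placeholders.
documents = [
    {
        "id": 1,
        "title": "example video",
        "summary": "ANALYSIS: ...",   # consolidated Llama 3 summary
        "transcription": "...",       # full Whisper transcription
    },
]
index.add_documents(documents)
```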
If you liked this, I also recommend I accidentally built a meme search engine by Harper Reed. It focuses on static images and is a great read. I tried running some thumbnails through CLIP and saw similar results.
If I had more time, I might try these other ideas:
There are likely better ways to solve this problem from first principles - but this project was a nice way for me to locally experiment with these models.
Thanks for reading! You can find other posts and contact info here.