a benchmark and prototype for following along
📅 april 2026
Benchmark: karanbirsingh.github.io/live-gurbani-captioning-benchmark-v1
Prototype: bani.karanbirsingh.com (scores around 65-70%)
In the Sikh tradition, Gurbani Kirtan (singing of Sikh scripture) is at the center of spiritual practice.
"The poetry was initially compiled by the Sikh's 5th Guru, Arjun Dev, (1563-1606) and consisted of verses from Guru Nanak as well as the poetry of other mystics such as the Muslim Sufi 'Baba' Farid (1173-1266), the mystic-poet Kabir (1440-1518) and numerous Hindu figures including the famous 12th century bard Jayadeva.
The compilation was gradually extended by subsequent gurus – all of whom were poets as well as musicians – adding more verses with a clear indication of the various ragas prescribed for each of the poems."
— darbar.org
Here is a Kirtan clip from Bhai Manbir Singh Ji.
You will notice two things:
So in this project, we explored live 'follow-along' captioning for Gurbani:
We share the following:
The benchmark is described in detail at karanbirsingh.github.io/live-gurbani-captioning-benchmark-v1. The four chosen recordings are intentionally "simple": they contain no Katha, Simran, secondary-Shabad interludes, cross-Shabad transitions, Paltas, etc.
To start, there are four annotated Kirtan recordings. One is visualized in this video:
Given audio, the system must produce (shabad_id, line_id) pairs for each timestamp. This means that a system which outputs Gurmukhi text must normalize it to a known Shabad line to succeed. Because scoring compares line IDs rather than raw text, spelling variations in the transcription do not count as errors.
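As a rough illustration of that normalization step, the sketch below maps a noisy transcription to the closest known line using character-level similarity from Python's standard `difflib`. Everything here is an assumption made for illustration: the `KNOWN_LINES` table, the English placeholder strings standing in for Gurmukhi lines, the `normalize_to_line` name, and the `min_ratio` threshold. It is not the matcher used by the benchmark or the prototype.

```python
from difflib import SequenceMatcher

# Hypothetical mini-corpus: (shabad_id, line_id) -> canonical line text.
# English placeholders stand in for real Gurmukhi lines.
KNOWN_LINES = {
    ("S1", 0): "line one of the first shabad",
    ("S1", 1): "line two of the first shabad",
    ("S2", 0): "opening line of another shabad",
}


def normalize_to_line(noisy_text, known_lines=KNOWN_LINES, min_ratio=0.6):
    """Return the (shabad_id, line_id) most similar to noisy_text, or None.

    min_ratio is an assumed cut-off: below it, no line is reported.
    """
    best_key, best_ratio = None, 0.0
    for key, text in known_lines.items():
        ratio = SequenceMatcher(None, noisy_text, text).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    return best_key if best_ratio >= min_ratio else None


# A misspelled transcription still resolves to a line ID.
print(normalize_to_line("lin two of the frst shabad"))
```

Returning `None` for weak matches lets a caller distinguish "no confident line" from a wrong line, which matters when the output is scored per timestamp.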
Beyond that, systems vary on two axes: whether they run live (captions at time T can only use audio from before T) or offline (the full recording is available up-front), and whether they are Shabad-aware (told the Shabad by a human) or Shabad-unaware (must identify it themselves). That gives four variants:
| | Shabad-aware | Shabad-unaware |
|---|---|---|
| Offline | easiest — line alignment only | identify Shabad, then align |
| Live | follow along in real time | hardest — identify & follow live |
For each Shabad, the benchmark also includes two "cold-start" copies starting at 33% and 66% in, to simulate joining the Kirtan audio late.
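The cold-start copies can be thought of as extra test cases derived from each recording. The sketch below computes the start offsets and expands a set of recordings into full and cold-start cases. The recording names and durations are made up for illustration, and the function names are hypothetical; the benchmark's own tooling may build these cases differently.

```python
def cold_start_offsets(duration_s, fractions=(0.33, 0.66)):
    """Return start offsets (in seconds) for the cold-start copies."""
    return [round(duration_s * f, 2) for f in fractions]


def expand_cases(recordings):
    """Expand {name: duration_s} into (name, start_offset_s) test cases."""
    cases = []
    for name, duration_s in recordings.items():
        cases.append((name, 0.0))  # the full recording
        for offset in cold_start_offsets(duration_s):
            cases.append((name, offset))  # late-join copies
    return cases


# Hypothetical durations for four recordings.
recordings = {"rec1": 300.0, "rec2": 420.0, "rec3": 250.0, "rec4": 600.0}
cases = expand_cases(recordings)
print(len(cases))  # 4 full recordings plus 8 cold-start copies
```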
A system that does well here could help with other tasks too — breaking long Kirtan programs into Shabad chapters, seeding SikhiToTheMax-driven captions for review, or organizing archival collections.
Other factors: Speech-to-text models differ in size, latency, and deployment cost. This benchmark indirectly measures latency but not cost. Production ASR systems can produce close-looking Gurmukhi but are quite large. For example, Google Chirp has over a billion parameters and therefore must run in the cloud at scale (~$1 per hour of audio). Smaller models have far fewer parameters but are useful in other ways: they can run efficiently on everyday devices (like a phone), and audio can stay on the user's device.
We prototype a system for the live and Shabad-unaware variant to set a baseline. It scores around 60-70%.
Each visualization below shows the correct line, the predicted line, and a green/red widget showing when predictions were accurate. Naturally, there is a consistent lag between the correct line and the predicted line.
The visualization helps explain what leads to a particular score. You can hover over the blocks. Notice that the second recording has quite a low score; the visualization shows that Shabad identification time is the primary issue. Additionally, the state machine is too conservative at the end of the Shabad, so lines do not update.
Hover a segment to see its line text; drag the bar to scrub the audio. Toggle "Show cold-start copies" to see the eight additional cases that begin 33% / 66% into each recording.
We deploy the prototype to a live website at bani.karanbirsingh.com that attempts to continuously caption Kirtan recordings from Sikhnet Radio. We also imagine an example user interface.
At a high level, the system works in three stages. A speech-to-text model consumes a rolling audio window and emits a noisy Gurmukhi transcription every few seconds. A phonetic matcher produces ranked Shabad candidates from SGGS. A state machine decides when to confirm a Shabad and/or switch from line to line.
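To make the third stage more concrete, here is a minimal sketch of a state machine that waits for several consecutive windows to agree before confirming a Shabad, then follows line changes within it. The `Follower` class, the `confirm_after` threshold, and the consecutive-agreement rule are all assumptions for illustration; the post does not describe the prototype's actual confirmation or switching logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Follower:
    confirm_after: int = 3  # assumed threshold, not taken from the prototype
    shabad_id: Optional[str] = None
    line_id: Optional[int] = None
    _candidate: Optional[str] = None
    _streak: int = 0

    def update(self, top_shabad: str, top_line: int) -> Tuple[Optional[str], Optional[int]]:
        """Feed the matcher's top result for one window; return the current caption."""
        if self.shabad_id is None:
            # Count consecutive windows that agree on the same Shabad.
            if top_shabad == self._candidate:
                self._streak += 1
            else:
                self._candidate, self._streak = top_shabad, 1
            if self._streak >= self.confirm_after:
                self.shabad_id, self.line_id = top_shabad, top_line
        elif top_shabad == self.shabad_id:
            # Once confirmed, only follow line changes inside that Shabad.
            self.line_id = top_line
        # Results pointing at a different Shabad are ignored in this sketch.
        return self.shabad_id, self.line_id


follower = Follower(confirm_after=2)
for shabad, line in [("S1", 0), ("S2", 4), ("S2", 4), ("S1", 0), ("S2", 5)]:
    print(follower.update(shabad, line))
```

A higher `confirm_after` trades identification delay for fewer false confirmations, which is the same tension visible in the benchmark visualizations.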
We fine-tuned a 118M-parameter Punjabi Conformer from AI4Bharat on aligned Kirtan clips and exported it to INT8 ONNX for CPU inference.
On a single Apple Silicon CPU core, a 10-second window runs in ~490 ms (RTF ≈ 0.05). The prototype is deployed to a dedicated vCPU on Fly and runs at RTF ≈ 0.08.
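The real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean the model keeps up with live audio. The short sketch below reproduces the arithmetic for the Apple Silicon figure. The second calculation assumes the deployed service also processes 10-second windows, which the post does not state.

```python
def real_time_factor(processing_s, audio_s):
    """RTF = time spent processing / duration of audio processed."""
    return processing_s / audio_s


# 490 ms to process a 10-second window on one Apple Silicon core.
rtf_local = real_time_factor(0.490, 10.0)
print(round(rtf_local, 3))  # 0.049, which rounds to the quoted 0.05

# Assumption: the same 10-second window on the deployed vCPU at RTF 0.08.
implied_processing_s = 0.08 * 10.0
print(round(implied_processing_s, 2))  # about 0.8 seconds per window
```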
To build training data, we started from rough transcriptions of Kirtan recordings (from existing subtitles, larger ASR models, or our own earlier models). We aligned each recording to the Shabad's canonical SGGS lines for line-level context, then snapped the rough words to each line to fix spelling drift and word-boundary errors. Only the high-confidence lines were kept as training labels.
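One way to picture the snapping and filtering step: compare each rough line to its canonical line with spaces removed (so word-boundary errors do not count against it), and use the canonical text as the label only when the similarity is high. This is a sketch under assumptions, not the actual pipeline: `snap_line`, the `min_conf` threshold, the use of `difflib`, and the English placeholder strings are all hypothetical.

```python
from difflib import SequenceMatcher


def snap_line(rough, canonical, min_conf=0.8):
    """Return (label, confidence); label is None for low-confidence lines."""
    # Ignore spaces so word-boundary errors do not lower the score.
    a = rough.replace(" ", "")
    b = canonical.replace(" ", "")
    conf = SequenceMatcher(None, a, b).ratio()
    # A confident match is replaced by the canonical text, fixing spelling drift.
    return (canonical, conf) if conf >= min_conf else (None, conf)


# Placeholder pairs of (rough transcription, canonical line).
pairs = [
    ("lineone of thefirst shabd", "line one of the first shabad"),
    ("completely different words", "line one of the first shabad"),
]
labels = [snap_line(rough, canonical)[0] for rough, canonical in pairs]
training_labels = [label for label in labels if label is not None]
print(training_labels)
```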
A few notes on the user interface:
The system demonstrates the idea, but there are many ways to improve:
Thanks for reading; please excuse and forgive any mistakes.