Live captioning for Gurbani Kirtan

a benchmark and prototype for following along

📅 april 2026

Paper: arxiv.org/abs/2607.13457
Benchmark: karanbirsingh.github.io/live-gurbani-captioning-benchmark-v1
Prototype: bani.karanbirsingh.com (scores around 65-70%)

A working session:

Overview

In the Sikh tradition, Gurbani Kirtan (singing of Sikh scripture) is at the center of spiritual practice.

"The poetry was initially compiled by the Sikh's 5th Guru, Arjun Dev, (1563-1606) and consisted of verses from Guru Nanak as well as the poetry of other mystics such as the Muslim Sufi 'Baba' Farid (1173-1266), the mystic-poet Kabir (1440-1518) and numerous Hindu figures including the famous 12th century bard Jayadeva.

The compilation was gradually extended by subsequent gurus – all of whom were poets as well as musicians – adding more verses with a clear indication of the various ragas prescribed for each of the poems."
— darbar.org

Here is a Kirtan clip from Bhai Manbir Singh Ji.

You will notice two things:

At the top, the current line is displayed. This is manually driven by a volunteer, usually through software tools like SikhiToTheMax. Since all Kirtan is sung from known poetry, these tools allow you to search for and find the right verse. SikhiToTheMax also recently added a voice search feature that records a short clip and uses first-letter matching to suggest Shabads.
On the left, YouTube shows its auto-generated captions. They do not completely match the line above. Sikh scripture is considered sacred and immutable; while small typos are acceptable for generic videos (tutorials, sporting events, etc.), such spelling errors in Kirtan would be considered inappropriate by most Sikh audiences.

So in this project, we explored live 'follow-along' captioning for Gurbani:

If such a system is to exist, what might the interface look like?
How is accuracy measured for such a system? Are existing transcription benchmarks the right fit?
Is it technically feasible?

We share the following:

A benchmark with four annotated Kirtan recordings
Initial baseline results from a prototype system
A prototype that tries to caption live Kirtan from Sikhnet Radio

Benchmark

The benchmark is described in detail at karanbirsingh.github.io/live-gurbani-captioning-benchmark-v1. The four chosen recordings are intentionally "simple": they contain no Katha, Simran, secondary-Shabad interludes, cross-Shabad transitions, Paltas, etc.

To start, there are four annotated Kirtan recordings. One is visualized in this video:

Given audio, the system must produce (shabad_id, line_id) pairs for each timestamp. This means if a system outputs Gurmukhi text, it must normalize it to a known Shabad line to succeed. This eliminates spelling errors.
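As a rough illustration of this output format, the sketch below represents reference and predicted captions as timed segments labelled with (shabad_id, line_id) and computes the fraction of sampled timestamps where the two agree. The segment layout, the names `label_at` and `frame_accuracy`, the sampling step, and the choice of plain time-sampled accuracy are all assumptions made for illustration; the benchmark page defines the actual scoring.

```python
from typing import List, Optional, Tuple

Label = Tuple[int, int]                 # (shabad_id, line_id)
Segment = Tuple[float, float, Label]    # (start_sec, end_sec, label)


def label_at(segments: List[Segment], t: float) -> Optional[Label]:
    """Return the label active at time t, or None if no segment covers t."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None


def frame_accuracy(reference: List[Segment], predicted: List[Segment],
                   step: float = 0.5) -> float:
    """Fraction of sampled reference timestamps where the prediction matches."""
    duration = max(end for _, end, _ in reference)
    total = correct = 0
    for i in range(int(duration / step)):
        t = i * step
        ref = label_at(reference, t)
        if ref is None:          # unlabelled gap in the reference: not scored
            continue
        total += 1
        if label_at(predicted, t) == ref:
            correct += 1
    return correct / total if total else 0.0


# Illustrative IDs only; the prediction lags the reference by two seconds.
reference = [(0.0, 10.0, (4725, 3)), (10.0, 20.0, (4725, 4))]
predicted = [(2.0, 12.0, (4725, 3)), (12.0, 20.0, (4725, 4))]
accuracy = frame_accuracy(reference, predicted)   # 32 of 40 samples agree
```

In this toy case a constant two-second lag on two ten-second lines leaves 80% of the samples correct, which shows how lag alone lowers a score even when every line is eventually identified.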

Beyond that, systems vary on two axes: whether they run live (captions at time T can only use audio from before T) or offline (the full recording is available up-front), and whether they are Shabad-aware (told the Shabad by a human) or Shabad-unaware (must identify it themselves). That gives four variants:

	Shabad-aware	Shabad-unaware
Offline	easiest — line alignment only	identify Shabad, then align
Live	follow along in real time	hardest — identify & follow live

For each Shabad, the benchmark also includes two "cold-start" copies starting at 33% and 66% in, to simulate joining the Kirtan audio late.
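A cold-start case can be pictured as the same reference with everything before the start point left unscored. The sketch below is one possible way to derive it, assuming the reference is a list of (start_sec, end_sec, label) segments and that the original timeline is kept rather than re-zeroed; `cold_start_reference` is a hypothetical helper, not part of the benchmark tooling.

```python
from typing import List, Tuple

Label = Tuple[int, int]                 # (shabad_id, line_id)
Segment = Tuple[float, float, Label]    # (start_sec, end_sec, label)


def cold_start_reference(reference: List[Segment],
                         start_percent: int) -> Tuple[float, List[Segment]]:
    """Return the start offset and the reference segments that remain scored."""
    duration = max(end for _, end, _ in reference)
    offset = duration * start_percent / 100
    scored = [(max(start, offset), end, label)       # clip a segment cut by the offset
              for start, end, label in reference
              if end > offset]                       # drop segments fully before it
    return offset, scored


# Illustrative IDs and timings only.
reference = [(0.0, 30.0, (1341, 1)), (30.0, 60.0, (1341, 2)), (60.0, 100.0, (1341, 3))]
offset_33, scored_33 = cold_start_reference(reference, 33)
offset_66, scored_66 = cold_start_reference(reference, 66)
```

The system would then receive audio only from the offset onward and be compared against the clipped segments.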

A system that does well here could help with other tasks too — breaking long Kirtan programs into Shabad chapters, seeding SikhiToTheMax-driven captions for review, or organizing archival collections.

Other factors: Speech-to-text models differ in size, latency, and deployment cost. This benchmark indirectly measures latency but not cost. Production ASR systems can produce close-looking Gurmukhi but are quite large. For example, Google Chirp has over a billion parameters and therefore has to run in the cloud (~$1 per hour of audio). Smaller models have far fewer parameters and are useful in other ways: they can run efficiently on everyday devices (like a phone), and audio can stay on the user's device.

Baseline results

We prototype a system for the live and Shabad-unaware variant to set a baseline. It scores around 60-70%.

Each visualization below shows the correct line, the predicted line, and a green/red widget showing when predictions were accurate. Naturally, there is consistent lag between the correct line and the predicted line.

The visualization helps explain what leads to a particular score. You can hover over the blocks. Notice that the second recording has quite a low score: the visualization shows that Shabad identification time is the primary issue. Additionally, the state machine is too conservative at the end of the Shabad, so lines do not update.

Cold-start copies are the same recordings started 33% and 66% of the way in, scored only on the remaining portion (the dimmed region is unscored). They simulate joining the Kirtan late.

Hover over a segment to see its line text; drag the bar to scrub the audio. Toggle "Show cold-start copies" to see the eight additional cases that begin 33% / 66% into each recording.

Prototype

We deploy the prototype to a live website at bani.karanbirsingh.com that attempts to continuously caption Kirtan recordings from Sikhnet Radio. We also imagine an example user interface.

At a high level, the system works in three stages. A speech-to-text model consumes a rolling audio window and emits a noisy Gurmukhi transcription every few seconds. A phonetic matcher produces ranked Shabad candidates from SGGS. A state machine decides when to confirm a Shabad and/or switch from line to line.
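To make the last two stages concrete, here is a minimal sketch of how ranked candidates might feed a confirm-then-track state machine. It is not the prototype's implementation: plain character similarity from `difflib` stands in for the phonetic matcher, the corpus is a toy dictionary of placeholder strings, line IDs are 0-based list indices, and the names `rank_shabads` and `Tracker` as well as the thresholds `confirm_after` and `min_score` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

Corpus = Dict[int, List[str]]   # shabad_id -> canonical lines


def similarity(a: str, b: str) -> float:
    """Character similarity in [0, 1]; a simple stand-in for a phonetic matcher."""
    return SequenceMatcher(None, a.replace(" ", ""), b.replace(" ", "")).ratio()


def best_line(transcript: str, lines: List[str]) -> Tuple[int, float]:
    """Index and score of the line most similar to the transcript."""
    scores = [similarity(transcript, line) for line in lines]
    index = max(range(len(lines)), key=scores.__getitem__)
    return index, scores[index]


def rank_shabads(transcript: str, corpus: Corpus) -> List[Tuple[int, int, float]]:
    """Candidates as (shabad_id, line_id, score), best first."""
    ranked = [(sid, *best_line(transcript, lines)) for sid, lines in corpus.items()]
    return sorted(ranked, key=lambda c: c[2], reverse=True)


@dataclass
class Tracker:
    corpus: Corpus
    confirm_after: int = 2    # assumed: consecutive agreeing windows before confirming
    min_score: float = 0.6    # assumed: similarity needed to trust a match
    shabad_id: Optional[int] = None
    line_id: Optional[int] = None
    streak_id: Optional[int] = None
    streak: int = 0

    def update(self, transcript: str) -> Optional[Tuple[int, int]]:
        """Consume one transcript window; return the current caption, if any."""
        if self.shabad_id is None:                        # state: identifying
            sid, line, score = rank_shabads(transcript, self.corpus)[0]
            if score < self.min_score:
                self.streak_id, self.streak = None, 0
                return None
            self.streak = self.streak + 1 if sid == self.streak_id else 1
            self.streak_id = sid
            if self.streak < self.confirm_after:
                return None
            self.shabad_id, self.line_id = sid, line      # confirmed
            return self.shabad_id, self.line_id
        # state: tracking lines within the confirmed Shabad
        line, score = best_line(transcript, self.corpus[self.shabad_id])
        if score >= self.min_score:                       # otherwise hold the last line
            self.line_id = line
        return self.shabad_id, self.line_id

    def reset(self) -> None:
        """Return to Shabad identification."""
        self.shabad_id = self.line_id = self.streak_id = None
        self.streak = 0


# Placeholder strings stand in for Gurmukhi lines and noisy transcripts.
corpus = {10: ["aaa bbb ccc", "ddd eee fff"], 20: ["ggg hhh iii", "jjj kkk lll"]}
tracker = Tracker(corpus)
captions = [tracker.update(t) for t in
            ["aaa bbb ccc", "aaa bbb ccc", "ddd eee fff", "zzzz"]]
# captions == [None, (10, 0), (10, 1), (10, 1)]
```

The confirmation streak trades identification speed for stability: a higher `confirm_after` delays the first caption but makes a wrong Shabad less likely, and holding the last line on a weak match keeps the display steady through noisy windows.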

Model

We fine-tuned a 118M-parameter Punjabi Conformer from AI4Bharat on aligned Kirtan clips and exported it to INT8 ONNX for CPU inference.

On a single Apple Silicon CPU core, a 10-second window runs in ~490 ms (RTF ≈ 0.05). The prototype is deployed to a dedicated vCPU on Fly and runs at RTF ≈ 0.08.
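The real-time factor (RTF) is processing time divided by the duration of the audio processed, so values below 1 mean the model keeps up with incoming audio. The short sketch below reproduces the arithmetic from the figures above; `real_time_factor` is just an illustrative helper name.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio transcribed."""
    return processing_seconds / audio_seconds


window_seconds = 10.0
local_rtf = real_time_factor(0.490, window_seconds)   # Apple Silicon figure: 0.049, i.e. about 0.05
deployed_seconds = 0.08 * window_seconds               # about 0.8 s per window at the deployed RTF
```

How much headroom this leaves in practice also depends on how often a new rolling window is transcribed, which these two numbers alone do not capture.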

To build training data, we started from rough transcriptions of Kirtan recordings (from existing subtitles, larger ASR models, or our own earlier models). We aligned each recording to the Shabad's canonical SGGS lines for line-level context, then snapped the rough words to each line to fix spelling drift and word-boundary errors. Only the high-confidence lines were kept as training labels. Three examples:

Shabad 4725 · line 3 · 0:00–0:01

SGGS line: ਐਸੇ ਗੁਰ ਕਉ ਬਲਿ ਬਲਿ ਜਾਈਐ ਆਪਿ ਮੁਕਤੁ ਮੋਹਿ ਤਾਰੈ

ASR: ਐਸੇ ਗੁਰ ਕੋ

Snap: ਐਸੇ ਗੁਰ ਕਉ

Shabad 1341 · line 1 · 2:21–2:30

SGGS line: ਕਾਲਬੂਤ ਕੀ ਹਸਤਨੀ ਮਨ ਬਉਰਾ ਰੇ ਚਲਤੁ ਰਚਿਓ ਜਗਦੀਸ

ASR: ਕਾਲ ਬੂਤ ਕੀ ਹਸਤਨੀ ਮਨ ਬਉਰਾ ਰੇ

Snap: ਕਾਲਬੂਤ ਕੀ ਹਸਤਨੀ ਮਨ ਬਉਰਾ ਰੇ

Shabad 2776 · line 6 · 0:00–0:14

SGGS line: ਜਿਨ ਸਤਿਗੁਰੁ ਪਿਆਰਾ ਦੇਖਿਆ ਤਿਨ ਕਉ ਹਉ ਵਾਰੀ

ASR: ਜਿਨਿ ਸਤਿ ਗੁਰ ਪਿਆਰਾ ਦੇਖਿਆ ਤਿਨ ਕਉ ਵਾਰੀ

Snap: ਜਿਨ ਸਤਿਗੁਰੁ ਪਿਆਰਾ ਦੇਖਿਆ ਤਿਨ ਕਉ ਹਉ ਵਾਰੀ
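One way to picture the snapping step in the examples above is as a search for the contiguous run of canonical words that best matches what the ASR heard. The sketch below does this with character similarity after removing spaces, which tolerates both spelling drift and misplaced word boundaries. It is a simplified stand-in and not necessarily the method used to build the training data; `snap_to_line` is a hypothetical name, and any confidence threshold applied to the returned score would be a further assumption.

```python
from difflib import SequenceMatcher
from typing import Tuple


def snap_to_line(asr_text: str, canonical_line: str) -> Tuple[str, float]:
    """Return the span of canonical words closest to the ASR text, and its score."""
    words = canonical_line.split()
    heard = asr_text.replace(" ", "")        # ignore ASR word boundaries
    best_span, best_score = "", 0.0
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            span = words[i:j]
            score = SequenceMatcher(None, heard, "".join(span)).ratio()
            if score > best_score:           # keep the first span with the highest score
                best_span, best_score = " ".join(span), score
    return best_span, best_score


line_a = "ਐਸੇ ਗੁਰ ਕਉ ਬਲਿ ਬਲਿ ਜਾਈਐ ਆਪਿ ਮੁਕਤੁ ਮੋਹਿ ਤਾਰੈ"
line_b = "ਕਾਲਬੂਤ ਕੀ ਹਸਤਨੀ ਮਨ ਬਉਰਾ ਰੇ ਚਲਤੁ ਰਚਿਓ ਜਗਦੀਸ"

snap_a, score_a = snap_to_line("ਐਸੇ ਗੁਰ ਕੋ", line_a)                      # spelling drift
snap_b, score_b = snap_to_line("ਕਾਲ ਬੂਤ ਕੀ ਹਸਤਨੀ ਮਨ ਬਉਰਾ ਰੇ", line_b)   # word-boundary error
```

On the first two examples this returns the same snapped text as shown above. The score could then serve as the confidence used to decide which lines are kept as labels, with the cut-off left as a tuning choice.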

Interface

A few notes on the user interface:

In the benchmark, the system attempts to identify the Shabad autonomously. On the webpage, the top candidates are displayed and any listening user can explicitly confirm one.
During line tracking, an "incorrect? reset tracking" button moves the system back into Shabad candidate identification.
User actions are broadcast to all listeners.

Limitations

The system demonstrates the idea, but there are many ways to improve:

A better-trained model (whether fine-tuned or proprietary like Chirp) would reduce matching errors
Deploying on the edge will enable parallel sessions
The model can only match to SGGS, so Dasam Bani is not identified yet
The line-tracking is sensitive to speed; the system struggles with very slow or very quick recitation
The system does not leverage knowledge about "adjacent" Shabads, so recitation of Akhand Paath or Nitnem is not tracked smoothly
Katha, Simran, and intra-Shabad interludes are not detected separately
The model degrades on low-quality audio
…and many more

SikhiToTheMax has a voice search feature; it tries to find the first letter of each word in the recording and maps those letters to its usual search.
Surinder Singh Veerji from Sacramento is collating datasets and fine-tuning a 200M-parameter Whisper-based model: huggingface.co/surindersinghssj. This data is quite extensive!

Thanks for reading; please excuse and forgive any mistakes.
