a benchmark to define accuracy & a deployed prototype
April 2026
Benchmark: github.com/karanbirsingh/live-gurbani-captioning-benchmark-v1
Prototype: bani.karanbirsingh.com (scores around 65-70% on the benchmark)
In the Sikh tradition, Gurbani Kirtan (singing of Sikh scripture) is at the center of spiritual practice.
"The poetry was initially compiled by the Sikh's 5th Guru, Arjun Dev, (1563-1606) and consisted of verses from Guru Nanak as well as the poetry of other mystics such as the Muslim Sufi 'Baba' Farid (1173-1266), the mystic-poet Kabir (1440-1518) and numerous Hindu figures including the famous 12th century bard Jayadeva.
The compilation was gradually extended by subsequent gurus - all of whom were poets as well as musicians - adding more verses with a clear indication of the various ragas prescribed for each of the poems."
- darbar.org
Here is a Kirtan clip from Bhai Manbir Singh Ji.
You will notice two things:
So in this project, we explored live captioning for Gurbani:
We share the following artifacts:
The benchmark is described in detail at github.com/karanbirsingh/live-gurbani-captioning-benchmark-v1. The four chosen recordings are intentionally "simple": they contain no Katha, Simran, secondary-Shabad interludes, cross-Shabad transitions, Paltas, etc.
To start, there are four annotated Kirtan recordings. One is visualized in this video:
Given audio, the system must produce a (shabad_id, line_id) pair for each timestamp. A system that outputs Gurmukhi text must therefore normalize it to a known Shabad line to be scored as correct, which removes spelling variation from the evaluation.
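To make the normalization step concrete, here is a minimal sketch, not the benchmark's or the prototype's code. It assumes a hypothetical in-memory catalogue keyed by (shabad_id, line_id), uses placeholder ASCII strings in place of Gurmukhi lines, and picks the closest line by plain character similarity from Python's standard `difflib`. The `min_ratio` threshold is an arbitrary illustrative value.

```python
from difflib import SequenceMatcher

# Hypothetical catalogue: (shabad_id, line_id) -> canonical line text.
# Placeholder ASCII strings stand in for real Gurmukhi lines.
CATALOGUE = {
    (101, 0): "alpha beta gamma delta",
    (101, 1): "epsilon zeta eta theta",
    (202, 0): "iota kappa lambda mu",
}

def normalize(transcript, catalogue=CATALOGUE, min_ratio=0.6):
    """Map noisy text to the best-matching (shabad_id, line_id), or None."""
    best_key, best_ratio = None, 0.0
    for key, text in catalogue.items():
        ratio = SequenceMatcher(None, transcript, text).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = key, ratio
    # Refuse to answer when nothing in the catalogue is a plausible match.
    return best_key if best_ratio >= min_ratio else None

# A misspelled transcription still resolves to a known line.
print(normalize("alfa beta gama delta"))
```

Because the output is an identifier rather than free text, two systems that spell a word differently but land on the same line receive the same score.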
Beyond that, systems vary on two axes: whether they run live (captions at time T can only use audio from before T) or offline (the full recording is available up-front), and whether they are Shabad-aware (told the Shabad by a human) or Shabad-unaware (must identify it themselves). That gives four variants:
| | Shabad-aware | Shabad-unaware |
|---|---|---|
| Offline | easiest: line alignment only | identify Shabad, then align |
| Live | follow along in real time | hardest: identify & follow live |
For each Shabad, the benchmark also includes two "cold-start" copies starting at 33% and 66% in, to simulate joining the Kirtan audio late.
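One way such a cold-start copy could be derived from an annotated recording is sketched below. The segment format (`start_s`, `end_s`, `line_id`) and the function name `cold_start` are assumptions for illustration; the benchmark repository defines the actual file layout.

```python
def cold_start(segments, duration, fraction):
    """Trim annotations so the recording starts `fraction` of the way in.

    segments: list of (start_s, end_s, line_id) tuples, sorted by time.
    Returns (offset_s, shifted_segments), with times relative to the new start.
    """
    offset = duration * fraction
    shifted = []
    for start, end, line_id in segments:
        if end <= offset:
            continue  # this line finished before the join point
        # Clip a line that straddles the join point, then shift to the new zero.
        shifted.append((max(start, offset) - offset, end - offset, line_id))
    return offset, shifted

annotations = [(0, 40, 0), (40, 80, 1), (80, 100, 2)]
for fraction in (0.33, 0.66):
    offset, shifted = cold_start(annotations, duration=100, fraction=fraction)
    print(f"join at {offset:.0f}s -> {len(shifted)} remaining segments")
```

The audio would be trimmed at the same offset, so a system under test never hears the opening lines of the Shabad.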
A system that does well here could help with other tasks too: breaking long Kirtan programs into Shabad chapters, seeding SikhiToTheMax-driven captions for review, or organizing archival collections.
Other factors: Speech-to-text models differ in size, latency, and deployment cost. This benchmark indirectly measures latency but not cost. Production ASR systems like Google Chirp have over a billion parameters and produce Gurmukhi that looks close to correct, but they must run in the cloud at ~$1 per hour of audio. Smaller models can run for free on end-user hardware (like a phone), and the audio never leaves the device.
We prototype a system (technical details later) for the hardest variant (live and Shabad-unaware) and evaluate it on the benchmark to provide an initial baseline. The four base recordings average 70.4% frame accuracy; the average across all 12 cases (including the eight cold-start copies that join mid-Shabad) is 62.7%.
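For intuition about what a frame accuracy number means, the sketch below scores a prediction against a reference at fixed time steps. The 1-second hop, the segment format, and the choice to skip unannotated frames are all assumptions here; the benchmark repository is the authority on the exact scoring rules.

```python
def label_at(segments, t):
    """Return the (shabad_id, line_id) label active at time t, or None."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None

def frame_accuracy(reference, predicted, duration, hop=1.0):
    """Fraction of annotated frames where the prediction matches the reference."""
    scored = correct = 0
    for i in range(int(duration / hop)):
        t = i * hop
        ref = label_at(reference, t)
        if ref is None:
            continue  # unannotated frame: not scored in this sketch
        scored += 1
        correct += ref == label_at(predicted, t)
    return correct / scored if scored else 0.0

# The prediction is correct but lags the reference by 2 seconds.
reference = [(0, 5, (1, 0)), (5, 10, (1, 1))]
predicted = [(2, 7, (1, 0)), (7, 10, (1, 1))]
print(frame_accuracy(reference, predicted, duration=10))  # 0.6
```

In this toy case the system never picks a wrong line, yet it loses 40% of frames to lag alone, which is why latency shows up indirectly in the score.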
Each visualization below shows the correct line, the predicted line, and a green/red widget showing when predictions were accurate. Naturally, there is consistent lag between the correct line and the predicted line.
The visualization is useful to understand what leads to a particular score. You can hover over the blocks. Notice that the second recording has quite a low score: the visualization shows that Shabad identification time is the primary issue. Additionally, the state machine is too conservative at the end of the Shabad, so lines do not update.
Hover a segment to see its line text; drag the bar to scrub the audio. Toggle "Show cold-start copies" to see the eight additional cases that begin 33% / 66% into each recording.
We deploy the prototype to a live website at bani.karanbirsingh.com that attempts to continuously caption Kirtan recordings from Sikhnet Radio. We also imagine an example user interface.
At a high level, the system works in three stages. A speech-to-text model consumes a rolling audio window and emits a noisy Gurmukhi transcription every few seconds. A phonetic matcher produces ranked Shabad candidates from SGGS. A state machine decides when to confirm a Shabad and/or switch from line to line.
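The third stage can be pictured with a toy state machine like the one below. This is an illustrative sketch only: the class name `LineTracker`, the consecutive-vote rule, and the `confirm_after` / `switch_after` thresholds are assumptions, not the prototype's actual logic. It takes the matcher's top (shabad_id, line_id) candidate for each audio window and only changes state after several windows agree.

```python
class LineTracker:
    """Toy state machine: confirm a Shabad, then follow lines within it."""

    def __init__(self, confirm_after=3, switch_after=2):
        self.confirm_after = confirm_after  # agreeing windows to confirm a Shabad
        self.switch_after = switch_after    # agreeing windows to change line
        self.shabad_id = None               # confirmed Shabad, if any
        self.line_id = None                 # line currently displayed
        self._candidate = None              # value waiting for enough votes
        self._votes = 0

    def _vote(self, value, needed):
        """Count consecutive votes for `value`; True once `needed` is reached."""
        if value == self._candidate:
            self._votes += 1
        else:
            self._candidate, self._votes = value, 1
        return self._votes >= needed

    def _reset(self):
        self._candidate, self._votes = None, 0

    def update(self, shabad_id, line_id):
        """Feed the matcher's top candidate for the latest window."""
        if shabad_id is None:
            return self.shabad_id, self.line_id  # matcher had nothing to offer
        if self.shabad_id is None:
            # Identification phase: wait for a stable Shabad candidate.
            if self._vote(shabad_id, self.confirm_after):
                self.shabad_id, self.line_id = shabad_id, line_id
                self._reset()
        elif shabad_id == self.shabad_id and line_id != self.line_id:
            # Following phase: wait for a stable new line in the same Shabad.
            if self._vote(line_id, self.switch_after):
                self.line_id = line_id
                self._reset()
        else:
            self._reset()  # same line again, or a stray other-Shabad match
        return self.shabad_id, self.line_id

tracker = LineTracker()
for candidate in [(7, 0), (7, 0), (7, 0), (7, 1), (9, 4), (7, 1), (7, 1)]:
    print(candidate, "->", tracker.update(*candidate))
```

Higher thresholds make the captions steadier but slower to react, which is the trade-off behind both the identification delay and the end-of-Shabad stalls described earlier.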
We lightly fine-tuned a 118M-parameter Conformer from ai4bharat/indicconformer_stt_pa_hybrid_ctc_rnnt_large and exported it to INT8 ONNX for CPU inference.
On a single Apple Silicon CPU core, a 10-second window runs in ~490 ms (RTF ≈ 0.05). The prototype is deployed to a dedicated vCPU on Fly and runs at RTF ≈ 0.08.
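The real-time factor (RTF) is processing time divided by audio duration, so values below 1 mean the model keeps up with live audio. The arithmetic behind the figures above, as a short sketch (the function name is illustrative):

```python
def real_time_factor(processing_s, audio_s):
    """Processing time divided by the duration of audio processed."""
    return processing_s / audio_s

# Apple Silicon: a 10-second window processed in about 490 ms.
local_rtf = real_time_factor(0.490, 10.0)
print(round(local_rtf, 3))  # 0.049, i.e. roughly 0.05

# Deployed: RTF of about 0.08 implies roughly 0.8 s per 10-second window.
deployed_seconds = 0.08 * 10.0
print(round(deployed_seconds, 2))  # 0.8
```

Either way, the model spends well under a tenth of each window's duration on inference, leaving headroom for matching and state updates.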
To build training data (offline, not at inference time), we started from rough transcriptions of Kirtan recordings (from existing subtitles, larger ASR models, or our own earlier models), aligned each recording to the Shabad's canonical SGGS lines for line-level context, then snapped the rough words to each line to fix spelling drift and word-boundary errors, keeping only the high-confidence lines as training labels.
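The snapping and filtering step might look something like the sketch below. It is a simplified stand-in, not the pipeline actually used: the function name `snap_line`, the character-similarity measure from `difflib`, and the 0.8 threshold are all assumptions, and placeholder ASCII words stand in for Gurmukhi. Spaces are dropped before comparing so that word-boundary errors do not count against a line.

```python
from difflib import SequenceMatcher

def snap_line(rough_words, canonical_words, min_confidence=0.8):
    """Replace rough words with the canonical line if they match well enough.

    Returns (canonical_words, confidence), or None for low-confidence lines.
    """
    # Compare without spaces so "be ta" and "beta" look the same.
    rough = "".join(rough_words)
    canonical = "".join(canonical_words)
    confidence = SequenceMatcher(None, rough, canonical).ratio()
    if confidence < min_confidence:
        return None  # too uncertain to keep as a training label
    return canonical_words, confidence

# Spelling drift ("alfa") and a split word ("be", "ta") still snap to the line.
print(snap_line(["alfa", "be", "ta", "gamma"], ["alpha", "beta", "gamma"]))

# An unrelated transcription is rejected rather than force-fitted.
print(snap_line(["iota", "kappa"], ["alpha", "beta", "gamma"]))
```

Discarding the low-confidence lines trades training-set size for label quality, since a wrongly snapped line would teach the model the wrong text for that audio.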
A few notes on the user interface:
The system demonstrates the idea, but there are many ways to improve:
Thanks for reading; please excuse and forgive any mistakes.