asr
Transcribe mixed audio into timed captions. Wire the ASR node into sinks that support subtitles (segment, rtmp_push, RTC sinks, etc.).
Example
    {
      "name": "captions",
      "type": "asr",
      "inputs": ["mix_audio"],
      "config": {
        "provider": "deepgram",
        "config": {
          "apiKey": "<deepgram-api-key>",
          "language": "en-US",
          "model": "nova-3"
        }
      }
    }

Providers
| Provider | provider value |
|---|---|
| Deepgram | deepgram |
| OpenAI | openai |
| Google Cloud STT | google |
| ElevenLabs Scribe | elevenlabs |
| Azure Speech | azure |
| Volcengine | volcengine |
Each provider has its own config object (API keys, language, model). See your provider’s docs for credentials.
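As a rough illustration of how a `provider` value from the table slots into the node definition, the sketch below builds an `asr` node dictionary in the same shape as the example at the top of this page and rejects unknown provider values. The helper name `make_asr_node` and the validation step are assumptions made for this example, not part of the product. The inner `config` keys differ per provider, so they are passed through untouched.

```python
import json

# Provider values copied from the table above.
KNOWN_PROVIDERS = {"deepgram", "openai", "google", "elevenlabs", "azure", "volcengine"}


def make_asr_node(name, inputs, provider, provider_config):
    """Build an ASR node dict shaped like the example on this page (hypothetical helper)."""
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown ASR provider: {provider!r}")
    return {
        "name": name,
        "type": "asr",
        "inputs": list(inputs),
        # Outer config selects the provider; inner config is provider-specific.
        "config": {"provider": provider, "config": dict(provider_config)},
    }


node = make_asr_node(
    "captions",
    ["mix_audio"],
    "deepgram",
    {"apiKey": "<deepgram-api-key>", "language": "en-US", "model": "nova-3"},
)
print(json.dumps(node, indent=2))
```

Validating the provider string before submitting a job is a local convenience only; the service itself remains the authority on which values it accepts.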
Output format
ASR does not produce video or audio streams. It emits caption events with this JSON body:
| Field | Type | Description |
|---|---|---|
| text | string | Transcribed text for this utterance |
| startMs | number | Start time (ms from job start) |
| endMs | number | End time (ms from job start) |
| speaker | string | Display label for the speaker |
| final | boolean | true when the utterance is committed; false for interim partials |
    {
      "text": "Welcome to the show.",
      "startMs": 12400,
      "endMs": 15800,
      "speaker": "Alice",
      "final": true
    }

When audio passes through an audio_mixer, speaker is resolved from the active-speaker timeline. With per-track ASR (no mixer), it comes from the participant identity.
Providers may emit interim results (final: false) while a phrase is still being spoken. Downstream sinks decide whether to forward or keep only finals — see below.
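A consumer that only wants committed utterances can filter on the `final` flag. The following is a minimal sketch of that idea, assuming caption bodies arrive as JSON strings with the fields listed in the table. The `CaptionEvent` class, the function names, and the interim sample payload are invented for this illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionEvent:
    text: str
    start_ms: int
    end_ms: int
    speaker: str
    final: bool


def parse_caption(raw: str) -> CaptionEvent:
    """Decode one caption JSON body using the documented field names."""
    body = json.loads(raw)
    return CaptionEvent(
        text=body["text"],
        start_ms=body["startMs"],
        end_ms=body["endMs"],
        speaker=body["speaker"],
        final=body["final"],
    )


def finals_only(raw_events):
    """Yield committed utterances and drop interim partials."""
    for raw in raw_events:
        event = parse_caption(raw)
        if event.final:
            yield event


# Made-up sample stream: one interim partial followed by the committed utterance.
stream = [
    '{"text": "Welcome to", "startMs": 12400, "endMs": 13900, "speaker": "Alice", "final": false}',
    '{"text": "Welcome to the show.", "startMs": 12400, "endMs": 15800, "speaker": "Alice", "final": true}',
]
committed = list(finals_only(stream))
print(committed[0].text)
```

A live-caption overlay would more likely render interim events too and replace them once the matching final arrives; dropping them is the simpler choice for archives and transcripts.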
Downstream delivery
Wire ASR into a sink’s inputs alongside your audio/video mixers. The same caption payload is encoded differently per sink:
| Sink | How captions are delivered |
|---|---|
| segment | HLS WebVTT subtitle track (.vtt); only final: true cues are written. Use captionShowSpeaker to include speaker names. |
| livekit | Reliable publishData with topic avflow.caption; body is the JSON above. |
| jitsi | Endpoint message { "type": "avflow.caption", … } with the same fields. |
| daily | App message { "type": "avflow.caption", … }. |
| agora | Data-stream message { "type": "avflow.caption", … }. |
| rtmp_push | FLV onCaption script tag; JSON body; interim and final both forwarded. |
| srt_push | MPEG-TS ID3 private frame (owner avflow.caption). |
| whip_push | WHIP caption data channel (requires metadata sidecar). |
image does not accept caption input.
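To make the segment row more concrete, here is a hedged sketch of how a final caption event could map onto a WebVTT cue. It is not the sink's actual implementation. The `show_speaker` parameter only mimics what captionShowSpeaker is described as doing, the plain `Name: ` prefix is an assumed rendering, and any offset between job time and HLS media time is ignored.

```python
def vtt_timestamp(ms: int) -> str:
    """Format milliseconds as a WebVTT timestamp (hh:mm:ss.mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def caption_to_cue(event: dict, show_speaker: bool = False):
    """Return a WebVTT cue string for a final caption, or None for an interim one."""
    if not event["final"]:
        return None  # only committed utterances become cues
    text = event["text"]
    if show_speaker:
        # Assumption: the speaker label is rendered as a plain "Name: " prefix.
        text = f'{event["speaker"]}: {text}'
    start = vtt_timestamp(event["startMs"])
    end = vtt_timestamp(event["endMs"])
    return f"{start} --> {end}\n{text}"


sample = {
    "text": "Welcome to the show.",
    "startMs": 12400,
    "endMs": 15800,
    "speaker": "Alice",
    "final": True,
}
print(caption_to_cue(sample, show_speaker=True))
# 00:00:12.400 --> 00:00:15.800
# Alice: Welcome to the show.
```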
Wiring
    room_src ──► audio_mixer ──► segment
                      └──► asr ──┘

Pricing
$0.012/min per speaker of actual speech (silence not billed). Node pricing.
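As a worked example of the rate, the sketch below estimates cost from per-speaker speech durations. The function name and the sample durations are made up, and it assumes billing is linear in seconds of speech with no rounding or minimum charge, which the pricing page may define differently.

```python
from decimal import Decimal

# USD per minute of actual speech, per speaker (rate quoted above).
RATE_PER_SPEAKER_MINUTE = Decimal("0.012")


def estimate_asr_cost(speech_seconds_by_speaker: dict) -> Decimal:
    """Sum speech time across speakers and apply the per-minute rate."""
    total_seconds = sum(speech_seconds_by_speaker.values())
    return RATE_PER_SPEAKER_MINUTE * Decimal(total_seconds) / Decimal(60)


# Alice speaks for 10 minutes and Bob for 5; silence is excluded up front.
cost = estimate_asr_cost({"Alice": 600, "Bob": 300})
print(f"${cost:.3f}")  # $0.180
```

Here 15 minutes of speech at $0.012 per minute comes to $0.18, regardless of how long the job itself ran.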