
asr

Transcribe mixed audio into timed captions. Wire the ASR node into sinks that support subtitles (segment, rtmp_push, RTC sinks, etc.).

{
  "name": "captions",
  "type": "asr",
  "inputs": ["mix_audio"],
  "config": {
    "provider": "deepgram",
    "config": {
      "apiKey": "<deepgram-api-key>",
      "language": "en-US",
      "model": "nova-3"
    }
  }
}

| Provider | provider value |
| --- | --- |
| Deepgram | deepgram |
| OpenAI | openai |
| Google Cloud STT | google |
| ElevenLabs Scribe | elevenlabs |
| Azure Speech | azure |
| Volcengine | volcengine |

Each provider has its own config object (API keys, language, model). See your provider’s docs for credentials.
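As a rough illustration of the node shape above, this Python sketch builds an asr node definition and rejects provider values that are not in the table. The helper name `build_asr_node` is hypothetical and not part of any SDK. The inner config is passed through untouched, since its keys differ per provider.

```python
import json

# Provider values copied from the table above.
ASR_PROVIDERS = {"deepgram", "openai", "google", "elevenlabs", "azure", "volcengine"}


def build_asr_node(name, inputs, provider, provider_config):
    """Return an asr node definition as a plain dict (hypothetical helper)."""
    if provider not in ASR_PROVIDERS:
        raise ValueError(f"unknown ASR provider: {provider!r}")
    return {
        "name": name,
        "type": "asr",
        "inputs": list(inputs),
        # Outer config selects the provider; inner config is provider-specific.
        "config": {"provider": provider, "config": dict(provider_config)},
    }


node = build_asr_node(
    "captions",
    ["mix_audio"],
    "deepgram",
    {"apiKey": "<deepgram-api-key>", "language": "en-US", "model": "nova-3"},
)
print(json.dumps(node, indent=2))
```

Validating the provider value locally catches typos before the job definition is submitted.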

ASR does not produce video or audio streams. It emits caption events with this JSON body:

| Field | Type | Description |
| --- | --- | --- |
| text | string | Transcribed text for this utterance |
| startMs | number | Start time (ms from job start) |
| endMs | number | End time (ms from job start) |
| speaker | string | Display label for the speaker |
| final | boolean | true when the utterance is committed; false for interim partials |

{
  "text": "Welcome to the show.",
  "startMs": 12400,
  "endMs": 15800,
  "speaker": "Alice",
  "final": true
}
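A consumer might decode this body as follows. This is a minimal sketch: the `Caption` class and `parse_caption` function are illustrative names, and treating a missing speaker as an empty string is an assumption, not documented behavior.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Caption:
    text: str
    start_ms: int
    end_ms: int
    speaker: str
    final: bool

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def parse_caption(body):
    """Decode one caption event body (str or bytes) into a Caption."""
    data = json.loads(body)
    return Caption(
        text=data["text"],
        start_ms=int(data["startMs"]),
        end_ms=int(data["endMs"]),
        # Assumption: fall back to "" if no speaker label is present.
        speaker=data.get("speaker", ""),
        final=bool(data["final"]),
    )


caption = parse_caption(
    '{"text": "Welcome to the show.", "startMs": 12400, "endMs": 15800,'
    ' "speaker": "Alice", "final": true}'
)
print(caption.speaker, caption.duration_ms)  # Alice 3400
```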

When audio passes through an audio_mixer, speaker is resolved from the active-speaker timeline. With per-track ASR (no mixer), it comes from the participant identity.

Providers may emit interim results (final: false) while a phrase is still being spoken. Each downstream sink decides whether to forward interim results or keep only finals; see the per-sink table below.
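A client that receives both kinds of event could handle them along these lines. The sketch assumes each new interim result supersedes the previous one for the same speaker, and that a final result commits the utterance. That is a common pattern but is not confirmed here for every provider. `CaptionBuffer` is a hypothetical name.

```python
class CaptionBuffer:
    """Keeps committed captions plus the latest interim partial per speaker."""

    def __init__(self):
        self.committed = []  # final caption events, in arrival order
        self.pending = {}    # speaker -> latest interim text

    def push(self, event):
        speaker = event.get("speaker", "")
        if event["final"]:
            # A final result commits the utterance and clears its partial.
            self.committed.append(event)
            self.pending.pop(speaker, None)
        else:
            # Assumption: a newer interim replaces the older one for that speaker.
            self.pending[speaker] = event["text"]


buf = CaptionBuffer()
buf.push({"text": "Welcome", "startMs": 12400, "endMs": 13000,
          "speaker": "Alice", "final": False})
buf.push({"text": "Welcome to the show.", "startMs": 12400, "endMs": 15800,
          "speaker": "Alice", "final": True})
print([c["text"] for c in buf.committed])  # ['Welcome to the show.']
print(buf.pending)  # {}
```

A UI would typically render `pending` as provisional text and `committed` as the stable transcript.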

Wire ASR into a sink’s inputs alongside your audio/video mixers. The same caption payload is encoded differently per sink:

| Sink | How captions are delivered |
| --- | --- |
| segment | HLS WebVTT subtitle track (.vtt); only final: true cues are written. Use captionShowSpeaker to include speaker names. |
| livekit | Reliable publishData with topic avflow.caption; body is the JSON above. |
| jitsi | Endpoint message { "type": "avflow.caption", … } with the same fields. |
| daily | App message { "type": "avflow.caption", … }. |
| agora | Data-stream message { "type": "avflow.caption", … }. |
| rtmp_push | FLV onCaption script tag; JSON body; interim and final both forwarded. |
| srt_pushMPEG-TS | ID3 private frame (owner avflow.caption). |
| whip_push | WHIP caption data channel (requires metadata sidecar). |

image does not accept caption input.
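To make the segment behavior concrete, the sketch below renders caption events as WebVTT cues and skips interim results. It is not the sink's actual implementation. Two assumptions are made: startMs and endMs map directly onto cue times, and the speaker appears as a plain "Name: " prefix, whereas the real captionShowSpeaker output may be formatted differently.

```python
def vtt_timestamp(ms):
    """Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def to_vtt(events, show_speaker=False):
    """Render final caption events as a WebVTT document; interims are skipped."""
    lines = ["WEBVTT", ""]
    for ev in events:
        if not ev["final"]:
            continue
        start, end = vtt_timestamp(ev["startMs"]), vtt_timestamp(ev["endMs"])
        lines.append(f"{start} --> {end}")
        # Assumption: speaker rendered as a "Name: " prefix.
        if show_speaker and ev.get("speaker"):
            lines.append(f"{ev['speaker']}: {ev['text']}")
        else:
            lines.append(ev["text"])
        lines.append("")
    return "\n".join(lines)


events = [
    {"text": "Welcome", "startMs": 12400, "endMs": 13000,
     "speaker": "Alice", "final": False},
    {"text": "Welcome to the show.", "startMs": 12400, "endMs": 15800,
     "speaker": "Alice", "final": True},
]
print(to_vtt(events, show_speaker=True))
```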

room_src ──► audio_mixer ──► segment
                  └──► asr ──┘
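The wiring in the diagram can be sketched as a node list and checked before submission. Only the asr node's name, type, and inputs fields come from the example at the top of this page. The other node names, the flat list layout, and both helper functions are assumptions for illustration, and node config is omitted for brevity.

```python
# Hypothetical node list mirroring the diagram.
nodes = [
    {"name": "room", "type": "room_src", "inputs": []},
    {"name": "mix_audio", "type": "audio_mixer", "inputs": ["room"]},
    {"name": "captions", "type": "asr", "inputs": ["mix_audio"]},
    {"name": "hls", "type": "segment", "inputs": ["mix_audio", "captions"]},
]


def dangling_inputs(nodes):
    """Return (node, input) pairs whose input names no defined node."""
    names = {n["name"] for n in nodes}
    return [(n["name"], i) for n in nodes for i in n["inputs"] if i not in names]


def caption_sinks(nodes):
    """Return names of nodes that take an asr node as an input."""
    asr_names = {n["name"] for n in nodes if n["type"] == "asr"}
    return [n["name"] for n in nodes if asr_names & set(n["inputs"])]


print(dangling_inputs(nodes))  # []
print(caption_sinks(nodes))    # ['hls']
```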

Billed at $0.012/min per speaker of actual speech (silence is not billed). See Node pricing.
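For a back-of-the-envelope estimate, the rate can be applied to each speaker's seconds of actual speech. This sketch assumes simple proportional billing. Rounding rules and minimum billing increments are not stated here, so real invoices may differ slightly. `estimate_asr_cost` is an illustrative name.

```python
from decimal import Decimal

RATE_PER_SPEECH_MINUTE = Decimal("0.012")  # USD, from the pricing line above


def estimate_asr_cost(speech_seconds_by_speaker):
    """Estimate ASR cost in USD from seconds of actual speech per speaker."""
    total_seconds = sum(Decimal(s) for s in speech_seconds_by_speaker.values())
    return total_seconds / Decimal(60) * RATE_PER_SPEECH_MINUTE


# Two speakers with 30 and 20 minutes of speech: 50 min * $0.012 = $0.60
cost = estimate_asr_cost({"Alice": 1800, "Bob": 1200})
print(f"${cost:.2f}")  # $0.60
```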