asr

Transcribes mixed or per-track audio into timed captions. Wire the ASR node's output into any sink that accepts captions, such as `segment`, `rtmp_push`, or the RTC sinks.

Example

{
  "name": "captions",
  "type": "asr",
  "inputs": ["mix_audio"],
  "config": {
    "provider": "deepgram",
    "config": {
      "apiKey": "<deepgram-api-key>",
      "language": "en-US",
      "model": "nova-3"
    }
  }
}

Providers

Provider	`provider` value
Deepgram	`deepgram`
OpenAI	`openai`
Google Cloud STT	`google`
ElevenLabs Scribe	`elevenlabs`
Azure Speech	`azure`
Volcengine	`volcengine`

Each provider has its own config object (API keys, language, model). See your provider’s docs for credentials.

Output format

ASR does not produce video or audio streams. It emits caption events with this JSON body:

Field	Type	Description
`text`	string	Transcribed text for this utterance
`startMs`	number	Start time (ms from job start)
`endMs`	number	End time (ms from job start)
`speaker`	string	Display label for the speaker
`final`	boolean	`true` when the utterance is committed; `false` for interim partials

{
  "text": "Welcome to the show.",
  "startMs": 12400,
  "endMs": 15800,
  "speaker": "Alice",
  "final": true
}

When audio passes through an `audio_mixer`, `speaker` is resolved from the active-speaker timeline. With per-track ASR (no mixer), it comes from the participant identity.

Providers may emit interim results (`final: false`) while a phrase is still being spoken. Each downstream sink decides whether to forward interim results or keep only finals; see Downstream delivery.
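A consumer that only wants committed text can drop interim partials by checking the `final` flag. The following is a minimal sketch, assuming each caption event arrives as a JSON string with the fields listed above; the sample events and the `final_captions` helper are hypothetical and not part of the product.

```python
import json

def final_captions(raw_events):
    """Yield parsed caption events whose `final` flag is true."""
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("final") is True:
            yield event

# Hypothetical sample stream: two interim partials, then the committed utterance.
sample = [
    '{"text": "Welcome", "startMs": 12400, "endMs": 13100, "speaker": "Alice", "final": false}',
    '{"text": "Welcome to the", "startMs": 12400, "endMs": 14600, "speaker": "Alice", "final": false}',
    '{"text": "Welcome to the show.", "startMs": 12400, "endMs": 15800, "speaker": "Alice", "final": true}',
]

finals = list(final_captions(sample))
```

Here `finals` holds only the last event. A live-caption overlay would more likely render the partials too and replace them once the final arrives.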

Downstream delivery

Wire ASR into a sink’s inputs alongside your audio/video mixers. The same caption payload is encoded differently per sink:

Sink	How captions are delivered
`segment`	HLS WebVTT subtitle track (`.vtt`); only `final: true` cues are written. Use `captionShowSpeaker` to include speaker names.
`livekit`	Reliable `publishData` with topic `avflow.caption`; body is the JSON above.
`jitsi`	Endpoint message `{ "type": "avflow.caption", … }` with the same fields.
`daily`	App message `{ "type": "avflow.caption", … }`.
`agora`	Data-stream message `{ "type": "avflow.caption", … }`.
`rtmp_push`	FLV `onCaption` script tag; JSON body; interim and final both forwarded.
`srt_push`	MPEG-TS ID3 private frame (owner `avflow.caption`).
`whip_push`	WHIP caption data channel (requires metadata sidecar).

`image` does not accept caption input.
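To illustrate how one caption payload can map onto a subtitle track, the sketch below converts a caption event into a WebVTT cue and skips interim partials, as the `segment` row describes. It is an approximation under stated assumptions, not the sink's implementation: the helper names are hypothetical, the `Name: ` speaker prefix is a guess at what `captionShowSpeaker` produces, and timestamps are taken directly from `startMs` and `endMs` without any offset to the HLS timeline.

```python
def ms_to_vtt(ms):
    """Format milliseconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"

def caption_to_cue(event, show_speaker=False):
    """Return a WebVTT cue for a final caption, or None for an interim partial."""
    if not event["final"]:
        return None
    text = event["text"]
    if show_speaker and event.get("speaker"):
        # Assumption: a plain "Name: " prefix; the sink's real layout is not documented here.
        text = f'{event["speaker"]}: {text}'
    start = ms_to_vtt(event["startMs"])
    end = ms_to_vtt(event["endMs"])
    return f"{start} --> {end}\n{text}"

event = {
    "text": "Welcome to the show.",
    "startMs": 12400,
    "endMs": 15800,
    "speaker": "Alice",
    "final": True,
}
cue = caption_to_cue(event, show_speaker=True)
```

For the sample event, `cue` is the timing line `00:00:12.400 --> 00:00:15.800` followed by `Alice: Welcome to the show.` on the next line.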

Wiring

room_src ──► audio_mixer ──► segment
                         └──► asr ──┘
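The diagram above could be written as node definitions in the same shape as the example at the top of this page. This is a sketch only: the flat list of nodes, the empty `config` placeholders for `room_src` and `audio_mixer`, the node names `room` and `hls`, and the placement of `captionShowSpeaker` inside the `segment` config are all assumptions. The `unresolved_inputs` helper is hypothetical and simply checks that every input refers to a defined node.

```python
nodes = [
    {"name": "room", "type": "room_src", "inputs": [], "config": {}},
    {"name": "mix_audio", "type": "audio_mixer", "inputs": ["room"], "config": {}},
    {
        "name": "captions",
        "type": "asr",
        "inputs": ["mix_audio"],
        "config": {
            "provider": "deepgram",
            "config": {
                "apiKey": "<deepgram-api-key>",
                "language": "en-US",
                "model": "nova-3",
            },
        },
    },
    {
        "name": "hls",
        "type": "segment",
        # The sink takes the mixed audio and the caption events side by side.
        "inputs": ["mix_audio", "captions"],
        "config": {"captionShowSpeaker": True},  # assumed location of this option
    },
]

def unresolved_inputs(nodes):
    """Return (node name, input name) pairs whose input is not a defined node."""
    names = {node["name"] for node in nodes}
    return [
        (node["name"], inp)
        for node in nodes
        for inp in node["inputs"]
        if inp not in names
    ]

missing = unresolved_inputs(nodes)
```

An empty `missing` list means every edge in the diagram has a node at both ends.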

Pricing

$0.012 per minute of actual speech, per speaker; silence is not billed. See Node pricing.
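As a worked example of the rate above, the sketch below estimates the ASR charge from per-speaker speech time. The speaker names and minute counts are made up, and how the service rounds or meters partial minutes is not stated here, so treat the result as an estimate.

```python
from decimal import Decimal

RATE_PER_SPEAKER_MINUTE = Decimal("0.012")  # USD, the rate quoted above

def asr_cost(speech_minutes_by_speaker):
    """Estimate cost from minutes of actual speech per speaker (silence excluded)."""
    total_minutes = sum(
        (Decimal(str(m)) for m in speech_minutes_by_speaker.values()),
        Decimal("0"),
    )
    return RATE_PER_SPEAKER_MINUTE * total_minutes

# Hypothetical session: Alice spoke for 18.5 minutes, Bob for 6.25 minutes.
cost = asr_cost({"Alice": 18.5, "Bob": 6.25})
```

With 24.75 minutes of billable speech in total, `cost` comes to about $0.297, regardless of how long the session itself ran.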