LiveTranslate client events

Events that the client sends to the server over WebSocket.

Connect

Open a WebSocket connection to start a session. Once the connection is ready, the server sends a session.created event.

Setting	Value
Endpoint	`wss://dashscope.aliyuncs.com/api-ws/v1/realtime`
Query parameter	`model=qwen3.5-livetranslate-flash-realtime`
Auth header	`Authorization: Bearer $DASHSCOPE_API_KEY`
Protocol	JSON text frames

Full URL:

wss://dashscope.aliyuncs.com/api-ws/v1/realtime?model=qwen3.5-livetranslate-flash-realtime
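
As a minimal sketch of how the settings above fit together, the following Python assembles the URL and the auth header. The helper name build_connect_request and the placeholder key are illustrative assumptions, and no connection is opened here; the returned values would be handed to whichever WebSocket client library you use.

```python
from urllib.parse import urlencode

ENDPOINT = "wss://dashscope.aliyuncs.com/api-ws/v1/realtime"
MODEL = "qwen3.5-livetranslate-flash-realtime"


def build_connect_request(api_key: str, model: str = MODEL) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and the handshake headers (hypothetical helper)."""
    # The model is passed as a query parameter on the endpoint.
    url = f"{ENDPOINT}?{urlencode({'model': model})}"
    # The API key travels in a standard Bearer authorization header.
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


# "sk-example" is a placeholder, not a real key.
url, headers = build_connect_request("sk-example")
print(url)
```

In real use, read the key from the DASHSCOPE_API_KEY environment variable rather than hard-coding it.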

session.update

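Updates the session configuration after the connection is established. The server validates the parameters and returns the full configuration; if a parameter is invalid, it returns an error.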

Example

{
  "event_id": "event_ToPZqeobitzUJnt3QqtWg",
  "type": "session.update",
  "session": {
    "modalities": [
      "text",
      "audio"
    ],
    "voice": "Tina",
    "sample_rate": 16000,
    "input_audio_format": "pcm",
    "output_audio_format": "pcm",
    "input_audio_transcription": {
      "model": "qwen3-asr-flash-realtime",
      "language": "zh"
    },
    "translation": {
      "language": "en"
    }
  }
}

Example with voice cloning enabled (frequency=once):

Example (voice clone)

{
  "event_id": "event_ToPZqeobitzUJnt3QqtWg",
  "type": "session.update",
  "session": {
    "modalities": [
      "text",
      "audio"
    ],
    "voice": "default",
    "enable_voice_clone": true,
    "voice_clone_options": {
      "frequency": "once"
    },
    "sample_rate": 16000,
    "input_audio_format": "pcm",
    "output_audio_format": "pcm",
    "translation": {
      "language": "en"
    }
  }
}

type (string, body, required)

Fixed to "session.update".

session (object, body)

Session configuration.

session.modalities (array, body)

Output types. Valid values:

["text"]: text only.
["text", "audio"] (default): text and audio.

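session.voice (string, body)

Voice used for the generated audio. When voice cloning is not enabled, set it to a system preset voice; see the supported voices. The default voice is Tina for Qwen3.5-LiveTranslate-Flash-Realtime and Cherry for Qwen3-LiveTranslate-Flash-Realtime. When voice cloning is enabled (enable_voice_clone is true), the value of voice depends on frequency: when frequency is once or always, voice must be set to default; when frequency is never, set voice to the ID of a voice the user cloned in advance.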

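session.enable_voice_clone (boolean, body)

Whether to enable voice cloning. Default: false. When enabled, the model clones the voice from the input audio and uses it for the translated output. In that case voice no longer takes a system preset voice and must be set to default or to the ID of a voice the user cloned in advance.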

session.voice_clone_options (object, body)

Voice cloning control parameters. Takes effect only when enable_voice_clone is true.

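session.voice_clone_options.frequency (string, body)

How often the voice is cloned. Valid values:

never: no cloning is performed on the server; a voice the user cloned in advance is used. In this case voice must be set to the ID of the user's cloned voice.
once: the voice is cloned once from the input audio at the start of the session, and later output reuses it. Suited to single-speaker talks. In this case voice must be set to default.
always: the voice is cloned in real time from the input audio before each output, so it follows changes in the input. Suited to multi-speaker conversations. In this case voice must be set to default.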
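
The rules that tie voice to frequency are easy to get wrong, so here is a hedged sketch of a client-side builder that applies them before sending session.update. The function name, its parameters, and the event_id suffix format are assumptions made for illustration; only the field names and allowed values come from this page.

```python
import uuid


def build_session_update(target_language, voice="Tina",
                         clone_frequency=None, cloned_voice_id=None):
    """Build a session.update event dict (hypothetical helper)."""
    session = {
        "modalities": ["text", "audio"],
        "sample_rate": 16000,
        "input_audio_format": "pcm",
        "output_audio_format": "pcm",
        "translation": {"language": target_language},
    }
    if clone_frequency is None:
        # Voice cloning disabled: use a system preset voice.
        session["voice"] = voice
    else:
        if clone_frequency not in ("never", "once", "always"):
            raise ValueError(f"unsupported frequency: {clone_frequency!r}")
        if clone_frequency == "never":
            # "never" relies on a voice cloned in advance, so its ID is required.
            if not cloned_voice_id:
                raise ValueError("frequency 'never' needs a pre-cloned voice ID")
            session["voice"] = cloned_voice_id
        else:
            # "once" and "always" clone on the server; voice must be "default".
            session["voice"] = "default"
        session["enable_voice_clone"] = True
        session["voice_clone_options"] = {"frequency": clone_frequency}
    return {
        # The "event_" prefix mirrors the examples; the suffix format is assumed.
        "event_id": "event_" + uuid.uuid4().hex[:20],
        "type": "session.update",
        "session": session,
    }


event = build_session_update("en", clone_frequency="once")
print(event["session"]["voice"])
```

Serialize the returned dict with json.dumps before sending it as a text frame.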

session.sample_rate (integer, body)

Sample rate of the input audio, in Hz. Valid values: 8000, 16000 (default), 24000, 48000.

session.input_audio_transcription (object, body)

Input audio settings.

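session.input_audio_transcription.model (string, body)

Speech recognition model. When set, the server returns the source-language text through the conversation.item.input_audio_transcription.text and conversation.item.input_audio_transcription.completed events. Valid value: qwen3-asr-flash-realtime.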

session.input_audio_transcription.language (string, body)

Source language. See the supported languages. Default: en.

session.input_audio_format (string, body)

Input audio format. Valid values: pcm (default), g711_alaw.

session.output_audio_format (string, body)

Output audio format. Currently only pcm is supported.

session.translation (object, body)

Translation settings.

session.translation.language (string, body)

Target language. See the supported languages. Default: en.

object, body

Hot-word configuration, used to improve translation accuracy for specific terms.

object, body

Hot-word mapping table. Each key is a source-language term and each value is its translation in the target language. Example: {"人工智能": "Artificial Intelligence"}.

input_audio_buffer.append

Appends audio data to the input buffer. The server uses this buffer to detect speech and to decide when to commit.

Example

{
  "event_id": "event_xxx",
  "type": "input_audio_buffer.append",
  "audio": "xxx"
}

type (string, body, required)

Fixed to "input_audio_buffer.append".

audio (string, body, required)

Base64-encoded audio data.
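
A minimal sketch of turning raw PCM into a stream of input_audio_buffer.append frames follows. It assumes 16-bit mono samples and 100 ms chunks, neither of which is specified on this page, and the sequential event_id scheme is likewise only illustrative.

```python
import base64
import json


def audio_append_events(pcm: bytes, sample_rate: int = 16000,
                        chunk_ms: int = 100, bytes_per_sample: int = 2):
    """Yield JSON text frames, one per audio chunk (hypothetical helper)."""
    # Bytes per chunk for mono audio: rate * sample width * duration.
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for offset in range(0, len(pcm), chunk_size):
        chunk = pcm[offset:offset + chunk_size]
        yield json.dumps({
            # Sequential IDs are an assumption; any unique string should do.
            "event_id": f"event_{offset // chunk_size}",
            "type": "input_audio_buffer.append",
            # The audio field carries the Base64-encoded bytes.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# One second of silence at 16 kHz, 16-bit mono, as stand-in input.
silence = bytes(16000 * 2)
frames = list(audio_append_events(silence))
print(len(frames))
```

Each yielded string can be sent as one WebSocket text frame, paced at roughly real time for a live source.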

input_image_buffer.append

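Adds image data from a local file or a live video stream to the buffer. Image constraints:

Format: JPG or JPEG. Recommended resolution: 480p or 720p. Maximum: 1080p.
Maximum file size: 500 KB (before Base64 encoding).
Must be Base64-encoded.
Maximum rate: 2 images per second.
Send at least one input_audio_buffer.append event first so that the server has audio context.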

Example

{
  "event_id": "event_xxx",
  "type": "input_image_buffer.append",
  "image": "xxx"
}

type (string, body, required)

Fixed to "input_image_buffer.append".

image (string, body, required)

Base64-encoded image data.
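
The size and rate limits for images can be enforced on the client before a frame is sent. The sketch below is one way to do that; the ImageSender class is hypothetical, the 500 KB limit is read conservatively as 500,000 bytes, and the JPEG check only looks at the two-byte start-of-image marker. Resolution is not checked.

```python
import base64
import json

MAX_IMAGE_BYTES = 500 * 1000  # assumption: 500 KB read as 500,000 bytes
MIN_INTERVAL_S = 0.5          # at most 2 images per second


class ImageSender:
    """Builds input_image_buffer.append frames within the stated limits."""

    def __init__(self):
        self._last_sent = None
        self._count = 0

    def build_event(self, jpeg: bytes, now: float):
        """Return a JSON frame, or None if sending now would exceed the rate."""
        # Every JPEG file starts with the SOI marker FF D8.
        if not jpeg.startswith(b"\xff\xd8"):
            raise ValueError("image is not a JPEG")
        # The size limit applies to the raw bytes, before Base64 encoding.
        if len(jpeg) > MAX_IMAGE_BYTES:
            raise ValueError("image exceeds the size limit")
        if self._last_sent is not None and now - self._last_sent < MIN_INTERVAL_S:
            return None  # drop this frame to stay within the rate limit
        self._last_sent = now
        self._count += 1
        return json.dumps({
            "event_id": f"event_img_{self._count}",  # illustrative ID scheme
            "type": "input_image_buffer.append",
            "image": base64.b64encode(jpeg).decode("ascii"),
        })


sender = ImageSender()
fake_jpeg = b"\xff\xd8" + bytes(16)  # stand-in bytes, not a decodable image
sent = sender.build_event(fake_jpeg, now=0.0)
skipped = sender.build_event(fake_jpeg, now=0.2)
```

In a live pipeline, pass time.monotonic() as now; the explicit argument here just keeps the example deterministic.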

session.finish

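Ends the session. The server responds differently depending on whether speech was detected:

Speech detected: the server finishes recognition, first sends conversation.item.input_audio_transcription.completed with the result, then sends session.finished.
No speech detected: the server sends session.finished directly.

Disconnect after receiving session.finished.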

Example

{
  "event_id": "event_xxx",
  "type": "session.finish"
}

type (string, body, required)

Fixed to "session.finish".
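
The shutdown sequence described above can be sketched as a small loop that keeps reading until session.finished arrives. The function names are hypothetical, and the code assumes that server events carry a type field in the same way client events do. The messages are simulated here so the example runs without a connection.

```python
import json

COMPLETED = "conversation.item.input_audio_transcription.completed"


def finish_event(event_id: str = "event_finish") -> str:
    """Return the session.finish frame as JSON text."""
    return json.dumps({"event_id": event_id, "type": "session.finish"})


def drain_until_finished(messages):
    """Read server frames after session.finish; return final transcriptions."""
    results = []
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == COMPLETED:
            # Sent only when speech was detected; keep the whole event.
            results.append(event)
        elif kind == "session.finished":
            # Safe to disconnect once this arrives.
            return results
    raise ConnectionError("stream ended before session.finished")


# Simulated server frames for the "speech detected" case.
simulated = [
    json.dumps({"type": COMPLETED}),
    json.dumps({"type": "session.finished"}),
]
final = drain_until_finished(simulated)
print(len(final))
```

With a real client, messages would be the iterator of incoming text frames, and the socket would be closed after the function returns.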
