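Audio understanding - Qwen Cloud

Qwen3-Omni-Captioner is an open-source model built on Qwen3-Omni. Without any prompt, it generates descriptions of complex audio, covering speech, ambient sound, music, and sound effects. The model can identify elements such as the speaker's emotion, the musical style, and the instruments, as well as sensitive information.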

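Available models

Model	Context window (tokens)	Max input (tokens)	Max output (tokens)	Input cost	Output cost	Free quota (note)
qwen3-omni-30b-a3b-captioner	65,536	32,768	32,768	15.8 yuan	12.7 yuan	1 million tokens, valid for 90 days after activating Qwen Cloud

Token conversion rule for audio: total tokens = audio duration (seconds) × 12.5. Audio shorter than 1 second is counted as 1 second.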
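
As a rough illustration of the conversion rule above, the sketch below estimates the audio token count from a duration in seconds. The function name `estimate_audio_tokens` is hypothetical, and how the service rounds fractional results is not stated here, so the value is returned unrounded.

```python
def estimate_audio_tokens(duration_seconds: float) -> float:
    """Estimate audio tokens as duration (s) x 12.5, counting at least 1 second."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Audio shorter than 1 second is counted as 1 second.
    billable_seconds = max(duration_seconds, 1.0)
    # Rounding of fractional token counts is not specified, so none is applied.
    return billable_seconds * 12.5


print(estimate_audio_tokens(0.4))  # 12.5
print(estimate_audio_tokens(60))   # 750.0
```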

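Quick start

Prerequisites

Get an API key and export it as an environment variable (the examples below read DASHSCOPE_API_KEY).
If you call the model through an SDK, install the latest version of the SDK.

Qwen3-Omni-Captioner supports API calls only; the online model trial is not yet available. The following examples analyze online audio specified by a URL. To use a local file, see Pass local files. For file requirements, see Limitations.

OpenAI compatible: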

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
            "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
          }
        }
      ]
    }
  ]
)
print(completion.choices[0].message.content)

Full JSON response

{
  "choices": [
    {
      "message": {
        "content": "The audio clip is a brief, low-fidelity recording-approximately six seconds long-captured in a small, reverberant indoor space, likely a home office or bedroom. It opens with a rapid, metallic, rhythmic hammering sound, repeating every 0.5 to 0.6 seconds, with each strike slightly uneven and accompanied by a short echo. This sound dominates the left side of the stereo field and is close to the microphone, suggesting the hammering is occurring nearby and slightly to the left.\n\nOverlaid with the hammering, a single male voice speaks in Mandarin Chinese, his tone clearly one of frustration and exasperation. He says, \"Oh, with this, how am I supposed to work quietly?\" His speech is clear despite the poor audio quality, and is delivered in a standard, unaccented Mandarin, indicative of a native speaker from northern or central China.\n\nThe voice is more distant and centered in the stereo field, with more room reverberation than the hammering. The emotional content is palpable: his voice rises slightly at the end, turning the phrase into a rhetorical complaint, underscoring his irritation. No other voices, music, or ambient sounds are present; the only non-speech sounds are the hammering and the faint hiss of the recording device.\n\nThe combination of the environmental sound, the speaker's language, and his tone strongly suggests a scenario of home office disruption-perhaps someone working from home is being disturbed by renovation or repair work happening nearby. The recording ends abruptly, mid-hammer, further emphasizing the spontaneous and candid nature of the capture.\n\nIn summary, the audio is a realistic, low-fidelity snapshot of a Mandarin-speaking man, likely in China, expressing frustration at being unable to work in peace due to nearby construction or repair activity, captured in a personal, indoor setting.",
        "role": "assistant"
      },
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null
    }
  ],
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 160,
    "completion_tokens": 387,
    "total_tokens": 547,
    "prompt_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "completion_tokens_details": {
      "text_tokens": 387
    }
  },
  "created": 1758002134,
  "system_fingerprint": null,
  "model": "qwen3-omni-30b-a3b-captioner",
  "id": "chatcmpl-f4155bf9-b860-49d6-8ee2-092da7359097"
}

DashScope:

import dashscope
import os

dashscope.base_http_api_url="https://dashscope.aliyuncs.com/api/v1"

messages = [
  {
    "role": "user",
    "content": [
      {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
    ]
  }
]

response = dashscope.MultiModalConversation.call(
  # If the environment variable is not configured, replace the next line with: api_key="sk-xxx",
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model="qwen3-omni-30b-a3b-captioner",
  messages=messages
)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Full JSON response

{
  "output":{
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "The audio clip is a 6-second, high-fidelity recording set in a quiet, indoor environment. The primary sound is a male speaker, likely in his late teens to mid-20s, speaking Mandarin Chinese in a tone of mild exasperation. His speech is clear and natural, delivered in a conversational manner: \"Oh, how can I possibly work quietly like this?\" His voice is close to the microphone, and the room is acoustically neutral, with no noticeable echo or background noise, suggesting a small, well-furnished space.\n\nOverlaying the speech is a persistent, rhythmic mechanical sound-a series of sharp, metallic clicks or clatters that repeat every 0.6 seconds. The sound is dry and lacks any reverberation, further supporting the inference that it is produced by a mechanical device very close to the microphone. The regularity and timbre of the sound suggest a small, metallic object (such as a key, coin, or pen) being repeatedly tapped or struck on a hard surface, rather than a larger or more complex machine.\n\nThe speaker's complaint is a direct response to the mechanical noise, expressing frustration at being unable to concentrate or work in peace due to the disturbance. The tone is not angry or urgent, but rather one of resigned annoyance, typical of someone encountering a minor, persistent annoyance in a personal or domestic setting.\n\nThere are no other voices, music, or environmental cues present. The overall impression is of a brief, candid moment-perhaps a student, office worker, or someone in a quiet home environment-caught on microphone while complaining (to themselves or a nearby companion) about a distracting, repetitive noise. The recording is technically clean and focused, with all attention on the speaker and the mechanical sound, making it highly plausible that the clip was captured intentionally, possibly for a voice note, social media post, or as a sample for a sound effect library."
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "input_tokens_details": {
      "audio_tokens": 152,
      "text_tokens": 8
    },
    "total_tokens": 559,
    "output_tokens": 399,
    "input_tokens": 160,
    "output_tokens_details": {
      "text_tokens": 399
    }
  },
  "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958"
}
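
The response above can be unpacked as in the following sketch, which reads only fields that appear in the sample JSON. `summarize_response` is a hypothetical helper, it assumes the response is available as a plain dict, and the sample below is abbreviated from the response shown above.

```python
def summarize_response(response: dict) -> dict:
    """Pull the caption text and token counts out of a DashScope-style response dict."""
    message = response["output"]["choices"][0]["message"]
    # The content field is a list of parts; join the text of every part that has one.
    caption = "".join(part.get("text", "") for part in message["content"])
    usage = response["usage"]
    return {
        "caption": caption,
        "audio_tokens": usage["input_tokens_details"]["audio_tokens"],
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "total_tokens": usage["total_tokens"],
    }


# Abbreviated copy of the sample response; the caption text is shortened here.
sample = {
    "output": {"choices": [{"finish_reason": "stop", "message": {
        "role": "assistant",
        "content": [{"text": "The audio clip is a 6-second recording..."}],
    }}]},
    "usage": {
        "input_tokens_details": {"audio_tokens": 152, "text_tokens": 8},
        "total_tokens": 559, "output_tokens": 399, "input_tokens": 160,
        "output_tokens_details": {"text_tokens": 399},
    },
    "request_id": "d532f72c-e75b-4ffb-a1ef-d2465e758958",
}
summary = summarize_response(sample)
print(summary["caption"])
print(summary["total_tokens"])
```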

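How it works

Single-turn interaction: the model does not support multi-turn conversations. Each request is an independent analysis task.
Fixed task: the model's core task is to generate an English description of the audio. You cannot change its behavior through instructions (such as a system message), for example to control the output format or the focus of the content.
Audio-only input: the model accepts only audio as input, and no text prompt is needed. The format of the messages parameter is fixed.

messages format examples

OpenAI compatible: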

messages = [
  {
    "role": "user",
    "content": [
      {
        "type": "input_audio",
        "input_audio": {
          "data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"
        }
      }
    ]
  }
]

DashScope:

messages = [
  {
    "role": "user",
    "content": [
      {"audio": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20240916/xvappi/%E8%A3%85%E4%BF%AE%E5%99%AA%E9%9F%B3.wav"}
    ]
  }
]
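
Because the message format is fixed, it can be produced by a small helper. The sketch below builds the two shapes shown above from a single audio reference; `build_messages` and its `sdk` argument are hypothetical names introduced for illustration.

```python
def build_messages(audio: str, sdk: str = "openai") -> list:
    """Build the fixed single-turn, audio-only messages list for either SDK."""
    if sdk == "openai":
        # OpenAI-compatible shape: an input_audio content part.
        content = [{"type": "input_audio", "input_audio": {"data": audio}}]
    elif sdk == "dashscope":
        # DashScope shape: a content part with an "audio" key.
        content = [{"audio": audio}]
    else:
        raise ValueError("sdk must be 'openai' or 'dashscope'")
    # One user message with exactly one audio part; no text prompt is added.
    return [{"role": "user", "content": content}]


print(build_messages("<audio-url>", sdk="dashscope"))
```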

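Streaming output

For the general concepts of streaming output (the SSE protocol, how to enable streaming, billing, and token usage), see Streaming output. This section covers only the streaming behavior specific to audio understanding.

Add stream: true to the call to enable streaming output (stream=True in the Python SDK). Streaming behaves exactly as it does for standard text streaming; the only difference is that the input message is audio rather than text. Use the message format from Quick start and add the streaming parameters: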

completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[{
    "role": "user",
    "content": [{"type": "input_audio", "input_audio": {"data": "<audio-url>"}}]
  }],
  stream=True,
  stream_options={"include_usage": True},
)
for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")
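
If the full caption and the token usage are both needed after streaming, the chunks can be accumulated as in this sketch. It assumes the compatible endpoint follows the OpenAI convention that, with include_usage enabled, usage arrives on a final chunk whose choices list is empty. `collect_stream` and `_fake_chunk` are hypothetical names, and the stand-in chunks only mimic the attributes the loop reads.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Concatenate streamed text deltas and keep the last usage object seen."""
    parts, usage = [], None
    for chunk in chunks:
        # Content chunks carry a text delta; the usage chunk has no choices.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage


def _fake_chunk(text=None, usage=None):
    """Stand-in for an SDK chunk, used only to exercise collect_stream offline."""
    choices = [] if text is None else [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    _fake_chunk("The audio "),
    _fake_chunk("clip..."),
    _fake_chunk(usage=SimpleNamespace(total_tokens=547)),
]
text, usage = collect_stream(fake_stream)
print(text)
print(usage.total_tokens)
```

In a real call, the `completion` iterator from the example above would be passed in place of `fake_stream`.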

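Pass local files (Base64 encoding or file path)

The model supports two ways to upload a local file:

Upload as Base64 encoding
Pass the file path directly (recommended; transmission is more stable)

Pass by file path: pass the file path directly to the model. This method is supported only by the DashScope Python and Java SDKs; HTTP calls are not supported. Use the table below to specify the file path for your programming language and operating system.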

Specify the file path

System	SDK	Input file path	Example
Linux or macOS	Python SDK	`file://<absolute file path>`	`file:///home/images/test.mp3`
Linux or macOS	Java SDK	`file://<absolute file path>`	`file:///home/images/test.mp3`
Windows	Python SDK	`file://<absolute file path>`	`file://D:/images/test.mp3`
Windows	Java SDK	`file:///<absolute file path>`	`file:///D:/images/test.mp3`
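
The table can be expressed as a small helper, sketched below. `to_file_uri` and its arguments are hypothetical names; the function only prepends the prefix listed in the table and assumes the path already uses forward slashes, as in the examples.

```python
def to_file_uri(absolute_path: str, sdk: str, system: str) -> str:
    """Prepend the file prefix from the table for the given SDK and system."""
    sdk, system = sdk.lower(), system.lower()
    if sdk not in ("python", "java"):
        raise ValueError("file paths are supported only by the Python and Java SDKs")
    if system == "windows":
        # On Windows the Java SDK takes three slashes; the Python SDK takes two.
        prefix = "file:///" if sdk == "java" else "file://"
    elif system in ("linux", "macos"):
        # The absolute path already starts with "/", giving file:///home/...
        prefix = "file://"
    else:
        raise ValueError("system must be 'linux', 'macos', or 'windows'")
    return prefix + absolute_path


print(to_file_uri("/home/images/test.mp3", "python", "linux"))
print(to_file_uri("D:/images/test.mp3", "java", "windows"))
```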

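Pass by Base64 encoding: convert the file to a Base64-encoded string and pass it to the model.

Steps for passing a Base64-encoded string

Encode the file

Convert the local audio file to a Base64 string.

Example: convert an audio file to a Base64 string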

import base64

# Encoding function: convert a local file to a Base64-encoded string
def encode_audio(audio_path):
  with open(audio_path, "rb") as audio_file:
    return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace xxxx/test.mp3 with the absolute path of your local audio file
base64_audio = encode_audio("xxxx/test.mp3")

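Construct a Data URL

Construct the Data URL in the following format: data:;base64,{base64_audio}, where base64_audio is the Base64 string generated in the previous step.

Call the model

Pass the Data URL through the audio parameter (DashScope SDK) or the input_audio parameter (OpenAI SDK).

Limits:

Passing the file path directly is recommended because transmission is more stable. Base64 encoding can also be used for files smaller than 1 MB.
When passing the file path directly, the audio file must not exceed 10 MB.
When passing Base64 encoding, the encoded string must not exceed 10 MB. Base64 encoding increases the data size.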
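
To make the size increase concrete: padded Base64 turns every 3 input bytes into 4 output characters, so the encoded string is about one third larger than the file. The sketch below checks a raw file size against the limit; the helper names are hypothetical, "10 MB" is read here as 10 × 1024 × 1024 bytes (an assumption), and the short data:;base64, prefix is not counted.

```python
import base64

LIMIT_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" interpreted as 10 MiB


def base64_length(raw_size: int) -> int:
    """Length of the padded Base64 string for raw_size bytes: 4 * ceil(n / 3)."""
    return 4 * ((raw_size + 2) // 3)


def fits_base64_limit(raw_size: int, limit: int = LIMIT_BYTES) -> bool:
    """True if a file of raw_size bytes stays within the limit once encoded."""
    return base64_length(raw_size) <= limit


# Cross-check the formula against the standard library on a small payload.
assert base64_length(1000) == len(base64.b64encode(b"x" * 1000))

print(base64_length(1_000_000))        # 1333336
print(fits_base64_limit(7_864_320))    # True
print(fits_base64_limit(7_864_321))    # False
```

Under these assumptions, a file of roughly 7.5 MiB is the largest that still fits once encoded.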

Code examples

Pass by file path (supported only by the DashScope Python and Java SDKs; HTTP calls are not supported):

import dashscope
import os

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
# The full local path must carry the file:// prefix, for example: file:///home/images/test.mp3
audio_file_path = "file://ABSOLUTE_PATH/welcome.mp3"
messages = [
  {
    "role": "user",
    # Pass the file path, including the file:// prefix, in the audio parameter.
    "content": [{"audio": audio_file_path}],
  }
]

response = dashscope.MultiModalConversation.call(
  # If the environment variable is not configured, replace the next line with: api_key="sk-xxx"
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model="qwen3-omni-30b-a3b-captioner",
  messages=messages)

print("Output:")
print(response["output"]["choices"][0]["message"].content[0]["text"])

Pass by Base64 encoding (OpenAI compatible):

import os
from openai import OpenAI
import base64

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

def encode_audio(audio_path):
  with open(audio_path, "rb") as audio_file:
    return base64.b64encode(audio_file.read()).decode("utf-8")


# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

completion = client.chat.completions.create(
  model="qwen3-omni-30b-a3b-captioner",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "input_audio",
          "input_audio": {
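            # When passing a local file as Base64, the data: prefix is required so that the file URL is valid.
            # The "base64" keyword must precede the Base64-encoded data (base64_audio); otherwise an error is returned.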
            "data": f"data:;base64,{base64_audio}"
          },
        }
      ],
    },
  ]
)
print(completion.choices[0].message.content)

Pass by Base64 encoding (DashScope):

import os
import base64
import dashscope

dashscope.base_http_api_url="https://dashscope.aliyuncs.com/api/v1"
# Encoding function: convert a local file to a Base64-encoded string
def encode_audio(audio_file_path):
  with open(audio_file_path, "rb") as audio_file:
    return base64.b64encode(audio_file.read()).decode("utf-8")

# Replace ABSOLUTE_PATH/welcome.mp3 with the absolute path of your local audio file.
audio_file_path = "xxx/ABSOLUTE_PATH/welcome.mp3"
base64_audio = encode_audio(audio_file_path)

messages = [
  {
    "role": "user",
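    # When passing a local file as Base64, the data: prefix is required so that the file URL is valid.
    # The "base64" keyword must precede the Base64-encoded data (base64_audio); otherwise an error is returned.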
    "content": [{"audio":f"data:;base64,{base64_audio}"}],
  }
]

response = dashscope.MultiModalConversation.call(
  # If the environment variable is not configured, replace the next line with: api_key="sk-xxx"
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  model="qwen3-omni-30b-a3b-captioner",
  messages=messages,
)
print(response.output.choices[0].message.content[0]["text"])

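API reference

For the input and output parameters of Qwen3-Omni-Captioner, see the Chat completions API.

Error codes

If a call fails, see Error messages.

FAQ

How do I compress an audio file to the required size?

Online tools: you can compress audio files with an online tool such as Compresss.
Code: you can use FFmpeg. For detailed usage, see the official FFmpeg website.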

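# Basic conversion command (general template)
# -i: input file path. Example: input.mp3

# -b:a: audio bitrate.
#   Common values: 64k (low quality, suited to speech and low-bandwidth streaming), 128k (medium quality, suited to general audio and podcasts), 192k (high quality, suited to music and broadcasting).
#   The higher the bitrate, the better the quality and the larger the file.

# -ar: audio sampling rate, that is, the number of samples per second.
#   Common values: 8000 Hz, 22050 Hz, 44100 Hz (standard sampling rate).
#   The higher the sampling rate, the larger the file.

# -ac: number of audio channels. Common values: 1 (mono), 2 (stereo). Mono files are smaller.

# -y: overwrite the output file if it already exists (takes no value).
# output.mp3: output file path.

ffmpeg -y -i input.mp3 -b:a 128k -ar 44100 -ac 1 output.mp3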
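
When choosing a value for -b:a, a back-of-the-envelope estimate helps: for constant-bitrate audio, file size ≈ bitrate × duration / 8. The sketch below applies that approximation; it ignores container overhead and variable-bitrate encoding, the function names are hypothetical, and 1 kbps is taken as 1000 bits per second.

```python
def estimated_size_bytes(bitrate_kbps: float, duration_seconds: float) -> float:
    """Approximate size of constant-bitrate audio: bits per second x seconds / 8."""
    return bitrate_kbps * 1000 * duration_seconds / 8


def max_bitrate_kbps(duration_seconds: float, limit_bytes: int) -> int:
    """Largest whole-kbps bitrate whose estimated size stays within limit_bytes."""
    return int(limit_bytes * 8 / duration_seconds / 1000)


# A 10-minute clip at 128 kbps is about 9.6 million bytes.
print(estimated_size_bytes(128, 600))             # 9600000.0
# A 40-minute clip must be near 34 kbps to fit in 10 * 1024 * 1024 bytes.
print(max_bitrate_kbps(2400, 10 * 1024 * 1024))   # 34
```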

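Limitations

The model places the following limits on audio files:

Duration: no more than 40 minutes.
Number of files: only one audio file per request.
File format: AMR, WAV (CodecID: GSM_MS), WAV (PCM), 3GP, 3GPP, AAC, and MP3 are supported.
File input method: a publicly accessible audio URL, Base64 encoding, or a local file path.
File size:
- Public URL: no more than 1 GB.
- File path: the audio file must not exceed 10 MB.
- Base64 encoding: the encoded Base64 string must not exceed 10 MB. For details, see Pass local files.
To compress a file, see How do I compress an audio file to the required size?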
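
These limits can be checked before sending a request, as in the sketch below. All names are hypothetical, MB and GB are read as binary units (an assumption), and the format check looks only at the file extension, so it cannot tell which codec a WAV file uses. For Base64 input, size_bytes should be the length of the encoded string.

```python
import os

MB = 1024 * 1024  # assumption: binary megabytes
MAX_DURATION_SECONDS = 40 * 60
ALLOWED_EXTENSIONS = {".amr", ".wav", ".3gp", ".3gpp", ".aac", ".mp3"}
SIZE_LIMITS = {"url": 1024 * MB, "path": 10 * MB, "base64": 10 * MB}


def check_audio(filename: str, duration_seconds: float,
                size_bytes: int, input_mode: str) -> list:
    """Return a list of limit violations; an empty list means no problem was found."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file extension: {extension or '(none)'}")
    if duration_seconds > MAX_DURATION_SECONDS:
        problems.append("duration exceeds 40 minutes")
    if input_mode not in SIZE_LIMITS:
        problems.append(f"unknown input mode: {input_mode}")
    elif size_bytes > SIZE_LIMITS[input_mode]:
        problems.append(f"size exceeds the limit for input mode '{input_mode}'")
    return problems


print(check_audio("meeting.mp3", 1800, 8 * MB, "path"))
print(check_audio("meeting.flac", 3000, 11 * MB, "path"))
```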

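Alternative: use Qwen-Omni

You can also use Qwen-Omni (qwen3-omni-flash) with a prompt for audio understanding. Unlike Qwen3-Omni-Captioner, which generates a description directly without a prompt, Qwen-Omni lets you ask specific questions about the audio.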

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
  model="qwen3-omni-flash",
  messages=[{
    "role": "user",
    "content": [
      {"type": "input_audio", "input_audio": {"data": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20250211/tixcef/cherry.wav"}},
      {"type": "text", "text": "What is being said in this audio? Describe the speaker's emotion."}
    ]
  }],
  modalities=["text"],
  stream=True,
)

for chunk in completion:
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="")

For the full capabilities of Qwen-Omni, including multimodal conversations with audio output, see Audio and video file understanding.
