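Multimodal embedding

Multimodal embedding models convert text, images, and video into numeric vectors for cross-modal search (text-to-image, image-to-image, text-to-video) and content retrieval.

Multimodal embedding is available only through the DashScope SDK or the HTTP API; the OpenAI-compatible interface is not supported.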
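Cross-modal search usually comes down to comparing a query vector with indexed vectors. The sketch below ranks indexed images against a text query by cosine similarity. It is a minimal illustration, not part of the SDK: the helper names and the toy 3-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, indexed_vectors):
    """Return (item_id, score) pairs sorted from most to least similar."""
    scored = [
        (item_id, cosine_similarity(query_vector, vector))
        for item_id, vector in indexed_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real image embeddings (hypothetical data).
image_index = {
    "sneaker.png": [0.9, 0.1, 0.0],
    "shirt.png": [0.1, 0.9, 0.1],
}
# Toy vector standing in for the embedding of a text query.
text_query_vector = [0.8, 0.2, 0.0]

ranking = rank_by_similarity(text_query_vector, image_index)
print(ranking[0][0])
```

For large collections a vector database would normally replace the linear scan; the comparison itself stays the same.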

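Independent vectors

Generates a separate vector for each input (text, image, or video). This suits cases such as indexing images and their text captions separately.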

Python

import dashscope
import json
import os

image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"

resp = dashscope.MultiModalEmbedding.call(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  model="qwen3-vl-embedding",  # <-- recommended model, supports 33 languages
  input=[{"image": image}],
  # dimension=1024,  # <-- optional, defaults to 2560
)

print(json.dumps(resp.output, indent=4))

Java

import com.alibaba.dashscope.embeddings.MultiModalEmbedding;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingItemImage;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingParam;
import com.alibaba.dashscope.embeddings.MultiModalEmbeddingResult;
import com.alibaba.dashscope.exception.ApiException;
import com.alibaba.dashscope.exception.NoApiKeyException;
import com.alibaba.dashscope.exception.UploadFileException;

import java.util.Collections;

public class Main {
  public static void main(String[] args) {
    try {
      MultiModalEmbedding embedding = new MultiModalEmbedding();
      MultiModalEmbeddingItemImage image = new MultiModalEmbeddingItemImage(
        "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png");

      MultiModalEmbeddingParam param = MultiModalEmbeddingParam.builder()
        .model("qwen3-vl-embedding")
        .contents(Collections.singletonList(image))
        .build();

      MultiModalEmbeddingResult result = embedding.call(param);
      System.out.println(result);

    } catch (ApiException | NoApiKeyException | UploadFileException e) {
      System.err.println(e.getMessage());
    }
  }
}

Video input: replace {"image": image} with {"video": video_url}.
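The input list for independent vectors can be assembled with a small helper. This is only a sketch: `build_inputs` is a hypothetical helper, not an SDK function, the video URL is a placeholder, and it assumes a single call accepts several items of different types.

```python
def build_inputs(texts=(), image_urls=(), video_urls=()):
    """Build one input item per piece of content, so each gets its own vector."""
    items = []
    items.extend({"text": text} for text in texts)
    items.extend({"image": url} for url in image_urls)
    items.extend({"video": url} for url in video_urls)
    if not items:
        raise ValueError("at least one text, image, or video is required")
    return items


inputs = build_inputs(
    texts=["white sneakers"],
    image_urls=["https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"],
    video_urls=["https://example.com/demo.mp4"],  # hypothetical placeholder URL
)
print(inputs)
```

The resulting list can be passed as the `input` argument of `dashscope.MultiModalEmbedding.call`.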

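Fusion vectors

Encodes multimodal input (text + image + video) into a single vector. This suits mixed image-and-text retrieval: for example, given a shirt image plus the text "find similar but more youthful styles", the model fuses the image and the text instruction into one vector.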

import dashscope
import json
import os

text = "White sneakers, lightweight and breathable"
image = "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"

resp = dashscope.MultiModalEmbedding.call(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  model="qwen3-vl-embedding",
  input=[
    {"text": text},
    {"image": image},
  ],
  enable_fusion=True,  # <-- enable fusion vectors
  # dimension=1024,  # <-- optional, defaults to 2560
)

print(json.dumps(resp.output, indent=4))

Fusion method by model

Model	Fusion method
`qwen3-vl-embedding`	Add the `enable_fusion=True` parameter
`qwen2.5-vl-embedding`	Fusion is the default mode; no extra parameter is needed
`tongyi-embedding-vision-plus-2026-03-06`	Put text and image in the same content object: `[{"text": ..., "image": ...}]`
`tongyi-embedding-vision-flash-2026-03-06`	Same as above
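The per-model differences in the table above can be wrapped in one function that returns the right request arguments. This is a sketch under assumptions: `fusion_request` is a hypothetical helper, and the separate-item input shape used for `qwen2.5-vl-embedding` is assumed rather than confirmed by the table.

```python
TONGYI_FUSION_MODELS = (
    "tongyi-embedding-vision-plus-2026-03-06",
    "tongyi-embedding-vision-flash-2026-03-06",
)


def fusion_request(model, text, image_url):
    """Return keyword arguments for a fused text + image request."""
    separate_items = [{"text": text}, {"image": image_url}]
    if model == "qwen3-vl-embedding":
        # Fusion must be switched on explicitly.
        return {"model": model, "input": separate_items, "enable_fusion": True}
    if model == "qwen2.5-vl-embedding":
        # Fusion is the default mode (input shape is an assumption).
        return {"model": model, "input": separate_items}
    if model in TONGYI_FUSION_MODELS:
        # Text and image share one content object.
        return {"model": model, "input": [{"text": text, "image": image_url}]}
    raise ValueError(f"{model} is not listed as supporting fusion vectors")


kwargs = fusion_request(
    "qwen3-vl-embedding",
    "white sneakers",
    "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png",
)
print(kwargs)
```

The returned dictionary can then be unpacked into the call, for example `dashscope.MultiModalEmbedding.call(api_key=..., **kwargs)`.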

tongyi-embedding-vision-plus-2026-03-06 fusion example

This model fuses text and image when both are placed in the same content object; the enable_fusion parameter is not needed.

import dashscope
import json
import os

resp = dashscope.MultiModalEmbedding.call(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  model="tongyi-embedding-vision-plus-2026-03-06",
  input=[
    {"text": "White sneakers", "image": "https://dashscope.oss-cn-beijing.aliyuncs.com/images/256_1.png"}
  ],  # <-- text and image in the same object are fused automatically
  dimension=1152,
)

print(json.dumps(resp.output, indent=4))

Available models

Model	Vector type	Dimensions	Max tokens	Languages
`qwen3-vl-embedding`	Fusion + independent	256~2560 (default 2560)	32,000	33 languages
`qwen2.5-vl-embedding`	Fusion only	512~2048 (default 1024)	32,000	11 languages
`tongyi-embedding-vision-plus-2026-03-06`	Fusion + independent	64~1152 (default 1152)	1,024	30+ languages
`tongyi-embedding-vision-flash-2026-03-06`	Fusion + independent	64~768 (default 768)	1,024	30+ languages
`tongyi-embedding-vision-plus`	Independent only	64~1152 (default 1152)	1,024	Chinese, English
`tongyi-embedding-vision-flash`	Independent only	64~768 (default 768)	1,024	Chinese, English
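A requested `dimension` can be checked against the table above before calling the API. The sketch below validates only the lower and upper bounds listed per model; the service may accept only specific values within a range, which this hypothetical helper does not check.

```python
# (minimum, maximum, default) dimensions copied from the table above.
DIMENSION_RANGES = {
    "qwen3-vl-embedding": (256, 2560, 2560),
    "qwen2.5-vl-embedding": (512, 2048, 1024),
    "tongyi-embedding-vision-plus-2026-03-06": (64, 1152, 1152),
    "tongyi-embedding-vision-flash-2026-03-06": (64, 768, 768),
    "tongyi-embedding-vision-plus": (64, 1152, 1152),
    "tongyi-embedding-vision-flash": (64, 768, 768),
}


def resolve_dimension(model, dimension=None):
    """Return the dimension to request, falling back to the model default."""
    if model not in DIMENSION_RANGES:
        raise ValueError(f"unknown model: {model}")
    low, high, default = DIMENSION_RANGES[model]
    if dimension is None:
        return default
    if not low <= dimension <= high:
        raise ValueError(
            f"{model} supports dimensions from {low} to {high}, got {dimension}"
        )
    return dimension


print(resolve_dimension("qwen3-vl-embedding"))
print(resolve_dimension("qwen3-vl-embedding", 1024))
```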

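Image limits: qwen3-vl accepts up to 10 MB per image; qwen2.5-vl up to 5 MB per image; the tongyi series up to 5 MB per image (2026-03-06 versions) or 3 MB (older versions). Images can be passed as a URL or as Base64.
Video limits: qwen3-vl and qwen2.5-vl accept up to 50 MB; tongyi 2026-03-06 versions up to 50 MB, older versions up to 10 MB. Videos can be passed only as a URL.
Only have text data? Use text-embedding-v4: it is faster, cheaper, and offers more dimension options.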
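The image and video size limits above can be checked before a file is uploaded or referenced. The following is a rough sketch: the helper names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the limits are stated only in MB.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def size_limit_mb(model, kind):
    """Return the size limit in MB for an image or video, per model family."""
    if kind not in ("image", "video"):
        raise ValueError("kind must be 'image' or 'video'")
    if model.startswith("qwen3-vl"):
        limits = {"image": 10, "video": 50}
    elif model.startswith("qwen2.5-vl"):
        limits = {"image": 5, "video": 50}
    elif model.startswith("tongyi-embedding-vision"):
        if model.endswith("-2026-03-06"):
            limits = {"image": 5, "video": 50}
        else:
            limits = {"image": 3, "video": 10}  # older tongyi versions
    else:
        raise ValueError(f"unknown model: {model}")
    return limits[kind]


def within_size_limit(model, kind, size_bytes):
    """True if a file of size_bytes fits the limit for this model and kind."""
    return size_bytes <= size_limit_mb(model, kind) * BYTES_PER_MB


print(within_size_limit("qwen3-vl-embedding", "image", 8 * BYTES_PER_MB))
print(within_size_limit("tongyi-embedding-vision-plus", "image", 4 * BYTES_PER_MB))
```

For a local file, `os.path.getsize(path)` gives the byte count to pass as `size_bytes`.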

Learn more

Model overview

API reference