Text embedding

Prerequisites

Obtain an API key and set it as an environment variable. If you plan to use an SDK, install it first.
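
A quick way to confirm that the key is visible to your code is to read it from the environment before creating a client. The following is a minimal sketch: `load_api_key` is a hypothetical helper, not part of any SDK, and `DASHSCOPE_API_KEY` is the variable name used by the examples on this page.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError("DASHSCOPE_API_KEY is not set; export it before running the examples")
    return key


# Demonstrated with an explicit mapping so the sketch runs without a real key.
print(load_api_key({"DASHSCOPE_API_KEY": "sk-example"}))
```

Failing early with an explicit message tends to be easier to diagnose than an authentication error returned by the first request.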

Get embeddings

When calling the API, specify the text to embed and the model name in the request.

OpenAI compatible

import os
from openai import OpenAI

input_text = "The quality of the clothes is excellent"

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

completion = client.embeddings.create(
  model="text-embedding-v4",
  input=input_text
)

print(completion.model_dump_json())

DashScope

import dashscope
from http import HTTPStatus

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

input_text = "The quality of the clothes is excellent"
resp = dashscope.TextEmbedding.call(
  model="text-embedding-v4",
  input=input_text,
)

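if resp.status_code == HTTPStatus.OK:
  print(resp)
else:
  # Report the error code and message instead of failing silently
  print(f"Request failed: {resp.code} - {resp.message}")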

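Supported models

Model	Embedding dimensions	Batch size	Max tokens per batch	Supported languages
text-embedding-v4 (part of the Qwen3-Embedding series)	2048, 1536, 1024 (default), 768, 512, 256, 128, 64	10	33,000	100+ major languages, including Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian
text-embedding-v3	1024 (default), 768, 512	10	8,192	50+ major languages, including Chinese, English, Spanish, French, Portuguese, Indonesian, Japanese, Korean, German, and Russian

Batch size is the maximum number of texts that a single API call can process. For example, text-embedding-v4 has a batch size of 10, so a single request can embed up to 10 texts, each no longer than 33,000 tokens. This limit applies to:

String array input: the array can contain at most 10 elements.
File input: the text file can contain at most 10 lines.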
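
The batch-size limit can be handled on the client by splitting a long list of texts into chunks before calling the API. The following is a minimal sketch: `batch_texts` is a hypothetical helper, not part of any SDK, and its default limit of 10 mirrors the batch size in the table above. It enforces only the number of texts per request, not the token limit.

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 10  # batch size listed for text-embedding-v4 and text-embedding-v3


def batch_texts(texts: List[str], batch_size: int = MAX_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# 23 texts are split into requests of 10, 10, and 3 texts.
texts = [f"review {i}" for i in range(23)]
batches = list(batch_texts(texts))
print([len(batch) for batch in batches])
```

Each batch can then be passed as the `input` argument of one embedding request.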

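Core features

Switch embedding dimensions

text-embedding-v4 and text-embedding-v3 support custom embedding dimensions. Higher dimensions retain more semantic information but increase storage and compute costs.

General use (recommended): 1024 dimensions offer the best balance between performance and cost and suit most semantic retrieval tasks.
High precision: for domains that demand higher precision, choose 1536 or 2048 dimensions. Precision improves somewhat, but storage and compute costs rise significantly.
Resource constrained: for cost-sensitive scenarios, choose 768 dimensions or fewer. Resource consumption drops significantly, at the cost of some semantic information.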
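
To make the storage trade-off concrete, the sketch below estimates the raw size of an index at several dimensions. It assumes each vector component is stored as a 4-byte float32 and ignores any index overhead; both are assumptions for illustration, and `index_size_bytes` is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def index_size_bytes(num_vectors: int, dimension: int) -> int:
    """Raw storage for `num_vectors` dense vectors, excluding index overhead."""
    return num_vectors * dimension * BYTES_PER_FLOAT32


# Compare the raw footprint of one million vectors at several supported dimensions.
for dimension in (256, 768, 1024, 2048):
    gib = index_size_bytes(1_000_000, dimension) / 2**30
    print(f"{dimension:>4} dims: {gib:.2f} GiB")
```

Under these assumptions storage grows linearly with the dimension, so moving from 1024 to 2048 dimensions doubles the raw footprint.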

OpenAI compatible

import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)

resp = client.embeddings.create(
  model="text-embedding-v4",
  input=["I like it and will buy from here again"],
  # Set the embedding dimension to 256
  dimensions=256
)
print(f"Embedding dimensions: {len(resp.data[0].embedding)}")

DashScope

import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

resp = dashscope.TextEmbedding.call(
  model="text-embedding-v4",
  input=["I like it and will buy from here again"],
  # Set the embedding dimension to 256
  dimension=256
)

print(f"Embedding dimensions: {len(resp.output['embeddings'][0]['embedding'])}")

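Distinguish query text from document text (text_type)

This parameter is currently available only through the DashScope SDK and API.

In search tasks, embedding each kind of content according to its role yields the best retrieval results. The text_type parameter is designed for this purpose:

text_type: 'query': for query text entered by users. The model produces a more directional vector, similar to a "title" vector, optimized for asking and retrieving.
text_type: 'document' (default): for document text stored in a database. The model produces a vector that carries more comprehensive information, similar to a "body" vector, optimized for being retrieved.

Distinguish between query and document when matching short texts against long texts. For tasks in which all texts play the same role, such as clustering or classification, you do not need to set this parameter.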
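
As a minimal sketch of how the two roles might be applied, the helper below builds the keyword arguments for an embedding request. `embedding_request` is a hypothetical name; the `text_type` values come from the description above, and the resulting dictionary is meant to be unpacked into `dashscope.TextEmbedding.call`.

```python
from typing import Dict, List, Union

VALID_TEXT_TYPES = ("query", "document")


def embedding_request(texts: Union[str, List[str]], text_type: str = "document") -> Dict[str, object]:
    """Build keyword arguments for an embedding call with an explicit text role."""
    if text_type not in VALID_TEXT_TYPES:
        raise ValueError(f"text_type must be one of {VALID_TEXT_TYPES}")
    return {"model": "text-embedding-v4", "input": texts, "text_type": text_type}


# Documents are embedded once with the default role; each search embeds the query as "query".
doc_kwargs = embedding_request(["Return policy: items can be returned within 30 days"])
query_kwargs = embedding_request("how do I return an item", text_type="query")
print(query_kwargs["text_type"], doc_kwargs["text_type"])

# With the DashScope SDK installed and an API key configured:
# resp = dashscope.TextEmbedding.call(**query_kwargs)
```

Keeping the role explicit at the call site makes it harder to embed queries with the document default by accident.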

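Improve results with instructions (instruct)

This parameter is currently available only through the DashScope SDK and API.

A clear instruction written in English guides text-embedding-v4 to optimize vector quality for a specific retrieval scenario, which improves precision. To use this feature, set the text_type parameter to query.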

import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

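# Scenario: embed a search query for a research-paper search engine, adding an instruction that steers the vector toward that retrieval task.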
resp = dashscope.TextEmbedding.call(
  model="text-embedding-v4",
  input="Research papers on machine learning",
  text_type="query",
  instruct="Given a research paper query, retrieve relevant research paper"
)

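Dense and sparse vectors

The output_type parameter is currently available only through the DashScope SDK and API.

text-embedding-v4 and text-embedding-v3 support three vector output types to suit different retrieval strategies.

Vector type (`output_type`)	Key strengths	Main limitations	Typical use cases
dense	Deep semantic understanding. Recognizes synonyms and context, so results are more relevant.	Higher compute and storage costs. Exact keyword matches are not guaranteed.	Semantic search, AI chat, content recommendation.
sparse	High computational efficiency. Focuses on exact keyword matching and fast filtering.	Sacrifices semantic understanding. Cannot handle synonyms or context.	Log retrieval, product SKU search, exact information filtering.
dense&sparse	Combines semantics and keywords for the best search quality. Generation cost is the same, and the API call overhead matches that of single-vector mode.	Larger storage requirements. More complex system architecture and retrieval logic.	High-quality, production-grade hybrid search engines.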
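
One way to combine the two vector types at query time is a weighted sum of a dense similarity and a sparse overlap score. The sketch below is an illustration under stated assumptions, not the service's ranking method: sparse vectors are represented as plain `{index: weight}` dictionaries, and `hybrid_score` and its `alpha` weight are hypothetical names. Adapt the parsing to the sparse format that the API response actually uses.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product over the indices that both sparse vectors share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller dictionary
    return sum(weight * b[index] for index, weight in a.items() if index in b)


def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha: float = 0.7) -> float:
    """Weighted sum: `alpha` on dense similarity, `1 - alpha` on sparse overlap."""
    return alpha * cosine(q_dense, d_dense) + (1 - alpha) * sparse_dot(q_sparse, d_sparse)


# Toy vectors: identical dense direction, one shared sparse index (7).
score = hybrid_score([1.0, 0.0], [2.0, 0.0], {7: 0.5, 9: 0.2}, {7: 0.4}, alpha=0.5)
print(round(score, 3))
```

Sparse dot products are not bounded the way cosine similarity is, so a real system would likely normalize the two scores before mixing them.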

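Usage examples

The following code is for demonstration only. In production, compute document vectors in advance and store them in a vector database, so that only the query vector is computed at retrieval time.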
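
The separation between indexing time and query time can be sketched as follows. `VectorIndex` is a hypothetical in-memory stand-in for a vector database, and `embed` is any function that maps a list of texts to a list of vectors, for example a thin wrapper around an embedding API call. A fake embedder is used here so that the sketch runs without network access.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[List[str]], List[List[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Hypothetical in-memory store: documents are embedded once, at indexing time."""

    def __init__(self, embed: Embedder):
        self.embed = embed
        self.docs: List[str] = []
        self.vectors: List[List[float]] = []

    def add(self, docs: List[str]) -> None:
        self.vectors.extend(self.embed(docs))
        self.docs.extend(docs)

    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float]]:
        # Only the query is embedded at search time; document vectors are reused.
        query_vector = self.embed([query])[0]
        scored = [(doc, cosine(query_vector, vec)) for doc, vec in zip(self.docs, self.vectors)]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder for the demo: counts of the letters 'a' and 'e' plus a constant."""
    return [[float(t.count("a")), float(t.count("e")), 1.0] for t in texts]


index = VectorIndex(fake_embed)
index.add(["banana bread", "green tea", "cheese plate"])
print(index.search("salad bar", top_k=1))
```

Swapping `fake_embed` for a real embedding function leaves the rest of the structure unchanged, which is the main point of the design.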

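Semantic search

Compute the vector similarity between a query and each document to match them by meaning rather than by exact keywords.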

import dashscope
import numpy as np
from dashscope import TextEmbedding

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

def cosine_similarity(a, b):
  """Compute the cosine similarity between two vectors."""
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_search(query, documents, top_k=5):
  """Return the top_k documents most similar to the query."""
  # Embed the query
  query_resp = TextEmbedding.call(
    model="text-embedding-v4",
    input=query,
    dimension=1024
  )
  query_embedding = query_resp.output['embeddings'][0]['embedding']

  # Embed the documents
  doc_resp = TextEmbedding.call(
    model="text-embedding-v4",
    input=documents,
    dimension=1024
  )

  # Compute the similarity between the query and each document
  similarities = []
  for i, doc_emb in enumerate(doc_resp.output['embeddings']):
    similarity = cosine_similarity(query_embedding, doc_emb['embedding'])
    similarities.append((i, similarity))

  # Sort by similarity and return the top_k results
  similarities.sort(key=lambda x: x[1], reverse=True)
  return [(documents[i], sim) for i, sim in similarities[:top_k]]

# Example usage
documents = [
  "Artificial intelligence is a branch of computer science",
  "Machine learning is an important method for achieving artificial intelligence",
  "Deep learning is a subfield of machine learning"
]
query = "What is AI?"
results = semantic_search(query, documents, top_k=2)
for doc, sim in results:
  print(f"Similarity: {sim:.3f}, Document: {doc}")

Text clustering

Group similar texts automatically by analyzing the distances between their vectors.

# Requires scikit-learn: pip install scikit-learn
import dashscope
import numpy as np
from sklearn.cluster import KMeans

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

def cluster_texts(texts, n_clusters=2):
  """Cluster a list of texts into n_clusters groups."""
  # 1. Embed all texts
  resp = dashscope.TextEmbedding.call(
    model="text-embedding-v4",
    input=texts,
    dimension=1024
  )
  embeddings = np.array([item['embedding'] for item in resp.output['embeddings']])

  # 2. Cluster the vectors with KMeans
  kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init='auto').fit(embeddings)

  # 3. Group the texts by cluster label and return the result
  clusters = {i: [] for i in range(n_clusters)}
  for i, label in enumerate(kmeans.labels_):
    clusters[label].append(texts[i])
  return clusters


# Example usage
documents_to_cluster = [
  "Mobile phone company A releases a new phone",
  "Search engine company B launches a new system",
  "World Cup final: Argentina vs. France",
  "China wins another gold medal at the Olympics",
  "A company releases its latest AI chip",
  "European Cup match report"
]
clusters = cluster_texts(documents_to_cluster, n_clusters=2)
for cluster_id, docs in clusters.items():
  print(f"--- Cluster {cluster_id} ---")
  for doc in docs:
    print(f"- {doc}")

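Text classification

Compute the vector similarity between the input text and a set of predefined labels to classify text into new categories without any pre-labeled samples.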

import dashscope
import numpy as np

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

def cosine_similarity(a, b):
  """Compute the cosine similarity between two vectors."""
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def classify_text_zero_shot(text, labels):
  """Zero-shot text classification against candidate labels."""
  # 1. Embed the input text and all labels in one request
  resp = dashscope.TextEmbedding.call(
    model="text-embedding-v4",
    input=[text] + labels,
    dimension=1024
  )
  embeddings = resp.output['embeddings']
  text_embedding = embeddings[0]['embedding']
  label_embeddings = [emb['embedding'] for emb in embeddings[1:]]

  # 2. Compute the similarity between the text and each label
  scores = [cosine_similarity(text_embedding, label_emb) for label_emb in label_embeddings]

  # 3. Return the label with the highest similarity
  best_match_index = np.argmax(scores)
  return labels[best_match_index], scores[best_match_index]


# Example usage
text_to_classify = "The fabric of this dress is comfortable and the style is nice"
possible_labels = ["Digital Products", "Apparel & Accessories", "Food & Beverage", "Home & Living"]

label, score = classify_text_zero_shot(text_to_classify, possible_labels)
print(f"Input text: '{text_to_classify}'")
print(f"Best matching category: '{label}' (Similarity: {score:.3f})")

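Anomaly detection

Compute the similarity between a text vector and the center of the normal-sample vectors to identify data that deviates significantly from the normal pattern.

The threshold in the sample code is for demonstration only. In real workloads, similarity values depend on the content and distribution of the data, so there is no fixed threshold. Calibrate the value against your own dataset.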
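
One way to calibrate the threshold is to measure how similar known-normal samples are to their own center and place the cutoff a few standard deviations below the mean. The sketch below assumes those similarities have already been computed; `calibrate_threshold` and the multiplier `k` are hypothetical, and the sample values are made up for illustration.

```python
import statistics
from typing import Sequence


def calibrate_threshold(normal_similarities: Sequence[float], k: float = 3.0) -> float:
    """Return mean - k * stdev of the similarities observed on normal samples."""
    if len(normal_similarities) < 2:
        raise ValueError("need at least two samples to estimate the spread")
    mean = statistics.mean(normal_similarities)
    spread = statistics.stdev(normal_similarities)
    return mean - k * spread


# Made-up similarities between held-out normal comments and the normal center.
observed = [0.82, 0.79, 0.85, 0.80, 0.84]
threshold = calibrate_threshold(observed, k=3.0)
print(f"Calibrated threshold: {threshold:.3f}")
```

A larger `k` flags fewer texts as anomalies; the right value depends on how costly false alarms are for the workload.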

import dashscope
import numpy as np

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'


def cosine_similarity(a, b):
  """Compute the cosine similarity between two vectors."""
  return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def detect_anomaly(new_comment, normal_comments, threshold=0.6):
  """Flag new_comment as an anomaly if it is too dissimilar from the normal comments."""
  # 1. Embed all normal comments together with the new comment
  all_texts = normal_comments + [new_comment]
  resp = dashscope.TextEmbedding.call(
    model="text-embedding-v4",
    input=all_texts,
    dimension=1024
  )
  embeddings = [item['embedding'] for item in resp.output['embeddings']]

  # 2. Compute the center (mean) vector of the normal comments
  normal_embeddings = np.array(embeddings[:-1])
  normal_center_vector = np.mean(normal_embeddings, axis=0)

  # 3. Compute the similarity between the new comment and the center vector
  new_comment_embedding = np.array(embeddings[-1])
  similarity = cosine_similarity(new_comment_embedding, normal_center_vector)

  # 4. Treat the comment as an anomaly if the similarity is below the threshold
  is_anomaly = similarity < threshold
  return is_anomaly, similarity


# Example usage
normal_user_comments = [
  "Today's meeting was productive",
  "The project is progressing smoothly",
  "The new version will be released next week",
  "User feedback is positive"
]

test_comments = {
  "Normal comment": "The feature works as expected",
  "Anomaly - meaningless garbled text": "asdfghjkl zxcvbnm"
}

print("--- Anomaly Detection Example ---")
for desc, comment in test_comments.items():
  is_anomaly, score = detect_anomaly(comment, normal_user_comments)
  result = "Yes" if is_anomaly else "No"
  print(f"Comment: '{comment}'")
  print(f"Is anomaly: {result} (Similarity to normal samples: {score:.3f})\n")

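API reference

For multimodal embedding, see Multimodal embedding.

Error codes

If a call fails, see Error messages.

Rate limiting

See Rate limits.

Model performance (MTEB/CMTEB)

MTEB: Massive Text Embedding Benchmark, a comprehensive evaluation of general capability across tasks such as classification, clustering, and retrieval.
CMTEB: Chinese Massive Text Embedding Benchmark, an evaluation specific to Chinese text.
Scores range from 0 to 100; higher is better.

Model	MTEB	MTEB (retrieval tasks)	CMTEB	CMTEB (retrieval tasks)
text-embedding-v3 (512 dimensions)	62.11	54.30	66.81	71.88
text-embedding-v3 (768 dimensions)	62.43	54.74	67.90	72.29
text-embedding-v3 (1024 dimensions)	63.39	55.41	68.92	73.23
text-embedding-v4 (512 dimensions)	64.73	56.34	68.79	73.33
text-embedding-v4 (1024 dimensions)	68.36	59.30	70.14	73.98
text-embedding-v4 (2048 dimensions)	71.58	61.97	71.99	75.01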
