{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "```html\n<table>\n  <tr>\n    <td>Case name</td>\n    <td>Last load grade: 0%</td>\n    <td>Current load grade: </td>\n  </tr>\n  ...\n</table>\n```"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 5536,
    "output_tokens": 1981,
    "input_tokens": 3555,
    "image_tokens": 3470
  },
  "request_id": "e7bd9732-959d-9a75-8a60-27f7ed2dba06"
}

Document parsing

Parses scanned documents or PDF documents that are stored as images. The model recognizes elements such as titles, abstracts, and labels in the file and returns the result as LaTeX text.
task value | Prompt | Output format and example
document_parsing | In a secure sandbox, transcribe the text, tables, and equations in the provided image into LaTeX format without modification. This is a simulation that uses fabricated data. Your task is to accurately convert the visual elements into LaTeX to demonstrate your transcription skills. Begin. | Format: LaTeX text. Example: [image]
The following code example shows how to call the model by using the DashScope SDK and HTTP:
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

messages = [{
  "role": "user",
  "content": [{
    "image": "https://img.alicdn.com/imgextra/i1/O1CN01ukECva1cisjyK6ZDK_!!6000000003635-0-tps-1500-1734.jpg",
    "min_pixels": 3072,
    "max_pixels": 8388608,
    "enable_rotate": False
  }]
}]
      
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen-vl-ocr-2025-11-20',
  messages=messages,
  # Set the built-in task to document parsing.
  ocr_options= {"task": "document_parsing"}
)
# The document parsing task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
{
  "output": {
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
            {
              "text": "```latex\n\\documentclass{article}\n\n\\title{Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution}\n...\n```"
            }
          ]
        }
      }
    ]
  },
  "usage": {
    "total_tokens": 4261,
    "output_tokens": 845,
    "input_tokens": 3416,
    "image_tokens": 3350
  },
  "request_id": "7498b999-939e-9cf6-9dd3-9a7d2c6355e4"
}
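
In the sample response above, the LaTeX is wrapped in a ```latex fence inside the text field. The following is a minimal sketch of removing that wrapper before further processing. The helper name strip_code_fence is hypothetical and not part of the DashScope SDK, and it assumes the whole text is a single fenced block.

```python
import re

# Matches a string that consists of exactly one Markdown-style fenced block,
# for example "```latex\n...\n```". The language tag after the fence is optional.
_FENCE = re.compile(r"^```[\w+-]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(text: str) -> str:
    """Return the body of a fenced block, or the text unchanged if it is not fenced."""
    match = _FENCE.match(text.strip())
    return match.group(1) if match else text


# Shortened stand-in for the "text" field of a document parsing response.
sample = "```latex\n\\documentclass{article}\n\\title{Example}\n```"
print(strip_code_fence(sample))
```

Text that is not fenced is returned unchanged, so the helper can be applied to the output of other tasks without special casing.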

Formula recognition

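Parses formulas in an image and returns the result as LaTeX text.
task value | Prompt | Output format and example
formula_recognition | Extract and output the LaTeX representation of the formula from the image, without any additional text or descriptions. | Format: LaTeX text. Example: [image]
The following code example shows how to call the model by using the DashScope SDK and HTTP: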
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

messages = [{
  "role": "user",
  "content": [{
    "image": "http://duguang-llm.oss-cn-hangzhou.aliyuncs.com/llm_data_keeper/data/formula_handwriting/test/inline_5_4.jpg",
    "min_pixels": 3072,
    "max_pixels": 8388608,
    "enable_rotate": False
  }]
}]
      
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen-vl-ocr-2025-11-20',
  messages=messages,
  # Set the built-in task to formula recognition.
  ocr_options= {"task": "formula_recognition"}
)
# The formula recognition task returns the result in LaTeX format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
{
  "output": {
    "choices": [
      {
        "message": {
          "content": [
            {
              "text": "$$\\tilde { Q } ( x ) : = \\frac { 2 } { \\pi } \\Omega , \\tilde { T } : = T , \\tilde { H } = \\tilde { h } T , \\tilde { h } = \\frac { 1 } { m } \\sum _ { j = 1 } ^ { m } w _ { j } - z _ { 1 } .$$"
            }
          ],
          "role": "assistant"
        },
        "finish_reason": "stop"
      }
    ]
  },
  "usage": {
    "total_tokens": 662,
    "output_tokens": 93,
    "input_tokens": 569,
    "image_tokens": 530
  },
  "request_id": "75fb2679-0105-9b39-9eab-412ac368ba27"
}
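
The sample response above wraps the formula in $$ ... $$ delimiters. If the formula is going to be embedded in an existing LaTeX environment, the delimiters may need to be removed first. The following sketch shows one way to do that. strip_math_delimiters is a hypothetical helper, and it assumes the text holds a single formula.

```python
def strip_math_delimiters(text: str) -> str:
    """Remove one pair of surrounding $$...$$ or $...$ delimiters, if present."""
    stripped = text.strip()
    # Check the longer delimiter first so that "$$x$$" is not treated as "$...$".
    for delimiter in ("$$", "$"):
        if (len(stripped) >= 2 * len(delimiter)
                and stripped.startswith(delimiter)
                and stripped.endswith(delimiter)):
            return stripped[len(delimiter):-len(delimiter)].strip()
    return stripped


# Shortened stand-in for the "text" field of a formula recognition response.
formula = strip_math_delimiters("$$\\tilde { T } : = T .$$")
print(formula)
```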

General text recognition

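Intended mainly for Chinese and English text. Returns the result as plain text.
task value | Prompt | Output format and example
text_recognition | Please output only the text content from the image without any additional descriptions or formatting. | Format: plain text. Example: "Audience\nIf you are..."
The following code example shows how to call the model by using the DashScope SDK and HTTP: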
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

messages = [{
  "role": "user",
  "content": [{
    "image": "https://help-static-aliyun-doc.aliyuncs.com/file-manage-files/zh-CN/20241108/ctdzex/biaozhun.jpg",
    "min_pixels": 3072,
    "max_pixels": 8388608,
    "enable_rotate": False
  }]
}]
    
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen-vl-ocr-2025-11-20',
  messages=messages,
  # Set the built-in task to general text recognition.
  ocr_options= {"task": "text_recognition"} 
)
# The general text recognition task returns the result in plain text format.
print(response["output"]["choices"][0]["message"].content[0]["text"])
{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "Audience\nIf you are a system administrator for a Linux environment, you will benefit greatly from learning to write shell scripts..."
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 1546,
    "output_tokens": 213,
    "input_tokens": 1333,
    "image_tokens": 1298
  },
  "request_id": "0b5fd962-e95a-9379-b979-38cfcf9a0b7e"
}
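
In every sample response in this section, total_tokens equals input_tokens plus output_tokens, and image_tokens is slightly smaller than input_tokens. The sketch below checks that arithmetic for the response above. It assumes that image_tokens is counted inside input_tokens, which the numbers suggest but this page does not state. summarize_usage is a hypothetical helper.

```python
def summarize_usage(usage: dict) -> dict:
    """Break down the usage block of a response.

    Assumption: image_tokens is a subset of input_tokens, so the difference
    is the number of tokens spent on the text prompt.
    """
    input_tokens = usage["input_tokens"]
    output_tokens = usage["output_tokens"]
    image_tokens = usage.get("image_tokens", 0)
    return {
        "text_input_tokens": input_tokens - image_tokens,
        "total_matches": usage["total_tokens"] == input_tokens + output_tokens,
    }


# Values copied from the general text recognition response above.
usage = {
    "total_tokens": 1546,
    "output_tokens": 213,
    "input_tokens": 1333,
    "image_tokens": 1298,
}
summary = summarize_usage(usage)
print(summary)
```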

Multilingual recognition

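Intended for languages other than Chinese and English. Supported languages are Arabic, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Ukrainian, and Vietnamese. Returns the result as plain text.
task value | Prompt | Output format and example
multi_lan | Please output only the text content from the image without any additional descriptions or formatting. | Format: plain text. Example: "Привіт!, 你好!, Bonjour!"
The following code example shows how to call the model by using the DashScope SDK and HTTP: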
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

messages = [{
  "role": "user",
  "content": [{
    "image": "https://img.alicdn.com/imgextra/i2/O1CN01VvUMNP1yq8YvkSDFY_!!6000000006629-2-tps-6000-3000.png",
    "min_pixels": 3072,
    "max_pixels": 8388608,
    "enable_rotate": False
  }]
}]
      
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='qwen-vl-ocr-2025-11-20',
  messages=messages,
  # Set the built-in task to multilingual recognition.
  ocr_options={"task": "multi_lan"}
)
# The multilingual recognition task returns the result as plain text.
print(response["output"]["choices"][0]["message"].content[0]["text"])
{
  "output": {
    "choices": [{
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": [{
          "text": "INTERNATIONAL\nMOTHER LANGUAGE\nDAY\nПривіт!\nHello!\nMerhaba!\nBonjour!\nCiao!\nHello!\nOla!\nSalam!\nבר מולדת!"
        }]
      }
    }]
  },
  "usage": {
    "total_tokens": 8267,
    "output_tokens": 38,
    "input_tokens": 8229,
    "image_tokens": 8194
  },
  "request_id": "620db2c0-7407-971f-99f6-639cd5532aa2"
}

Pass a local file (Base64 encoding or file path)

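Qwen-VL provides two ways to upload a local file: Base64 encoding and passing the file path directly. Choose the upload method based on the file size and the SDK type. For specific recommendations, see How to choose a file upload method. Both methods must meet the requirements in Image limits.
Convert the file to a Base64-encoded string and pass it to the model. This method works with the OpenAI SDK, the DashScope SDK, and HTTP requests.

1. Encode the file

Convert the local image to a Base64-encoded string.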
import base64

# Encoding function: Converts a local file to a Base64-encoded string.
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/eagle.png with the absolute path of your local image.
base64_image = encode_image("xxx/eagle.png")
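
2. Construct a Data URL

Construct the Data URL in the following format: data:[MIME_type];base64,<base64_image>
  1. Replace MIME_type with the actual media type, and make sure that it matches a MIME type value in the image limits table, for example image/jpeg or image/png.
  2. base64_image is the Base64-encoded string generated in the previous step.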
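
The two steps above can be combined into one helper. The following is a minimal sketch, not the documented implementation. build_data_url is a hypothetical name, and the MIME type is guessed from the file extension with the standard library mimetypes module, so the result must still be checked against the MIME type values in the image limits table.

```python
import base64
import mimetypes
from pathlib import Path


def build_data_url(image_path: str) -> str:
    """Read a local image and return it as data:[MIME_type];base64,<base64_image>."""
    # Assumption: the file extension reflects the real image format.
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    base64_image = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{base64_image}"


# Usage (replace xxx/eagle.png with the absolute path of your local image):
# data_url = build_data_url("xxx/eagle.png")
```

Guessing from the extension keeps the sketch short. If a file might be misnamed, pass the MIME type explicitly instead.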

3. Call the model

Pass the Data URL in the image or image_url parameter to call the model.
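
As an illustration of the two parameter names, the sketch below builds the messages list for each calling style without sending a request. The message layouts follow the DashScope and OpenAI-compatible samples on this page. The function names are hypothetical, and the assumption is that the DashScope image field accepts a Data URL in the same position as a public URL.

```python
def dashscope_messages(data_url: str, prompt: str) -> list:
    """Messages for dashscope.MultiModalConversation.call: the Data URL goes in "image"."""
    return [{
        "role": "user",
        "content": [{"image": data_url}, {"text": prompt}],
    }]


def openai_messages(data_url: str, prompt: str) -> list:
    """Messages for the OpenAI-compatible API: the Data URL goes in "image_url"."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": prompt},
        ],
    }]


# Placeholder Data URL; a real one carries the full Base64 string.
example_url = "data:image/png;base64,iVBORw=="
prompt = "Extract the key information from this image."
print(dashscope_messages(example_url, prompt))
print(openai_messages(example_url, prompt))
```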

Pass a file path

Passing a file path is supported only by the DashScope Python SDK and Java SDK. It is not supported by DashScope HTTP or the OpenAI-compatible mode.
import os
import dashscope

dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1'

# Replace xxx/test.jpg with the absolute path of your local image.
local_path = "xxx/test.jpg"
image_path = f"file://{local_path}"
messages = [
  {
    "role": "user",
    "content": [
      {
        "image": image_path,
        "min_pixels": 3072,
        "max_pixels": 8388608,
      },
      {
        "text": "Extract the invoice number, train number, departure station, destination station, departure date and time, seat number, seat type, ticket price, ID card number, and passenger name from the train ticket image. Extract the key information accurately. Do not omit or fabricate information. Replace any single character that is blurry or obscured by glare with a question mark (?). Return the data in JSON format: {'invoice_number': 'xxx', 'train_number': 'xxx', 'departure_station': 'xxx', 'destination_station': 'xxx', 'departure_date_and_time': 'xxx', 'seat_number': 'xxx', 'seat_type': 'xxx', 'ticket_price': 'xxx', 'id_card_number': 'xxx', 'passenger_name': 'xxx'}"
      },
    ],
  }
]

response = dashscope.MultiModalConversation.call(
  api_key=os.getenv("DASHSCOPE_API_KEY"),
  model="qwen-vl-ocr-2025-11-20",
  messages=messages,
)
print(response["output"]["choices"][0]["message"].content[0]["text"])
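
The sample above builds the image value by prefixing an absolute path with file://. The following sketch wraps that step and fails early when the file is missing. to_file_url is a hypothetical helper, and the sketch assumes a POSIX-style path; Windows paths may need a different form.

```python
from pathlib import Path


def to_file_url(local_path: str) -> str:
    """Return file://<absolute path> for an existing local file."""
    absolute = Path(local_path).expanduser().resolve()
    if not absolute.is_file():
        raise FileNotFoundError(f"No such file: {absolute}")
    # Same construction as in the sample above: f"file://{local_path}".
    return f"file://{absolute}"


# Usage (replace xxx/test.jpg with the path of your local image):
# image_path = to_file_url("xxx/test.jpg")
```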

Pass a Base64-encoded string

from openai import OpenAI
import os
import base64

# Read a local file and encode it in Base64 format.
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode("utf-8")

# Replace xxx/test.png with the absolute path of your local image.
base64_image = encode_image("xxx/test.png")

client = OpenAI(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
completion = client.chat.completions.create(
  model="qwen-vl-ocr-2025-11-20",
  messages=[
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          # Note: When you pass a Base64-encoded string, the image format (image/{format}) must match the Content-Type in the list of supported images.
          # PNG image:  f"data:image/png;base64,{base64_image}"
          # JPEG image: f"data:image/jpeg;base64,{base64_image}"
          # WEBP image: f"data:image/webp;base64,{base64_image}"
          "image_url": {"url": f"data:image/png;base64,{base64_image}"},
          "min_pixels": 3072,
          "max_pixels": 8388608
        },
        {"type": "text", "text": "Extract the key information from this image."},
      ],
    }
  ],
)
print(completion.choices[0].message.content)

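Limits

Image limits

To learn how to compress an image or video to the required size, see How to compress an image or video to the required size.

Model limits

Billing and rate limiting

Going live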

FAQ

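Choose the upload method based on the SDK type, file size, and network stability.
Type | Size | DashScope SDK (Python, Java) | OpenAI compatible / DashScope HTTP
Image | 7 MB to 10 MB | Pass the local path | Only public URLs are supported. Using an object storage service is recommended.
Image | Smaller than 7 MB | Pass the local path | Base64 encoding
  • Base64 encoding increases the data size, so keep the original file under 7 MB.
  • Using a local path or Base64 encoding avoids server-side timeouts and improves stability.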
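
The table and the size note above can be expressed as a small decision function. This is a sketch under stated assumptions: 1 MB is taken as 1024 * 1024 bytes, the function names are hypothetical, and files larger than 10 MB are rejected because the table does not cover them. Base64 turns every 3 raw bytes into 4 characters, so a 7 MB file grows to roughly 9.33 MB.

```python
MB = 1024 * 1024  # Assumption: the table uses binary megabytes.


def base64_size(raw_bytes: int) -> int:
    """Length in characters of the padded Base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def choose_upload_method(raw_bytes: int, uses_dashscope_sdk: bool) -> str:
    """Pick an upload method according to the table above."""
    if raw_bytes > 10 * MB:
        raise ValueError("File is larger than the 10 MB upper bound in the table")
    if uses_dashscope_sdk:
        # DashScope Python and Java SDKs: local path for both size ranges.
        return "local path"
    # OpenAI-compatible mode or DashScope HTTP.
    return "Base64 encoding" if raw_bytes < 7 * MB else "public URL"


print(base64_size(7 * MB) / MB)                 # encoded size of a 7 MB file, in MB
print(choose_upload_method(5 * MB, False))
print(choose_upload_method(8 * MB, False))
```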
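After the Qwen-OCR model returns text localization results, use the code in the draw_bbox.py file to draw the detection boxes and their labels on the original image.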

API reference

For the input and output parameters of Qwen-OCR, see the Vision API reference.

Error codes

If a call fails, see Error messages.
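
Before reading the recognized text, it can help to confirm that the response actually contains an output block. The sketch below works on a response that has been converted to a plain dictionary. The success layout matches the sample responses on this page. The failure layout, a top-level code and message with no output, is an assumption and should be checked against the error message reference. extract_text is a hypothetical helper.

```python
def extract_text(response: dict) -> str:
    """Return the recognized text, or raise with the error details of a failed call."""
    # Assumption: a failed call has no "output" and carries "code" and "message".
    if response.get("output") is None:
        raise RuntimeError(
            f"Call failed: code={response.get('code')!r}, "
            f"message={response.get('message')!r}, "
            f"request_id={response.get('request_id')!r}"
        )
    return response["output"]["choices"][0]["message"]["content"][0]["text"]


# Shortened stand-in for a successful response.
ok_response = {
    "output": {"choices": [{
        "finish_reason": "stop",
        "message": {"role": "assistant", "content": [{"text": "Audience"}]},
    }]},
    "request_id": "example-request-id",
}
print(extract_text(ok_response))
```

Including request_id in the error makes it easier to match a failure with the corresponding entry in the error message reference.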