POST
/api/v1/services/aigc/multimodal-generation/generation
import os
import dashscope
system_prompt = """# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: 
Terminate the current task and report its completion status.\n* `answer`: Answer a question.\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
# Response format
Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""
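messages = [
    {
        "role": "system",
        "content": system_prompt
    },
    {
        "role": "user",
        "content": [
            # Screenshot of the current desktop
            {"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"},
            # User instruction: "Open the browser for me."
            {"text": "帮我打开浏览器。"}
        ]
    }
]
# Call the model; vl_high_resolution_images raises the pixel limit for input images
response = dashscope.MultiModalConversation.call(
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    model='gui-plus-2026-02-26',
    messages=messages,
    vl_high_resolution_images=True
)
# Print the text of the first content item in the first choice
print(response.output.choices[0].message.content[0]["text"])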
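The system prompt in the request asks the model to reply with an Action line followed by a single `<tool_call>` block. The following is a minimal sketch of how such a reply could be parsed, assuming the model follows that layout. The helper name `parse_tool_call` and the sample reply are hypothetical and are not part of the API.

```python
import json
import re

# Captures whatever sits between the <tool_call> tags requested by the system prompt.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_call(reply_text):
    """Return (action_line, tool_call_dict) for one reply, or raise ValueError."""
    match = TOOL_CALL_RE.search(reply_text)
    if match is None:
        raise ValueError("no <tool_call> block found in reply")
    call = json.loads(match.group(1).strip())
    # Everything before the block is treated as the Action line.
    action_line = reply_text[:match.start()].strip()
    return action_line, call


# Hypothetical reply, written by hand to match the requested format.
sample_reply = (
    "Action: Click the browser icon.\n"
    "<tool_call>\n"
    '{"name": "computer_use", "arguments": '
    '{"action": "left_click", "coordinate": [500, 300]}}\n'
    "</tool_call>"
)

action_line, call = parse_tool_call(sample_reply)
print(action_line)
print(call["arguments"])
```

The function raises `ValueError` when no `<tool_call>` block is present, for instance when a reply uses a different layout, so callers may want to handle that case explicitly.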
{
    "status_code": 200,
    "request_id": "b74b3a25-3968-4059-8c44-63d793c07f02",
    "code": "",
    "message": "",
    "output": {
        "text": null,
        "finish_reason": null,
        "choices": [
            {
                "finish_reason": "stop",
                "message": {
                    "role": "assistant",
                    "content": [
                        {
                            "text": "```json\n{\"thought\": \"用户想要打开浏览器,我观察到屏幕截图中有一个Google Chrome的图标,其位置在右上角一排的最后一个。因此,下一步操作应该是点击这个Chrome浏览器图标来启动它。\", \"action\": \"CLICK\", \"parameters\": {\"x\": 1086, \"y\": 127}}\n```"
                        }
                    ]
                }
            }
        ],
        "audio": null
    },
    "usage": {
        "input_tokens": 2021,
        "output_tokens": 78,
        "characters": 0,
        "image_tokens": 1244,
        "input_tokens_details": {
            "image_tokens": 1244,
            "text_tokens": 777
        },
        "output_tokens_details": {
            "text_tokens": 78
        },
        "total_tokens": 2099
    }
}
Authentication
string, header, required
Qwen Cloud API key. See "Get an API Key" for details.
Request body
application/json
enum<string>, required
Model name.
Allowed values: gui-plus, gui-plus-2026-02-26
object, required
object
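boolean, default: false
Whether to raise the pixel limit for input images to the pixel count corresponding to 16384 tokens.
boolean
Whether to enable thinking mode. Supported only by gui-plus-2026-02-26. SDK parameter name: enableThinking.
integer
Maximum number of tokens the model may output. SDK parameter name: maxTokens.
integer
Random seed. Range: [0, 2^31-1].
number, default: 0.01
Sampling temperature. Range: [0, 2). Only one of temperature and top_p needs to be set.
number, default: 0.01
Probability threshold for nucleus sampling. Range: (0, 1.0]. SDK parameter name: topP.
integer, default: 1
Size of the sampling candidate set. SDK parameter name: topK.
number, default: 1
Penalty for repetition within consecutive sequences. 1.0 means no penalty. SDK parameter name: repetitionPenalty.
number, default: 1.5
Controls how much content is repeated in the generated text. Range: [-2.0, 2.0].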
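boolean, default: false
Whether to enable incremental output in streaming mode. Setting it to true is recommended. false: each chunk contains all content generated so far (cumulative output). true: each chunk contains only the newly generated content (incremental output). SDK parameter name: incrementalOutput.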
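The difference between cumulative and incremental output affects how a client assembles the final text. The sketch below illustrates this with hand-written chunk lists rather than a real stream. The helper name `chunks_to_text` and the sample chunks are hypothetical.

```python
def chunks_to_text(chunks, incremental):
    """Assemble the full reply from a list of streamed text chunks."""
    if incremental:
        # Incremental output: each chunk holds only the new text, so concatenate.
        return "".join(chunks)
    # Cumulative output: each chunk repeats everything so far, so keep the last one.
    return chunks[-1] if chunks else ""


# The same reply as it might arrive in each mode (illustrative data only).
incremental_chunks = ["Action: ", "Click ", "the icon."]
cumulative_chunks = ["Action: ", "Action: Click ", "Action: Click the icon."]

full_incremental = chunks_to_text(incremental_chunks, incremental=True)
full_cumulative = chunks_to_text(cumulative_chunks, incremental=False)
print(full_incremental == full_cumulative)
```

Concatenating cumulative chunks would duplicate text, which is one reason the incremental mode is simpler to consume.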
string
Stop words. Generation stops immediately when the generated text contains a specified string or token_id.
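The value ranges listed above can be checked on the client before a request is sent. This is a minimal sketch, not part of the SDK. The key names `temperature` and `top_p` appear in the descriptions above, while `seed` and `presence_penalty` are assumed names for the random seed and the [-2.0, 2.0] repetition control.

```python
def check_sampling_params(params):
    """Return a list of range violations for the given parameter dict."""
    problems = []

    seed = params.get("seed")  # assumed key name
    if seed is not None and not (0 <= seed <= 2**31 - 1):
        problems.append("seed must be in [0, 2^31-1]")

    temperature = params.get("temperature")
    if temperature is not None and not (0 <= temperature < 2):
        problems.append("temperature must be in [0, 2)")

    top_p = params.get("top_p")
    if top_p is not None and not (0 < top_p <= 1.0):
        problems.append("top_p must be in (0, 1.0]")

    penalty = params.get("presence_penalty")  # assumed key name
    if penalty is not None and not (-2.0 <= penalty <= 2.0):
        problems.append("presence_penalty must be in [-2.0, 2.0]")

    return problems


print(check_sampling_params({"seed": 42, "temperature": 0.01}))
print(check_sampling_params({"temperature": 2.0, "top_p": 0.0}))
```

Parameters that are absent from the dict are skipped, so the service defaults still apply to them.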
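Response
200 - application/json
integer
Status code of the request. 200 indicates success. The Java SDK does not return this parameter; a failed call throws an exception.
string
Unique identifier of the call. The Java SDK returns it as requestId.
string
Error code. Empty when the call succeeds. Returned only by the Python SDK.
string
Error message.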
object
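The status, identifier, and error fields can be checked before the output is read. The sketch below uses `types.SimpleNamespace` stand-ins instead of a real SDK response, assuming the Python SDK exposes `status_code`, `request_id`, `code`, and `message` as attributes in the same way it exposes `output`. The helper name `extract_text`, the failing status 400, and the error code shown are hypothetical.

```python
from types import SimpleNamespace


def extract_text(response):
    """Return the reply text, or raise RuntimeError if the call failed."""
    if response.status_code != 200:
        raise RuntimeError(
            f"request {response.request_id} failed: "
            f"{response.code}: {response.message}"
        )
    return response.output.choices[0].message.content[0]["text"]


# Stand-in for a successful response (shape taken from the sample JSON).
ok = SimpleNamespace(
    status_code=200, request_id="req-1", code="", message="",
    output=SimpleNamespace(choices=[
        SimpleNamespace(message=SimpleNamespace(content=[{"text": "hello"}]))
    ]),
)

# Stand-in for a failed response; the code and message are placeholders.
failed = SimpleNamespace(
    status_code=400, request_id="req-2",
    code="ExampleErrorCode", message="example message", output=None,
)

print(extract_text(ok))
try:
    extract_text(failed)
except RuntimeError as err:
    print(err)
```

Including the request identifier in the error text makes a failed call easier to trace later.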

