GUI-Plus

GUI-Plus DashScope

Call GUI-Plus, a model specialized for GUI interaction, through the DashScope native HTTP API.

POST /api/v1/services/aigc/multimodal-generation/generation

import os
import dashscope

system_prompt = """# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\n* `type`: Type a string of text on the keyboard.\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\n* `scroll`: Performs a scroll of the mouse scroll wheel.\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\n* `wait`: Wait specified seconds for the change to happen.\n* `terminate`: 
Terminate the current task and report its completion status.\n* `answer`: Answer a question.\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>

# Response format

Response format for every step:
1) Action: a short imperative describing what to do in the UI.
2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.

Rules:
- Output exactly in the order: Action, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those two parts.
- If finishing, use action=terminate in the tool call."""

messages = [
  {
    "role": "system",
    "content": system_prompt
  },
  {
    "role": "user",
    "content": [
      {"image": "https://img.alicdn.com/imgextra/i2/O1CN016iJ8ob1C3xP1s2M6z_!!6000000000026-2-tps-3008-1758.png"},
      {"text": "Open the browser for me."}]
  }]

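# Send the request; the API key is read from the DASHSCOPE_API_KEY environment variable.
response = dashscope.MultiModalConversation.call(
  api_key=os.getenv('DASHSCOPE_API_KEY'),
  model='gui-plus-2026-02-26',
  messages=messages,
  vl_high_resolution_images=True  # raise the pixel cap for the input screenshot
)

# Print the GUI instruction generated by the model.
print(response.output.choices[0].message.content[0]["text"])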

{
  "status_code": 200,
  "request_id": "b74b3a25-3968-4059-8c44-63d793c07f02",
  "code": "",
  "message": "",
  "output": {
    "text": null,
    "finish_reason": null,
    "choices": [
      {
        "finish_reason": "stop",
        "message": {
          "role": "assistant",
          "content": [
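            {
              "text": "```json\n{\"thought\": \"The user wants to open the browser. The screenshot shows a Google Chrome icon, the last one in the row at the top right. The next step is therefore to click this Chrome icon to launch it.\", \"action\": \"CLICK\", \"parameters\": {\"x\": 1086, \"y\": 127}}\n```"
            }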
          ]
        }
      }
    ],
    "audio": null
  },
  "usage": {
    "input_tokens": 2021,
    "output_tokens": 78,
    "characters": 0,
    "image_tokens": 1244,
    "input_tokens_details": {
      "image_tokens": 1244,
      "text_tokens": 777
    },
    "output_tokens_details": {
      "text_tokens": 78
    },
    "total_tokens": 2099
  }
}
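
In the sample response above, the model's instruction arrives as a JSON object wrapped in a Markdown json fence inside the `text` field. The following is a minimal sketch of extracting it, assuming replies keep that shape (an object with `action` and `parameters`). `parse_action` is a hypothetical helper for illustration and is not part of the DashScope SDK.

```python
import json
import re

# Markdown code-fence marker, built from single backticks so that this
# example does not itself contain a literal fence.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_action(text: str) -> dict:
    """Hypothetical helper: return the JSON object in a model reply.

    Handles replies with or without a surrounding json fence.
    """
    match = FENCE_RE.search(text)
    payload = match.group(1) if match else text.strip()
    return json.loads(payload)


# Shortened version of the sample reply shown above.
sample = (
    FENCE
    + 'json\n{"thought": "...", "action": "CLICK", '
    + '"parameters": {"x": 1086, "y": 127}}\n'
    + FENCE
)
step = parse_action(sample)
print(step["action"], step["parameters"])  # CLICK {'x': 1086, 'y': 127}
```

A caller would typically dispatch on `step["action"]` and pass `step["parameters"]` to whatever automation layer performs the click or keystroke.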

Authentication

Authorization (string, header, required)

Qwen Cloud API Key. See Get an API Key.
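
As a rough illustration of how the API key and the endpoint path fit together in a raw HTTP call, the sketch below prepares a POST request without sending it. Several details are assumptions not stated on this page: the host `https://dashscope.aliyuncs.com`, the `Bearer` scheme in the `Authorization` header, and the `input` object wrapping `messages`. `build_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumption: the host is not given on this page; use the base URL for your region.
BASE_URL = "https://dashscope.aliyuncs.com"
PATH = "/api/v1/services/aigc/multimodal-generation/generation"


def build_request(api_key: str, messages: list, model: str = "gui-plus") -> urllib.request.Request:
    """Hypothetical helper: prepare (but do not send) an authenticated POST request."""
    # Assumption: messages are nested under an "input" object in the body.
    body = {"model": model, "input": {"messages": messages}}
    return urllib.request.Request(
        BASE_URL + PATH,
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumption: the API key is sent as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "sk-placeholder",
    [{"role": "user", "content": [{"text": "Open the browser for me."}]}],
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; in practice the key should come from an environment variable such as `DASHSCOPE_API_KEY` rather than a literal.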

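Request body

application/json

model (enum<string>, required)

Model name.

Allowed values: gui-plus, gui-plus-2026-02-26

input (object, required)

input.messages (object[], required)

The model's conversation history, in chronological order. Each message takes one of three forms; the fields of a system message are listed here.

input.messages[].role (enum<string>, required)

Message role; fixed to system.

Allowed values: system

input.messages[].content (string, required)

System message content. Either a string or an array containing {"text": "..."}.

parameters (object)

parameters.vl_high_resolution_images (boolean, default: false)

Whether to raise the pixel cap for input images to the pixel count that corresponds to 16384 tokens.

parameters.enable_thinking (boolean)

Whether to enable thinking mode. Supported only by gui-plus-2026-02-26. SDK parameter name: enableThinking.

parameters.max_tokens (integer)

Maximum number of tokens the model may output. SDK parameter name: maxTokens.

parameters.seed (integer)

Random seed, in the range [0, 2^31-1].

parameters.temperature (number, default: 0.01)

Sampling temperature. Range: [0, 2). Set only one of temperature and top_p.

parameters.top_p (number, default: 0.01)

Probability threshold for nucleus sampling. Range: (0, 1.0]. SDK parameter name: topP.

parameters.top_k (integer, default: 1)

Size of the sampling candidate set. SDK parameter name: topK.

parameters.repetition_penalty (number, default: 1)

Penalty applied to repetition within consecutive sequences. 1.0 means no penalty. SDK parameter name: repetitionPenalty.

parameters.presence_penalty (number, default: 1.5)

Controls how much content is repeated in the generated text. Range: [-2.0, 2.0].

parameters.incremental_output (boolean, default: false)

Whether to enable incremental output in streaming mode; true is recommended. false: each chunk contains everything generated so far (cumulative output). true: each chunk contains only the newly generated content (incremental output). SDK parameter name: incrementalOutput.

parameters.stop (string)

Stop words. Generation ends immediately when the specified string or token_id appears in the generated text.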
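
The value ranges listed for the sampling parameters can be checked on the client before a request is sent. The sketch below assumes the snake_case keys `temperature`, `top_p`, `seed`, and `presence_penalty` for the request body; `build_parameters` is a hypothetical helper, and the service may apply its own validation regardless.

```python
def build_parameters(temperature=None, top_p=None, seed=None, presence_penalty=None) -> dict:
    """Hypothetical helper: collect sampling parameters, rejecting out-of-range values."""
    params = {}
    if temperature is not None:
        if not 0 <= temperature < 2:  # documented range [0, 2)
            raise ValueError("temperature must be in [0, 2)")
        params["temperature"] = temperature
    if top_p is not None:
        if not 0 < top_p <= 1.0:  # documented range (0, 1.0]
            raise ValueError("top_p must be in (0, 1.0]")
        params["top_p"] = top_p
    if seed is not None:
        if not 0 <= seed <= 2**31 - 1:  # documented range [0, 2^31-1]
            raise ValueError("seed must be in [0, 2^31-1]")
        params["seed"] = seed
    if presence_penalty is not None:
        if not -2.0 <= presence_penalty <= 2.0:  # documented range [-2.0, 2.0]
            raise ValueError("presence_penalty must be in [-2.0, 2.0]")
        params["presence_penalty"] = presence_penalty
    return params


print(build_parameters(temperature=0.01, seed=42))
```

Only values that are explicitly set end up in the returned dict, so unset parameters fall back to the service defaults.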

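Response

200 - application/json

status_code (integer)

Status code of this request. 200 indicates success. The Java SDK does not return this parameter; a failed call throws an exception instead.

request_id (string)

Unique identifier of this call. The Java SDK returns it as requestId.

code (string)

Error code; empty when the call succeeds. Returned only by the Python SDK.

message (string)

Error message.

output (object)

output.text (null)

output.finish_reason (string | null)

null during generation; stop when generation ends naturally; length when max_tokens is exceeded.

output.choices (object[])

output.choices[].finish_reason (enum<string>)

Allowed values: stop, length

output.choices[].message (object)

output.choices[].message.role (enum<string>)

Allowed values: assistant

output.choices[].message.content (object[])

output.choices[].message.content[].text (string)

The GUI operation instruction generated by the model.

output.audio (null)

usage (object)

usage.input_tokens (integer)

Number of input tokens.

usage.output_tokens (integer)

Number of output tokens.

usage.total_tokens (integer)

Total number of tokens.

usage.image_tokens (integer)

Number of tokens taken up by image content.

usage.characters (integer)

usage.input_tokens_details (object)

Integer sub-properties; the sample response shows image_tokens and text_tokens.

usage.output_tokens_details (object)

usage.output_tokens_details.text_tokens (integer)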
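
The status and finish-reason fields described in this section can be checked before the generated text is used. The sketch below works on the response as a plain dict shaped like the sample response in this document; `extract_text` is a hypothetical helper, and treating a `length` finish reason as an error is one possible policy, not a documented requirement.

```python
def extract_text(response: dict) -> str:
    """Hypothetical helper: return the generated text, or raise if the call failed or was cut off."""
    if response.get("status_code") != 200:
        raise RuntimeError(
            "request failed: %s %s" % (response.get("code"), response.get("message"))
        )
    choice = response["output"]["choices"][0]
    if choice["finish_reason"] == "length":
        # The output hit the max_tokens limit, so the instruction may be incomplete.
        raise RuntimeError("output truncated at max_tokens")
    return choice["message"]["content"][0]["text"]


# Trimmed-down response with the same structure as the sample in this document.
sample_response = {
    "status_code": 200,
    "code": "",
    "message": "",
    "output": {
        "choices": [
            {
                "finish_reason": "stop",
                "message": {"role": "assistant", "content": [{"text": "CLICK"}]},
            }
        ]
    },
}
print(extract_text(sample_response))  # CLICK
```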
