电脑端完整示例代码
import os import re import json import math import time import pyautogui import pyperclip import dashscope from PIL import Image # ===================== 步骤1:System Prompt ===================== system_prompt = """# Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call> # Response format Response format for every step: 1) Action: a short imperative describing what to do in the UI. 2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}. Rules: - Output exactly in the order: Action, <tool_call>. - Be brief: one for Action. - Do not output anything else outside those two parts. - If finishing, use action=terminate in the tool call.""" # ===================== 步骤2:构造多轮对话消息 ===================== def get_messages(image, instruction, history_output, system_prompt): history_n = 4 current_step = len(history_output) history_start_idx = max(0, current_step - history_n) previous_actions = [] for i in range(history_start_idx): if i < len(history_output): history_output_str = history_output[i]['output'] if 'Action:' in history_output_str and '<tool_call>': history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip() previous_actions.append(f"Step {i + 1}: {history_output_str}") previous_actions_str = "\\n".join(previous_actions) if previous_actions else "None" instruction_prompt = f""" Please generate the next move according to the UI screenshot, instruction and previous actions. Instruction: {instruction} Previous actions: {previous_actions_str}""" messages = [{"role": "system", "content": [{"text": system_prompt}]}] history_len = min(history_n, len(history_output)) if history_len > 0: for history_id, history_item in enumerate(history_output[-history_n:], 0): if history_id == 0: messages.append({ "role": "user", "content": [ {"text": instruction_prompt}, {"image": "file://" + history_item['image']} ] }) else: messages.append({ "role": "user", "content": [{"image": "file://" + history_item['image']}] }) messages.append({ "role": "assistant", "content": [{"text": history_item['output']}], }) messages.append({ "role": "user", "content": [{"image": "file://" + image}] }) else: messages.append({ "role": "user", "content": [ {"text": instruction_prompt}, {"image": "file://" + image} ] }) return messages # ===================== 步骤3:解析模型输出与坐标映射 ===================== def extract_tool_calls(text): pattern = re.compile(r'<tool_call>(.*?)</tool_call>', re.DOTALL | re.IGNORECASE) blocks = pattern.findall(text) actions = [] for blk in blocks: blk = blk.strip() try: actions.append(json.loads(blk)) except json.JSONDecodeError as e: print(f'解析失败: {e} | 片段: {blk[:80]}...') return actions def smart_resize(height, width, factor=32, min_pixels=32*32*4, max_pixels=32*32*1280, max_long_side=8192): def round_by_factor(number, factor): return round(number / factor) * factor def ceil_by_factor(number, factor): return math.ceil(number / factor) * factor def floor_by_factor(number, factor): return math.floor(number / factor) * factor if height < 2 or width < 2: raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}") elif max(height, width) / min(height, width) > 200: raise ValueError(f"absolute aspect ratio must be smaller than 200, got {height} / {width}") if max(height, width) > max_long_side: beta = max(height, width) / max_long_side height, width = int(height / beta), int(width / beta) h_bar = round_by_factor(height, factor) w_bar = round_by_factor(width, factor) if h_bar * w_bar > max_pixels: beta = math.sqrt((height * width) / max_pixels) h_bar = floor_by_factor(height / beta, factor) w_bar = floor_by_factor(width / beta, factor) elif h_bar * w_bar < min_pixels: beta = math.sqrt(min_pixels / (height * width)) h_bar = ceil_by_factor(height * beta, factor) w_bar = ceil_by_factor(width * beta, factor) return h_bar, w_bar # ===================== 步骤4:GUI 操作工具类 ===================== class ComputerTools: def __init__(self): self.image_info = None def load_image_info(self, path): width, height = Image.open(path).size self.image_info = (width, height) def get_screenshot(self, image_path, retry_times=3): if os.path.exists(image_path): os.remove(image_path) for i in range(retry_times): screenshot = pyautogui.screenshot() screenshot.save(image_path) if os.path.exists(image_path): self.load_image_info(image_path) return True else: time.sleep(0.1) return False def reset(self): pyautogui.hotkey('win', 'd') def press_key(self, keys): if isinstance(keys, list): cleaned_keys = [] for key in keys: if isinstance(key, str): if key.startswith("keys=["): key = key[6:] if key.endswith("]"): key = key[:-1] if key.startswith("['") or key.startswith('["'): key = key[2:] if len(key) > 2 else key if key.endswith("']") or key.endswith('"]'): key = key[:-2] if len(key) > 2 else key key = key.strip() key_map = {"arrowleft": "left", "arrowright": "right", "arrowup": "up", "arrowdown": "down"} key = key_map.get(key, key) cleaned_keys.append(key) else: cleaned_keys.append(key) keys = cleaned_keys else: keys = [keys] if len(keys) > 1: pyautogui.hotkey(*keys) else: pyautogui.press(keys[0]) def type(self, text): pyperclip.copy(text) pyautogui.keyDown('ctrl') pyautogui.keyDown('v') pyautogui.keyUp('v') pyautogui.keyUp('ctrl') def mouse_move(self, x, y): pyautogui.moveTo(x, y) time.sleep(0.1) pyautogui.moveTo(x, y) def left_click(self, x, y): pyautogui.moveTo(x, y) time.sleep(0.1) pyautogui.click() def left_click_drag(self, x, y): pyautogui.dragTo(x, y, duration=0.5) pyautogui.moveTo(x, y) def right_click(self, x, y): pyautogui.moveTo(x, y) time.sleep(0.1) pyautogui.rightClick() def middle_click(self, x, y): pyautogui.moveTo(x, y) time.sleep(0.1) pyautogui.middleClick() def double_click(self, x, y): pyautogui.moveTo(x, y) time.sleep(0.1) pyautogui.doubleClick() def triple_click(self, x, y): pyautogui.moveTo(x, y) time.sleep(0.1) pyautogui.tripleClick() def scroll(self, pixels): pyautogui.scroll(pixels) # ===================== 步骤5:完整自动化流程 ===================== def run_gui_automation(instruction, max_step=30): dashscope.api_key = os.getenv("DASHSCOPE_API_KEY") dashscope.base_http_api_url = 'https://dashscope.aliyuncs.com/api/v1' model_name = 'gui-plus-2026-02-26' computer_tools = ComputerTools() computer_tools.reset() output_dir = os.path.join(os.path.expanduser("~"), "Desktop", "gui_automation") os.makedirs(output_dir, exist_ok=True) history = [] stop_flag = False print(f"[任务] {instruction}") print("=" * 60) for step_id in range(max_step): if stop_flag: break print(f"\n[步骤 {step_id + 1}]") screen_shot = os.path.join(output_dir, f'screenshot_{step_id}.png') computer_tools.get_screenshot(screen_shot) messages = get_messages(screen_shot, instruction, history, system_prompt) response = dashscope.MultiModalConversation.call( model=model_name, messages=messages, vl_high_resolution_images=True, stream=False ) output_text = response.output.choices[0].message.content[0]['text'] print(f"[模型输出]\n{output_text}\n") action_list = extract_tool_calls(output_text) if not action_list: print("未提取到有效操作") break for action_id, action in enumerate(action_list): action_parameter = action['arguments'] action_type = action_parameter['action'] dummy_image = Image.open(screen_shot) resized_height, resized_width = smart_resize( dummy_image.height, dummy_image.width, factor=16, min_pixels=3136, max_pixels=1003520 * 200 ) for key in ['coordinate', 'coordinate1', 'coordinate2']: if key in action_parameter: action_parameter[key][0] = int(action_parameter[key][0] / 1000 * resized_width) action_parameter[key][1] = int(action_parameter[key][1] / 1000 * resized_height) if action_type in ['click', 'left_click']: computer_tools.left_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) print(f"[OK] 左键点击 ({action_parameter['coordinate'][0]}, {action_parameter['coordinate'][1]})") elif action_type == 'mouse_move': computer_tools.mouse_move(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) print(f"[OK] 移动鼠标") elif action_type == 'middle_click': computer_tools.middle_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) print(f"[OK] 中键点击") elif action_type in ['right click', 'right_click']: computer_tools.right_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) print(f"[OK] 右键点击") elif action_type in ['key', 'hotkey']: computer_tools.press_key(action_parameter['keys']) print(f"[OK] 按键 {action_parameter['keys']}") elif action_type == 'type': computer_tools.type(action_parameter['text']) print(f"[OK] 输入文本: {action_parameter['text']}") elif action_type == 'drag': computer_tools.left_click_drag(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) print(f"[OK] 拖拽") elif action_type == 'scroll': if 'coordinate' in action_parameter: computer_tools.mouse_move(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) computer_tools.scroll(action_parameter.get("pixels", 1)) print(f"[OK] 滚动 {action_parameter.get('pixels', 1)} 像素") elif action_type in ['computer_double_click', 'double_click']: computer_tools.double_click(action_parameter['coordinate'][0], action_parameter['coordinate'][1]) print(f"[OK] 双击") elif action_type == 'wait': time.sleep(action_parameter.get('time', 2)) print(f"[OK] 等待 {action_parameter.get('time', 2)} 秒") elif action_type == 'answer': print(f"[OK] 任务完成: {action_parameter.get('text', '')}") stop_flag = True break elif action_type in ['stop', 'terminate', 'done']: print(f"[OK] 任务终止: {action_parameter.get('status', 'success')}") stop_flag = True break else: print(f"未知操作类型: {action_type}") history.append({'output': output_text, 'image': screen_shot}) time.sleep(2) print("\n" + "=" * 60) print(f"[完成] 共执行 {len(history)} 步") if __name__ == '__main__': run_gui_automation( instruction='帮我打开chrome,在百度中搜索阿里巴巴', max_step=30 )
ADB Keyboard
/path/to/adb devices
macOS/Linux
sudo chmod +x /path/to/adb
/path/to/adb shell am start -a android.intent.action.MAIN -c android.intent.category.HOME
手机端完整示例代码
System Prompt
import json, os, subprocess import dashscope, time, math from PIL import Image, ImageDraw import shutil, requests from datetime import datetime mobile_system_prompt = '''# Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots. * This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc. * Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. * The screen's resolution is 1000x1000. * Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are: * `key`: Perform a key event on the mobile device. - This supports adb's `keyevent` syntax. - Examples: "volume_up", "volume_down", "power", "camera", "clear". * `click`: Click the point on the screen with coordinate (x, y). * `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds. * `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2). * `type`: Input the specified text into the activated input box. * `system_button`: Press the system button. * `open`: Open an app on the device. * `wait`: Wait specified seconds for the change to happen. * `answer`: Terminate the current task and output the answer. * `interact`: Resolve the blocking window by interacting with the user. * `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call> # Response format Response format for every step: 1) Action: a short imperative describing what to do in the UI. 2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}. Rules: - Output exactly in the order: Action, <tool_call>. - Be brief: one for Action. - Do not output anything else outside those two parts. - If finishing, use action=terminate in the tool call.'''
from datetime import datetime def get_messages(image, instruction, history_output, system_prompt): history_n = 4 current_step = len(history_output) history_start_idx = max(0, current_step - history_n) previous_actions = [] for i in range(history_start_idx): if i < len(history_output): history_output_str = history_output[i]['output'] if 'Action:' in history_output_str and '<tool_call>': history_output_str = history_output_str.split('Action:')[1].split('<tool_call>')[0].strip() previous_actions.append(f"Step {i + 1}: {history_output_str}") previous_actions_str = ( "\n".join(previous_actions) if previous_actions else "None" ) # 添加背景信息 today = datetime.today() weekday_names = ["星期一", "星期二", "星期三", "星期四", "星期五", "星期六", "星期日"] weekday = weekday_names[today.weekday()] formatted_date = today.strftime("%Y年%m月%d日") + " " + weekday ground_info = f'''今天的日期是:{formatted_date}。''' instruction_prompt = f""" Please generate the next move according to the UI screenshot, instruction and previous actions. Instruction: {ground_info}{instruction} Previous actions: {previous_actions_str}""" ## 模型调用 messages = [ { "role": "system", "content": [ {"text": system_prompt} ], } ] history_len = min(history_n, len(history_output)) if history_len > 0: for history_id, history_item in enumerate(history_output[-history_n:], 0): if history_id == 0: messages.append({ "role": "user", "content": [ {"text": instruction_prompt}, {"image": "file://" +history_item['image']} ] }) else: messages.append({ "role": "user", "content": [ {"image": "file://" +history_item['image']} ] }) messages.append({ "role": "assistant", "content": [ {"text": history_item['output']}, ] }) messages.append({ "role": "user", "content": [ {"image": "file://" +image}, ] }) else: messages.append( { "role": "user", "content": [ { "text": instruction_prompt }, { "image": "file://" +image, }, ], } ) return messages
smart_resize
import subprocess import os import time from PIL import Image class AdbTools: def __init__(self, adb_path, device=None): self.adb_path = adb_path self.device = device self.__device_str__ = f" -s {device} " if device is not None else ' ' self.image_info = None def adb_shell(self, command): command = self.adb_path + self.__device_str__ + command subprocess.run(command, capture_output=True, text=True, shell=True) ## 载入手机size def load_image_info(self, path): width, height = Image.open(path).size self.image_info = (width, height) ## 获取截图 def get_screenshot(self, image_path, retry_times=3): command = self.adb_path + (f" -s {self.device}" if self.device is not None else '') + f" exec-out screencap -p > {image_path}" for i in range(retry_times): subprocess.run(command, capture_output=True, text=True, shell=True) if os.path.exists(image_path): self.load_image_info(image_path) return True else: time.sleep(0.1) else: return False ## 点击(x,y) ## coordinate_size: 输入图片的尺寸,默认为None,则使用当前手机的尺寸, 传入为{'x': int, 'y': int} def click(self, x, y, coordinate_size=None): command = self.adb_path + self.__device_str__ + f" shell input tap {x} {y}" subprocess.run(command, capture_output=True, text=True, shell=True) def long_press(self, x, y, time=800): command = self.adb_path + self.__device_str__ + f" shell input swipe {x} {y} {x} {y} {time}" subprocess.run(command, capture_output=True, text=True, shell=True) ## 滑动从(x1,y1)->(x2,y2) ## coordinate_size: 输入图片的尺寸,默认为None,则使用当前手机的尺寸, 传入为{'x': int, 'y': int} def slide(self, x1, y1, x2, y2, coordinate_size=None, slide_time=800): command = self.adb_path + self.__device_str__ + f" shell input swipe {x1} {y1} {x2} {y2} {slide_time}" subprocess.run(command, capture_output=True, text=True, shell=True) ## 返回 def back(self): command = self.adb_path + self.__device_str__ + f" shell input keyevent 4" subprocess.run(command, capture_output=True, text=True, shell=True) # 点击Home键 def home(self): command = self.adb_path + self.__device_str__ + f" shell am start -a android.intent.action.MAIN -c android.intent.category.HOME" subprocess.run(command, capture_output=True, text=True, shell=True) ## 打字(中英均可,不确定其他语言是否可以),注意需要先在手机安装 adb 键盘 def type(self, text): escaped_text = text.replace('"', '\\"').replace("'", "\\'") command_list = [ f"shell ime enable com.android.adbkeyboard/.AdbIME ", f"shell ime set com.android.adbkeyboard/.AdbIME ", 0.1, f'shell am broadcast -a ADB_INPUT_TEXT --es msg "{escaped_text}" ', 0.1, f"shell ime disable com.android.adbkeyboard/.AdbIME" ] for command in command_list: if isinstance(command, float): time.sleep(command) elif isinstance(command, str): subprocess.run(self.adb_path + self.__device_str__ + command.strip(), capture_output=True, text=True, shell=True) def get_package_name(self, all_packages=False): try: if all_packages: command = self.adb_path + self.__device_str__ + " shell pm list packages" else: command = self.adb_path + self.__device_str__ + " shell pm list packages -3" res = subprocess.run(command, capture_output=True, text=True, shell=True) pkgs = [] for line in res.stdout.splitlines(): s = line.strip() if not s: continue # 去掉前缀 "package:" if s.startswith("package:"): s = s[len("package:"):] # 如果包含 "=",右侧才是包名 if "=" in s: _, s = s.split("=", 1) if s: pkgs.append(s) return sorted(set(pkgs)) except Exception as e: print(e) return [] def open_app(self, package_name): command = self.adb_path + self.__device_str__ + f" shell monkey -p {package_name} -c android.intent.category.LAUNCHER 1" subprocess.run(command, capture_output=True, text=True, shell=True)
com.公司名.产品名
com.tencent.mm
action=open
# 常见应用包名映射(示例,可根据需要扩展) package_str_list = '''com.tencent.mm 微信 wechat com.tencent.mobileqq qq 腾讯qq com.sina.weibo 微博 com.taobao.taobao 淘宝 com.jingdong.app.mall 京东 京东秒送 com.xunmeng.pinduoduo 拼多多 com.xingin.xhs 小红书 com.douban.frodo 豆瓣 com.zhihu.android 知乎 com.autonavi.minimap 高德地图 高德 com.baidu.BaiduMap 百度地图 com.sankuai.meituan.takeoutnew 美团外卖 com.sankuai.meituan 美团 美团外卖 com.dianping.v1 大众点评 点评 me.ele 饿了么 淘宝闪购 com.yek.android.kfc.activitys 肯德基 ctrip.android.view 携程 携程旅行 com.MobileTicket 铁路12306 12306 com.Qunar 去哪儿旅行 去哪儿网 去哪儿 com.sdu.didi.psnger 滴滴出行 滴滴 tv.danmaku.bili bilibili b站 哔哩哔哩 哔站 bili com.ss.android.ugc.aweme 抖音 com.smile.gifmaker 快手 com.tencent.qqlive 腾讯视频 com.qiyi.video 爱奇艺 com.youku.phone 优酷 优酷视频 com.hunantv.imgo.activity 芒果tv 芒果 com.phoenix.read 红果短剧 红果 com.netease.cloudmusic 网易云音乐 网易云 com.tencent.qqmusic qq音乐 com.luna.music 汽水音乐 com.ximalaya.ting.android 喜马拉雅 com.dragon.read 番茄免费小说 番茄小说 com.kmxs.reader 七猫免费小说 com.ss.android.lark 飞书 com.tencent.androidqqmail qq邮箱 com.larus.nova 豆包 豆包 com.gotokeep.keep keep com.lingan.seeyou 美柚 com.tencent.news 腾讯新闻 com.ss.android.article.news 今日头条 com.lianjia.beike 贝壳找房 com.anjuke.android.app 安居客 com.hexin.plat.android 同花顺 com.miHoYo.hkrpg 星穹铁道 崩坏 com.papegames.lysk.cn 恋与深空 com.android.settings settings androidsystemsettings com.android.soundrecorder audiorecorder com.rammigsoftware.bluecoins bluecoins com.flauschcode.broccoli broccoli com.booking booking com.android.chrome 谷歌浏览器 googlechrome chrome com.android.deskclock 时钟 闹钟 clock com.android.contacts contacts com.duolingo duolingo 多邻国 com.expedia.bookings expedia com.android.fileexplorer files filemanager com.google.android.gm gmail googlemail com.google.android.apps.nbu.files googlefiles filesbygoogle com.google.android.calendar googlecalendar com.google.android.apps.dynamite googlechat com.google.android.deskclock googleclock com.google.android.contacts googlecontacts com.google.android.apps.docs.editors.docs googledocs com.google.android.apps.docs googledrive com.google.android.apps.fitness googlefit com.google.android.keep googlekeep com.google.android.apps.maps googlemaps com.google.android.apps.books googleplaybooks com.android.vending googleplaystore com.google.android.apps.docs.editors.slides googleslides com.google.android.apps.tasks googletasks net.cozic.joplin joplin com.mcdonalds.app 麦当劳 mcdonald net.osmand osmand com.Project100Pi.themusicplayer pimusicplayer com.quora.android quora com.reddit.frontpage reddit code.name.monkey.retromusic retromusic com.scientificcalculatorplus.simplecalculator.basiccalculator.mathcalc simplecalendarpro com.simplemobiletools.smsmessenger simplesmsmessenger org.telegram.messenger telegram com.einnovation.temu temu com.zhiliaoapp.musically tiktok com.twitter.android twitter x org.videolan.vlc vlc com.whatsapp whatsapp com.taobao.movie.android 淘票票 com.tongcheng.android 同程旅行 同程 com.sankuai.movie 猫眼 com.wuba.zhuanzhuan 转转 com.tencent.weread 微信读书 com.taobao.idlefish 闲鱼 com.wudaokou.hippo 盒马 com.eg.android.AlipayGphone 支付宝 com.jd.jrapp 京东金融 com.achievo.vipshop 唯品会 com.smzdm.client.android 什么值得买 cn.kuwo.player 酷我音乐 com.taobao.trip 飞猪 飞猪旅行 com.jingdong.pdj 京东到家 com.tencent.map 腾讯地图 com.shizhuang.duapp 得物 cn.damai 大麦 大麦网 com.ss.android.auto 懂车帝 com.cubic.autohome 汽车之家 com.wuba 58同城 五八同城 com.android.calendar 日历 com.alibaba.android.rimet 钉钉 com.meituan.retail.v.android 小象超市 com.aliyun.tongyi 通义 千问 通义千问 com.hupu.games 虎扑 虎扑体育 com.quark.browser 夸克 夸克浏览器 com.yuantiku.tutor 猿辅导 com.tencent.mtt qq浏览器 com.umetrip.android.msky.app 航旅纵横 com.UCMobile UC浏览器 com.ss.android.ugc.aweme.lite 抖音极速版 抖音 air.tv.douyu.android 斗鱼 com.tencent.hunyuan.app.chat 元宝 com.baidu.searchbox 百度 com.lemon.lv 剪映 cn.soulapp.android soul com.baidu.netdisk 百度网盘 com.tmri.app.main 交管12123 12123 com.kugou.android 酷狗 酷狗音乐 com.ss.android.lark 飞书 com.tencent.android.qqdownloader 应用宝 com.mt.mtxx.mtxx 美图 美图秀秀 com.tencent.karaoke 全民k歌 com.intsig.camscanner 扫描全能王 com.android.bankabc 农业银行 农行 cmb.pb 招商银行 招行 com.ganji.android.haoche_c 瓜子二手车 瓜子 com.sf.activity 顺丰 顺丰快递 顺丰速运 com.ziroom.ziroomcustomer 自如 com.yumc.phsuperapp 必胜客 cn.dominos.pizza 达美乐披萨 达美乐 cn.wps.moffice_eng WPS Office WPS com.mfw.roadbook 马蜂窝 com.moonshot.kimichat kimi com.tencent.wemeet.app 腾讯会议 com.deepseek.chat deepseek com.spdbccc.app 浦发银行 cn.samsclub.app 山姆超市 山姆 山姆会员商店 山姆会员店 com.tencent.qqsports 腾讯体育 com.hanweb.android.zhejiang.activity 浙里办 com.ss.android.article.video 西瓜视频 com.taou.maimai 脉脉 ''' PACKAGES_NAME_DICT = {} NAME_PACKAGE_DICT = {} def normalize_package_name(name): name = name.lower().strip().replace(" ", "").replace("-", "") return name for package_str in package_str_list.split("\n"): package_name = package_str.strip().split("\t") PACKAGES_NAME_DICT[package_name[0]] = [normalize_package_name(i) for i in package_name[1:]] for name in package_name[1:]: name = normalize_package_name(name) if name not in NAME_PACKAGE_DICT: NAME_PACKAGE_DICT[name] = [package_name[0]] else: NAME_PACKAGE_DICT[name].append(package_name[0])
pip install playwright pillow dashscope playwright-stealth termcolor
playwright install chromium
浏览器端完整示例代码
调用模型
gui-plus
"""# Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {"type": "function", "function": {"name": "computer_use", "description": "Use a mouse and keyboard to interact with a computer, and take screenshots.\\n* This is an interface to a desktop GUI. You do not have access to a terminal or applications menu. You must click on desktop icons to start applications.\\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions. E.g. if you click on Firefox and a window doesn't open, try wait and taking another screenshot.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `key`: Performs key down presses on the arguments passed in order, then performs key releases in reverse order.\\n* `type`: Type a string of text on the keyboard.\\n* `mouse_move`: Move the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `left_click`: Click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `left_click_drag`: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.\\n* `right_click`: Click the right mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `middle_click`: Click the middle mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `double_click`: Double-click the left mouse button at a specified (x, y) pixel coordinate on the screen.\\n* `triple_click`: Triple-click the left mouse button at a specified (x, y) pixel coordinate on the screen (simulated as double-click since it's the closest action).\\n* `scroll`: Performs a scroll of the mouse scroll wheel.\\n* `hscroll`: Performs a horizontal scroll (mapped to regular scroll).\\n* `wait`: Wait specified seconds for the change to happen.\\n* `terminate`: Terminate the current task and report its completion status.\\n* `answer`: Answer a question.\\n* `interact`: Resolve the blocking window by interacting with the user.", "enum": ["key", "type", "mouse_move", "left_click", "left_click_drag", "right_click", "middle_click", "double_click", "triple_click", "scroll", "hscroll", "wait", "terminate", "answer", "interact"], "type": "string"}, "keys": {"description": "Required only by `action=key`.", "type": "array"}, "text": {"description": "Required only by `action=type`, `action=answer` and `action=interact`.", "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=mouse_move` and `action=left_click_drag`.", "type": "array"}, "pixels": {"description": "The amount of scrolling to perform. Positive values scroll up, negative values scroll down. Required only by `action=scroll` and `action=hscroll`.", "type": "number"}, "time": {"description": "The seconds to wait. Required only by `action=wait`.", "type": "number"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}}} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call> # Response format Response format for every step: 1) Action: a short imperative describing what to do in the UI. 2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}. Rules: - Output exactly in the order: Action, <tool_call>. - Be brief: one for Action. - Do not output anything else outside those two parts. - If finishing, use action=terminate in the tool call."""
'''# Tools You may call one or more functions to assist with the user query. You are provided with function signatures within <tools></tools> XML tags: <tools> {"type": "function", "function": {"name_for_human": "mobile_use", "name": "mobile_use", "description": "Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n - This supports adb's `keyevent` syntax.\n - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `answer`: Terminate the current task and output the answer.\n* `interact`: Resolve the blocking window by interacting with the user.\n* `terminate`: Terminate the current task and report its completion status.", "enum": ["key", "click", "long_press", "swipe", "type", "system_button", "open", "wait", "answer", "interact", "terminate"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.", "type": "array"}, "coordinate2": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.", "type": "array"}, "text": {"description": "Required only by `action=key`, `action=type`, `action=open`, `action=answer`,and `action=interact`.", "type": "string"}, "time": {"description": "The seconds to wait. Required only by `action=long_press` and `action=wait`.", "type": "number"}, "button": {"description": "Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`", "enum": ["Back", "Home", "Menu", "Enter"], "type": "string"}, "status": {"description": "The status of the task. Required only by `action=terminate`.", "type": "string", "enum": ["success", "failure"]}}, "required": ["action"], "type": "object"}, "args_format": "Format the arguments as a JSON object."}} </tools> For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags: <tool_call> {"name": <function-name>, "arguments": <args-json-object>} </tool_call> # Response format Response format for every step: 1) Action: a short imperative describing what to do in the UI. 2) A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}. Rules: - Output exactly in the order: Action, <tool_call>. - Be brief: one for Action. - Do not output anything else outside those two parts. - If finishing, use action=terminate in the tool call.'''