Qwen3-VL-Plus - GUI Interaction (Local Images)

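I. Introduction

Following up on the previous post, "Qwen3-VL-Plus - GUI Interaction" (CSDN blog), I reworked the code so that it can also recognize local images.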

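Overall approach

- Support local images: add a local image path parameter and convert the local image, via Base64 encoding, into a format that GUI-Plus accepts.
- Keep the original logic: the non-streaming "text + network image URL" call still works, so the existing endpoint stays compatible.
- Add SSE streaming output: use the streaming call of the GUI-Plus model and push results to the client in real time over SSE.
- Fix existing problems: the inconsistent API key usage, the Base64 encoding error, the hard-to-read prompt, and the null-pointer risks.
- Unify exception handling: add a catch-all exception handler so the endpoints fail gracefully.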

II. Code changes

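1. Request class (local image support + original fields)

    package gzj.spring.ai.Request;

    import io.swagger.v3.oas.annotations.media.Schema;
    import lombok.Data;

    /**
     * Request body for GUI-Plus operation parsing.
     *
     * @author DELL
     */
    @Data
    @Schema(description = "Request parameters for GUI-Plus operation parsing")
    public class OparetionRequest {

        @Schema(description = "Natural-language instruction from the user (e.g. click the Chrome icon on the desktop)", required = true)
        private String text;

        @Schema(description = "URL of a network image (provide either this or localImagePath)")
        private String imageUrl;

        @Schema(description = "Absolute path of a local image (e.g. E:\\test.png; provide either this or imageUrl)")
        private String localImagePath;
    }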

2. Service interface (new streaming method)

    package gzj.spring.ai.Service;

    import com.alibaba.dashscope.exception.ApiException;
    import com.alibaba.dashscope.exception.NoApiKeyException;
    import com.alibaba.dashscope.exception.UploadFileException;
    import gzj.spring.ai.Request.OparetionRequest;
    import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

    import java.io.IOException;

    /**
     * @author DELL
     */
    public interface OparetionService {

        /**
         * Non-streaming call (returns the result synchronously).
         */
        String operation(OparetionRequest request)
                throws ApiException, NoApiKeyException, UploadFileException, IOException;

        /**
         * SSE streaming call (pushes results in real time).
         */
        SseEmitter streamOperation(OparetionRequest request);
    }

3. Service implementation (core changes: local images + streaming + original logic)

    package gzj.spring.ai.Service.ServiceImpl;

    import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversation;
    import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationParam;
    import com.alibaba.dashscope.aigc.multimodalconversation.MultiModalConversationResult;
    import com.alibaba.dashscope.common.MultiModalMessage;
    import com.alibaba.dashscope.common.Role;
    import com.alibaba.dashscope.exception.ApiException;
    import com.alibaba.dashscope.exception.NoApiKeyException;
    import com.alibaba.dashscope.exception.UploadFileException;
    import gzj.spring.ai.Request.OparetionRequest;
    import gzj.spring.ai.Service.OparetionService;
    import io.reactivex.Flowable;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.stereotype.Service;
    import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.Base64;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;

    /**
     * @author DELL
     */
    @Service
    public class OparetionServiceImpl implements OparetionService {

        // Use an SLF4J logger owned by this class instead of a static import from an unrelated class
        private static final Logger log = LoggerFactory.getLogger(OparetionServiceImpl.class);

        @Value("${spring.ai.dashscope.api-key}")
        private String apiKey;

        // The model name is configurable so it is easy to switch
        @Value("${spring.ai.dashscope.modelV2:gui-plus}")
        private String modelName;

        /**
         * Helper: convert a local image to Base64 with the data:image prefix (a format GUI-Plus accepts).
         */
        private String encodeLocalImageToBase64(String localPath) throws IOException {
            Path imagePath = Paths.get(localPath);
            // Check that the file exists
            if (!Files.exists(imagePath)) {
                throw new IOException("Local image does not exist: " + localPath);
            }
            // Read the file and Base64-encode it (fixes the earlier encoding error)
            byte[] imageBytes = Files.readAllBytes(imagePath);
            String base64Str = Base64.getEncoder().encodeToString(imageBytes);
            // Derive the image format from the file suffix
            String suffix = localPath.substring(localPath.lastIndexOf('.') + 1).toLowerCase();
            if ("jpg".equals(suffix)) {
                suffix = "jpeg"; // the MIME type is image/jpeg, not image/jpg
            } else if (!Arrays.asList("png", "jpeg").contains(suffix)) {
                suffix = "png"; // default to PNG
            }
            return String.format("data:image/%s;base64,%s", suffix, base64Str);
        }

        /**
         * Helper: build the image content (priority: local image > network URL).
         */
        private String buildImageContent(OparetionRequest request) throws IOException {
            if (request.getLocalImagePath() != null && !request.getLocalImagePath().isEmpty()) {
                log.info("Using local image: {}", request.getLocalImagePath());
                return encodeLocalImageToBase64(request.getLocalImagePath());
            } else if (request.getImageUrl() != null && !request.getImageUrl().isEmpty()) {
                log.info("Using network image URL: {}", request.getImageUrl());
                return request.getImageUrl();
            } else {
                throw new IllegalArgumentException(
                        "Either imageUrl (network image) or localImagePath (local image) is required");
            }
        }

        /**
         * Build the core GUI-Plus prompt (a text block, for readability).
         */
        private String buildSystemPrompt() {
            return """
                    ## 1. Core Role
                    You are a top-tier AI visual operation agent. Your task is to analyze a screenshot of the computer screen, understand the user's instruction, and break the task down into single, precise atomic GUI actions.

                    ## 1.1 Environment
                    - [R1] User's desktop: the display resolution is 1920×1080 at 125% scaling.
                    - [R2] User's desktop: the user has two displays; only consider the content and coordinates of the primary display (display 1).

                    ## 2. [CRITICAL] JSON Schema & Absolute Rules
                    Your output must be a JSON object that strictly follows the rules below. Any deviation will cause a failure.
                    - [R1] Strict JSON: the reply must be one JSON object and nothing else; do not add any text, comments, or explanations.
                    - [R2] thought field: describe your reasoning in one sentence in the thought field (e.g. the user wants to open Chrome, I can see the desktop icon, so I click it).
                    - [R3] Exact action value: the action field must be an uppercase string (CLICK/TYPE/SCROLL/KEY_PRESS/FINISH/FAIL) with no spaces or case errors.
                    - [R4] Strict parameters structure: the parameters object must match the template of the chosen action exactly (key names and value types).

                    ## 3. Available Actions
                    ### CLICK
                    - Function: single-click on the screen.
                    - Parameters template: { "x": <integer>, "y": <integer>, "description": "<string, optional: describes the click target>" }
                    ### TYPE
                    - Function: type text.
                    - Parameters template: { "text": "<string>", "needs_enter": <boolean> }
                    ### SCROLL
                    - Function: scroll the window.
                    - Parameters template: { "direction": "<'up' or 'down'>", "amount": "<'small', 'medium', or 'large'>" }
                    ### KEY_PRESS
                    - Function: press a function key.
                    - Parameters template: { "key": "<string: e.g., 'enter', 'esc', 'alt+f4'>" }
                    ### FINISH
                    - Function: the task completed successfully.
                    - Parameters template: { "message": "<string: summary of the completed task>" }
                    ### FAIL
                    - Function: the task cannot be completed.
                    - Parameters template: { "reason": "<string: clear explanation of the failure>" }

                    ## 4. Thinking and Decision Framework
                    Before generating each action, strictly follow this think-and-verify process:
                    1. Goal analysis: what is the user's final goal?
                    2. Grounded observation: analyze the screenshot carefully. Your decision must be based on visual evidence present in the screenshot. If you cannot see an element, you cannot interact with it.
                    3. Action decision: based on the goal and the visible elements, choose the most suitable action.
                    4. Build the output:
                       a. Record your reasoning in the thought field.
                       b. Choose one action.
                       c. Copy that action's parameters template exactly and fill in the values.
                    5. Final verification (self-correction): before replying, check once more:
                       - Is my reply pure JSON?
                       - Is the action value correct (uppercase, no spaces)?
                       - Does the parameters structure match the template 100%? For example, for CLICK, are there separate x and y keys, and are both values integers?
                    """;
        }

        /**
         * Helper: validate the request and build the system + user messages (image + text).
         */
        private List<MultiModalMessage> buildMessages(OparetionRequest request) throws IOException {
            if (request.getText() == null || request.getText().isEmpty()) {
                throw new IllegalArgumentException("The user instruction (text) must not be empty");
            }
            MultiModalMessage systemMsg = MultiModalMessage.builder()
                    .role(Role.SYSTEM.getValue())
                    .content(Collections.singletonList(Collections.singletonMap("text", buildSystemPrompt())))
                    .build();
            String imageContent = buildImageContent(request);
            MultiModalMessage userMessage = MultiModalMessage.builder()
                    .role(Role.USER.getValue())
                    .content(Arrays.asList(
                            Collections.singletonMap("image", imageContent),
                            Collections.singletonMap("text", request.getText())
                    )).build();
            return Arrays.asList(systemMsg, userMessage);
        }

        /**
         * Helper: null-safe extraction of the first text part; returns null if there is none.
         */
        private String extractText(MultiModalConversationResult result) {
            if (result == null || result.getOutput() == null
                    || result.getOutput().getChoices() == null
                    || result.getOutput().getChoices().isEmpty()) {
                return null;
            }
            List<Map<String, Object>> content =
                    result.getOutput().getChoices().get(0).getMessage().getContent();
            if (content == null || content.isEmpty()) {
                return null;
            }
            Object text = content.get(0).get("text");
            return text != null ? text.toString() : null;
        }

        /**
         * Non-streaming call (original logic kept, local images supported).
         */
        @Override
        public String operation(OparetionRequest request)
                throws ApiException, NoApiKeyException, UploadFileException, IOException {
            MultiModalConversation conv = new MultiModalConversation();

            // Build the request parameters (fixes the inconsistent API key usage)
            MultiModalConversationParam param = MultiModalConversationParam.builder()
                    .apiKey(apiKey) // always use the API key from the configuration
                    .model(modelName)
                    .messages(buildMessages(request))
                    .build();

            // Synchronous call + null-safe result parsing
            String resText = extractText(conv.call(param));
            if (resText == null) {
                log.warn("GUI-Plus returned an empty result");
                return "{}"; // return empty JSON so the front end does not fail while parsing
            }
            log.info("GUI-Plus non-streaming call finished, result: {}", resText);
            return resText;
        }

        /**
         * New: SSE streaming call (pushes results in real time).
         */
        @Override
        public SseEmitter streamOperation(OparetionRequest request) {
            // SSE timeout: 30 seconds
            SseEmitter emitter = new SseEmitter(30000L);
            emitter.onTimeout(() -> handleEmitterError(emitter, "SSE connection timed out (30 s)"));
            emitter.onCompletion(() -> log.info("SSE connection closed"));

            // Run the streaming call on its own thread so the request thread is not blocked
            new Thread(() -> {
                MultiModalConversation conv = new MultiModalConversation();
                try {
                    MultiModalConversationParam param = MultiModalConversationParam.builder()
                            .apiKey(apiKey)
                            .model(modelName)
                            .messages(buildMessages(request))
                            .incrementalOutput(true) // incremental output is what makes the call stream
                            .build();

                    Flowable<MultiModalConversationResult> resultFlow = conv.streamCall(param);
                    // A failed send propagates out of blockingForEach and stops the stream,
                    // instead of continuing to push to an emitter that is already closed
                    resultFlow.blockingForEach(item -> {
                        String text = extractText(item);
                        if (text != null) {
                            // Push one chunk (event name: message)
                            emitter.send(SseEmitter.event().name("message").data(text));
                            log.debug("Pushed stream chunk: {}", text);
                        }
                    });

                    // End-of-stream marker
                    emitter.send(SseEmitter.event().name("complete").data("Stream finished"));
                    emitter.complete();
                    log.info("GUI-Plus streaming call finished");
                } catch (IOException e) {
                    log.error("Failed to read the local image", e);
                    handleEmitterError(emitter, "Failed to read the local image: " + e.getMessage());
                } catch (ApiException | NoApiKeyException | UploadFileException e) {
                    log.error("GUI-Plus API call failed", e);
                    handleEmitterError(emitter, "API call failed: " + e.getMessage());
                } catch (IllegalArgumentException e) {
                    log.error("Invalid request parameters", e);
                    handleEmitterError(emitter, "Invalid parameters: " + e.getMessage());
                } catch (Exception e) {
                    log.error("Unexpected error during the streaming call", e);
                    handleEmitterError(emitter, "System error: " + e.getMessage());
                }
            }).start();

            return emitter;
        }

        /**
         * Helper: single place for SSE error handling.
         */
        private void handleEmitterError(SseEmitter emitter, String errorMsg) {
            try {
                emitter.send(SseEmitter.event().name("error").data(errorMsg));
                emitter.completeWithError(new RuntimeException(errorMsg));
            } catch (Exception e) {
                log.error("Failed to report the error through the SSE emitter", e);
            }
        }
    }

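4. Controller layer (new streaming endpoint, original endpoint kept)

    package gzj.spring.ai.Controller;

    import com.alibaba.dashscope.exception.NoApiKeyException;
    import com.alibaba.dashscope.exception.UploadFileException;
    import gzj.spring.ai.Request.OparetionRequest;
    import gzj.spring.ai.Service.OparetionService;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.http.HttpStatus;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.*;
    import org.springframework.web.servlet.mvc.method.annotation.SseEmitter;

    import java.io.IOException;

    /**
     * @author DELL
     */
    @RestController
    @RequestMapping("/api/Operation")
    @CrossOrigin // CORS support
    public class OperationController {

        private static final Logger log = LoggerFactory.getLogger(OperationController.class);

        private final OparetionService oparetionService;

        public OperationController(OparetionService oparetionService) {
            this.oparetionService = oparetionService;
        }

        /**
         * Original endpoint: non-streaming call.
         */
        @RequestMapping("/operation/easy")
        public String oparetion(@RequestBody OparetionRequest request)
                throws NoApiKeyException, UploadFileException, IOException {
            return oparetionService.operation(request);
        }

        /**
         * New endpoint: SSE streaming call (pushes results in real time).
         */
        @PostMapping("/operation/stream")
        public SseEmitter streamOperation(@RequestBody OparetionRequest request) {
            log.info("Received SSE streaming request: {}", request);
            return oparetionService.streamOperation(request);
        }

        /**
         * Invalid request parameters are the caller's mistake, so answer with 400 rather than 500.
         */
        @ExceptionHandler(IllegalArgumentException.class)
        public ResponseEntity<String> badRequestHandler(IllegalArgumentException e) {
            log.warn("Invalid request: {}", e.getMessage());
            return ResponseEntity.status(HttpStatus.BAD_REQUEST)
                    .body("Invalid parameters: " + e.getMessage());
        }

        /**
         * Catch-all exception handler (optional, gives the caller a friendlier message).
         * It only covers this controller; a @RestControllerAdvice class would apply application-wide.
         */
        @ExceptionHandler(Exception.class)
        public ResponseEntity<String> globalExceptionHandler(Exception e) {
            log.error("Unhandled exception in the endpoint", e);
            return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                    .body("Internal server error: " + e.getMessage());
        }
    }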
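The streaming endpoint is a POST that takes a JSON body, while the browser's native EventSource can only issue GET requests, so a client usually has to read the response body itself and split it into events. The sketch below shows one way to parse such a body in Java. The class `SseFrameParser` and its nested `SseEvent` type are hypothetical names for illustration; only the event names `message`, `complete`, and `error` come from the service code, and the parser covers just the `event` and `data` fields.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative parser for a text/event-stream body (hypothetical helper, not part of the project). */
final class SseFrameParser {

    /** One parsed event: its name and its data (multi-line data is joined with '\n'). */
    static final class SseEvent {
        final String name;
        final String data;

        SseEvent(String name, String data) {
            this.name = name;
            this.data = data;
        }
    }

    /** Splits the raw body into events; a blank line terminates each event. */
    static List<SseEvent> parse(String body) {
        List<SseEvent> events = new ArrayList<>();
        String name = "message"; // default event name when no "event:" line is present
        StringBuilder data = new StringBuilder();
        boolean hasData = false;

        for (String line : body.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                // End of one event: emit it only if it carried data, then reset the state
                if (hasData) {
                    events.add(new SseEvent(name, data.toString()));
                }
                name = "message";
                data.setLength(0);
                hasData = false;
                continue;
            }
            if (line.startsWith(":")) {
                continue; // comment line, often used as a keep-alive
            }
            int colon = line.indexOf(':');
            String field = colon < 0 ? line : line.substring(0, colon);
            String value = colon < 0 ? "" : line.substring(colon + 1);
            if (value.startsWith(" ")) {
                value = value.substring(1); // a single leading space is not part of the value
            }
            if ("event".equals(field)) {
                name = value;
            } else if ("data".equals(field)) {
                if (hasData) {
                    data.append('\n');
                }
                data.append(value);
                hasData = true;
            }
        }
        return events;
    }
}
```

Because the service sends incremental output, a client would concatenate the data of all `message` events and only try to parse the JSON once the `complete` event arrives.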

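III. Summary

Key changes:

| Change | Original problem | Fix |
| --- | --- | --- |
| Local image support | Only network URLs were supported | Add the localImagePath parameter and convert the image, via Base64 encoding, into a format GUI-Plus accepts |
| SSE streaming output | No streaming capability | Use the SDK's streamCall for the streaming call and push results in real time through SseEmitter |
| Prompt readability | One very long string with \n escapes | Use a Java text block (""") and lay the prompt out in a structured way |
| API key usage | The key was injected from the configuration but not used; the environment variable was read instead | Always use the API key from the configuration file; an environment variable can still override it at deployment time |
| Null-pointer risk | Chained calls without any checks | Add null checks on key objects such as result and content to avoid NPEs |
| Exception handling | Raw exceptions were thrown directly | Add SSE error handling and a catch-all exception handler in the controller that return friendly messages |
| Model configuration | gui-plus was hard-coded | Move the model name into the configuration file so versions or models are easy to switch |
| Encoding error | The local image was not Base64-encoded | Fix the encodeLocalImageToBase64 method so it produces a correct Base64 string with the prefix |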
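The null-pointer row can be illustrated with a small helper that works on the same `List<Map<String, Object>>` shape the SDK returns for message content. This is only a sketch: `ContentText` and `firstText` are hypothetical names, and unlike the service code, which reads the first part only, it scans every part so that a part without a `text` key is also covered.

```java
import java.util.List;
import java.util.Map;

/** Illustrative null-safe accessor for multimodal message content (hypothetical helper). */
final class ContentText {

    /** Fallback returned when no text is available, so callers can still parse the value as JSON. */
    static final String EMPTY_JSON = "{}";

    /** Returns the first non-null "text" entry, or "{}" if the content is null, empty, or has no text. */
    static String firstText(List<Map<String, Object>> content) {
        if (content == null) {
            return EMPTY_JSON;
        }
        for (Map<String, Object> part : content) {
            if (part == null) {
                continue; // skip missing parts instead of dereferencing them
            }
            Object text = part.get("text");
            if (text != null) {
                return text.toString();
            }
        }
        return EMPTY_JSON;
    }
}
```

Returning "{}" follows the article's choice of handing the front end an empty JSON object rather than null.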

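IV. Notes

- The local image path must be an absolute path, and the application needs read permission for the file. On Windows either \ or / works as the path separator; inside a JSON request body a backslash has to be escaped as \\.
- The streaming call requires SSE support on the front end. The native EventSource API only sends GET requests, so the POST endpoint here has to be consumed with a streaming HTTP client (for example fetch with a stream reader). In cross-origin setups, make sure the back-end CORS configuration is correct.
- Inject the API key through an environment variable (e.g. DASHSCOPE_API_KEY) rather than hard-coding it in the configuration file.
- Base64 encoding increases the size of a local image by about 33% (every 3 bytes become 4 characters), so keep the image small (e.g. ≤ 5 MB).
- To support more image formats (such as webp), extend the suffix check in the encodeLocalImageToBase64 method.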
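The last two notes can be sketched together: a size check before encoding and a suffix table that includes webp. Everything here is illustrative. `ImageDataUrl`, `toDataUrl`, and `encodedLength` are hypothetical names, the 5 MB limit is the article's suggestion rather than a documented model limit, and whether GUI-Plus accepts a given format should be checked against the DashScope documentation. Unlike the service code, which falls back to PNG, this variant rejects unknown suffixes.

```java
import java.util.Base64;
import java.util.Locale;
import java.util.Map;

/** Illustrative data-URL builder with a size limit and a suffix-to-MIME table (hypothetical helper). */
final class ImageDataUrl {

    /** Size limit suggested in the article; an assumption, not a documented model limit. */
    static final long MAX_BYTES = 5L * 1024 * 1024;

    private static final Map<String, String> MIME_BY_SUFFIX = Map.of(
            "png", "image/png",
            "jpg", "image/jpeg",
            "jpeg", "image/jpeg",
            "webp", "image/webp");

    /** Length of the padded Base64 text for rawBytes input bytes: 4 characters per 3-byte group. */
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    /** Builds "data:<mime>;base64,<payload>" or throws if the file is too large or the suffix is unknown. */
    static String toDataUrl(String fileName, byte[] bytes) {
        if (bytes.length > MAX_BYTES) {
            throw new IllegalArgumentException("Image too large: " + bytes.length + " bytes");
        }
        int dot = fileName.lastIndexOf('.');
        String suffix = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String mime = MIME_BY_SUFFIX.get(suffix);
        if (mime == null) {
            throw new IllegalArgumentException("Unsupported image format: " + suffix);
        }
        return "data:" + mime + ";base64," + Base64.getEncoder().encodeToString(bytes);
    }
}
```

Rejecting an unknown suffix makes a wrong file type visible to the caller early; defaulting to PNG, as the service does, is more lenient but can label the payload with the wrong MIME type.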

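V. Example

The returned result shows that the JSON is incomplete: the x value array is truncated, the y value is missing, and the braces are not closed. There are three main likely causes:

- Truncated model output: GUI-Plus has a limited default output length, and because the max_tokens parameter was not configured, a long JSON reply gets cut off.
- Insufficient prompt constraints: the original prompt was not explicit enough about JSON completeness and required parameters (for example, CLICK must have integer x and y values), so the model omitted fields.
- Prompt formatting problems: the escaping and layout of the JSON templates in the original prompt were messy, so the model misread the rules and produced incomplete JSON.

Because of space and time constraints, I will cover the fixes for these problems in the next article.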
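Whatever the cause turns out to be, the caller can at least detect a truncated reply before acting on it. The sketch below is a structural pre-check only, under the assumption that the reply should be a single JSON object: it verifies that braces and brackets are balanced outside of string literals. `JsonCompleteness` and `looksComplete` are hypothetical names, and this is not a validating parser; a JSON library would still be needed to check field names and value types such as integer x and y.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative structural check for a possibly truncated JSON object (hypothetical helper). */
final class JsonCompleteness {

    /** Returns true if s is one object whose braces and brackets all close, ignoring string contents. */
    static boolean looksComplete(String s) {
        if (s == null) {
            return false;
        }
        String t = s.trim();
        if (!t.startsWith("{")) {
            return false;
        }
        Deque<Character> open = new ArrayDeque<>();
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;      // the character after a backslash is taken literally
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                open.push(c);
            } else if (c == '}' || c == ']') {
                char expected = (c == '}') ? '{' : '[';
                if (open.isEmpty() || open.pop() != expected) {
                    return false;         // closing delimiter without a matching opener
                }
            }
            if (open.isEmpty() && i < t.length() - 1) {
                return false;             // extra text after the object has closed
            }
        }
        return !inString && open.isEmpty();
    }
}
```

A reply that fails this check could be logged and retried, or reported as a failure, instead of being passed on as a click coordinate.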

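If you found these changes useful and the summary clear, please leave a like 👍 and follow. I will keep sharing practical tips on wrapping AI APIs and improving code. Your support is my biggest motivation to keep writing. See you in the next post 🌟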
