| Cite this article: | 郭阿欣,周圆,霍树伟,李硕士.智能体引导的视频重定位网络[J].哈尔滨工业大学学报,2026,58(3):120.DOI:10.11918/202308059 |
| GUO Axin, ZHOU Yuan, HUO Shuwei, LI Shuoshi. Agent-guided video re-localization network[J]. Journal of Harbin Institute of Technology, 2026, 58(3): 120. DOI:10.11918/202308059 |
|
| 摘要: |
| 视频重定位的目标是在未经剪辑的参考视频中定位与给定查询视频语义相关的片段。这项任务不仅满足用户的实际浏览需求,而且在多种应用场景中发挥着重要作用。由于视频相较于图像、文本等其他数据类型包含更丰富的信息,因此在长视频中准确识别目标片段并确定其时间边界具有较大挑战。将视频重定位任务视为一个序贯决策过程,应用强化学习实现高效且准确的定位。具体而言,提出智能体引导的定位网络(AGLN),通过训练智能体基于学习到的策略逐步执行动作,细化定位片段的时间边界,从而找到与查询视频最相关的片段。此外,AGLN融合强化学习与监督学习,构建多任务学习框架,助力智能体更有效地探索环境并学习最优策略。在ActivityNet-VRL数据集上的实验结果表明,AGLN在视频重定位任务上的表现优于现有方法,其检索平均准确率达到了25.9%,相较于目前最佳方法提高了0.2个百分点。 |
| 关键词: 视频重定位 强化学习 智能体 监督学习 多任务学习 |
| DOI:10.11918/202308059 |
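| CLC number: TP391.4 |
| Document code: A |
| Funding: National Key Research and Development Program of China (2020YFC1523204); National Natural Science Foundation of China (62171320, U2006211) |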
|
| Agent-guided video re-localization network |
|
| GUO Axin, ZHOU Yuan, HUO Shuwei, LI Shuoshi |
| (School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China) |
|
| Abstract: |
| Video re-localization aims to localize, in an untrimmed reference video, the moment that semantically corresponds to a given query video. This task meets users' practical browsing needs and plays an important role in a range of application scenarios. Because videos carry richer information than other data types such as images and text, accurately identifying the target moment in a long video and determining its temporal boundaries is highly challenging. This paper treats video re-localization as a sequential decision-making process and applies reinforcement learning to achieve efficient and accurate localization. Specifically, it proposes an agent-guided localization network (AGLN), which trains an agent to execute actions step by step according to a learned policy, progressively refining the temporal boundaries of the localized moment until it finds the moment most relevant to the query video. In addition, AGLN combines reinforcement learning with supervised learning in a multi-task learning framework, which helps the agent explore the environment more effectively and learn the optimal policy. Experimental results on the ActivityNet-VRL dataset show that AGLN outperforms existing methods on the video re-localization task: its average retrieval accuracy reaches 25.9%, 0.2 percentage points higher than the best existing method. |
| Key words: video re-localization; reinforcement learning; agent; supervised learning; multi-task learning |
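
The idea of treating localization as a sequence of boundary-adjusting actions can be illustrated with a small sketch. This is not the AGLN implementation: the abstract does not specify the action set, the state representation, or the reward, so the six actions, the fixed step size, and the greedy rule below are all assumptions made for illustration. In particular, the greedy rule reads the ground-truth segment directly, so it only stands in for a learned policy and mimics the kind of temporal-IoU signal such a policy might be trained on.

```python
from typing import Callable, Dict, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def make_actions(step: float) -> Dict[str, Callable[[Segment], Segment]]:
    """Hypothetical action set: shift or resize the window by `step` seconds."""
    return {
        "shift_left": lambda s: (s[0] - step, s[1] - step),
        "shift_right": lambda s: (s[0] + step, s[1] + step),
        "extend_start": lambda s: (s[0] - step, s[1]),
        "shrink_start": lambda s: (s[0] + step, s[1]),
        "extend_end": lambda s: (s[0], s[1] + step),
        "shrink_end": lambda s: (s[0], s[1] - step),
    }


def clip(seg: Segment, duration: float) -> Segment:
    """Keep a segment inside the reference video [0, duration]."""
    start = min(max(seg[0], 0.0), duration)
    end = min(max(seg[1], 0.0), duration)
    return (start, end)


def refine(initial: Segment, target: Segment, duration: float,
           step: float = 1.0, max_steps: int = 50) -> Segment:
    """Greedy stand-in for a learned policy (assumption, not the paper's agent).

    At each step, apply the action that most increases temporal IoU with
    `target`; stop when no action improves it or `max_steps` is reached.
    """
    current = clip(initial, duration)
    actions = make_actions(step)
    for _ in range(max_steps):
        best, best_score = current, temporal_iou(current, target)
        for act in actions.values():
            cand = clip(act(current), duration)
            if cand[1] <= cand[0]:
                continue  # skip empty or inverted windows
            score = temporal_iou(cand, target)
            if score > best_score:
                best, best_score = cand, score
        if best == current:
            break  # plays the role of a "stop" action
        current = best
    return current


if __name__ == "__main__":
    result = refine((10.0, 40.0), (20.0, 30.0), duration=60.0)
    print(result, temporal_iou(result, (20.0, 30.0)))
```

Note that this greedy rule stalls whenever the initial window does not overlap the target at all, since every single-step move then leaves the temporal IoU at zero. A policy that explores, rather than following immediate gains, is one way around that limitation.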