Ramp Labs proposes a new memory-sharing approach for multi-agent systems, cutting token consumption by up to 65%.

ME News, April 11 (UTC+8). AI infrastructure company Ramp Labs has released research titled "Latent Briefing," which achieves efficient memory sharing in multi-agent systems by directly compressing the large model's KV cache, substantially reducing token consumption without sacrificing accuracy.

In mainstream multi-agent architectures, an orchestrator decomposes tasks and repeatedly calls worker models; as the reasoning chain extends, token usage grows exponentially. The core idea of Latent Briefing is to use the attention mechanism to identify the genuinely critical parts of the context and discard redundant information directly at the representation layer, rather than relying on slow LLM-generated summaries or unstable RAG retrieval.

On the LongBench v2 benchmark, the method performed strongly: worker-model token consumption fell by 65%, median token savings on medium-length documents (32k to 100k tokens) reached 49%, overall accuracy improved by about 3 percentage points over the baseline, and each compression step added only about 1.7 seconds of latency, roughly 20 times faster than the original algorithm. The experiments used Claude Sonnet 4 as the orchestrator and Qwen3-14B as the worker model, covering document types including academic papers, legal documents, novels, and government reports.

The study also found that the optimal compression threshold varies with task difficulty and document length: more aggressive compression suits difficult tasks, where it filters out speculative-reasoning noise, while lighter compression better preserves the dispersed key information in long documents. (Source: BlockBeats)
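The article does not publish implementation details, but the attention-based pruning idea described above can be sketched. The Python snippet below is a hypothetical illustration only: the function name `prune_kv_cache`, the tensor shapes, the scoring heuristic (summing attention mass per position), and the keep ratio are all assumptions for the sketch, not Ramp Labs' actual method.

```python
# Minimal sketch of attention-based KV cache pruning in the spirit of
# Latent Briefing. All names and the scoring heuristic are assumptions.
import torch

def prune_kv_cache(keys, values, attn_weights, keep_ratio=0.35):
    """Keep only the context positions that receive the most attention.

    keys, values: (num_heads, seq_len, head_dim) KV tensors for one layer.
    attn_weights: (num_heads, num_queries, seq_len) attention probabilities
                  from recent decoding steps.
    keep_ratio:   fraction of positions to retain (0.35 ~= a 65% token cut).
    """
    # Aggregate attention mass per context position across heads and queries.
    scores = attn_weights.sum(dim=(0, 1))               # (seq_len,)
    k = max(1, int(keys.shape[1] * keep_ratio))
    keep = torch.topk(scores, k).indices.sort().values  # keep original order
    return keys[:, keep, :], values[:, keep, :], keep

# Toy usage: 8 heads, 128 context tokens, 64-dim heads, 4 recent queries.
H, S, D, Q = 8, 128, 64, 4
keys, values = torch.randn(H, S, D), torch.randn(H, S, D)
attn = torch.softmax(torch.randn(H, Q, S), dim=-1)
k2, v2, kept = prune_kv_cache(keys, values, attn)
print(k2.shape, f"kept {kept.numel()}/{S} positions")
```

The point of pruning at the representation layer is that the worker never re-reads the dropped tokens at all, which is where the reported latency advantage over LLM summarization would come from.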
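The threshold finding can likewise be illustrated with a toy heuristic. The cut-offs and ratios below are invented for illustration and do not come from the paper.

```python
# Hypothetical rule mapping task difficulty and document length to a
# pruning keep-ratio, mirroring the article's qualitative finding.
def choose_keep_ratio(difficulty: str, doc_tokens: int) -> float:
    if difficulty == "hard":
        return 0.25   # aggressive: filter speculative-reasoning noise
    if doc_tokens > 100_000:
        return 0.60   # light: preserve dispersed key info in long docs
    return 0.35       # default, roughly the reported 65% token reduction

print(choose_keep_ratio("hard", 40_000))  # 0.25
```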
