# LLM Papers

Updated on 2026-04-03
| Publish Date | Title | Authors | arXiv ID |
|---|---|---|---|
| 2026-04-01 | Universal YOCO for Efficient Depth Scaling | Yutao Sun et.al. | 2604.01220 |
| 2026-03-30 | Low-Latency Edge LLM Handover via Joint KV Cache Transfer and Token Prefill | Seunghun Lee et.al. | 2603.28018 |
| 2026-03-29 | Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts | Sijia Luo et.al. | 2601.10079 |
| 2026-03-28 | CoDec: Prefix-Shared Decoding Kernel for LLMs | Zhibin Wang et.al. | 2505.17694 |
| 2026-03-27 | LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference | Jiawei Yi et.al. | 2511.14510 |
| 2026-03-26 | Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration | Akhiad Bercovich et.al. | 2602.11937 |
| 2026-03-25 | Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning | Adnan Oomerjee et.al. | 2505.16950 |
| 2026-03-25 | ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators | Guoqiang Zou et.al. | 2512.09427 |
| 2026-03-25 | LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling | Dingyan Zhang et.al. | 2603.15202 |
| 2026-03-24 | PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving | Wenfeng Wang et.al. | 2603.23049 |
| 2026-03-24 | StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving | Azam Nouri et.al. | 2603.28795 |
| 2026-03-22 | The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project | Huamin Chen et.al. | 2603.21354 |
| 2026-03-20 | Understanding and Optimizing Multi-Stage AI Inference Pipelines | Abhimanyu Rajeshkumar Bambhaniya et.al. | 2504.09775 |
| 2026-03-20 | KV Cache Optimization Strategies for Scalable and Efficient LLM Inference | Yichun Xu et.al. | 2603.20397 |
| 2026-03-20 | Trained Persistent Memory for Frozen Decoder-Only LLMs | Hong Jeong et.al. | 2603.22329 |
| 2026-03-19 | StreamingThinker: Large Language Models Can Think While Reading | Junlong Tong et.al. | 2510.17238_(ICLR) |
| 2026-03-18 | Multi-stage Flow Scheduling for LLM Serving | Yijun Sun et.al. | 2603.17456 |
| 2026-03-18 | The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency | Huamin Chen et.al. | 2603.17280 |
| 2026-03-18 | Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs | Tuowei Wang et.al. | 2603.17803 |
| 2026-03-18 | Learning When to Attend: Conditional Memory Access for Long-Context LLMs | Sakshi Choudhary et.al. | 2603.17484 |
| 2026-03-18 | IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems | Hongze Liu et.al. | 2603.17302 |
| 2026-03-17 | EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval | Zebin Yang et.al. | 2510.18546_(NeurIPS) |
| 2026-03-17 | Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective | Noppanat Wadlom et.al. | 2603.16104 |
| 2026-03-17 | Efficient Reasoning on the Edge | Yelysei Bondarenko et.al. | 2603.16867 |
| 2026-03-16 | PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel | Jinjun Yi et.al. | 2511.22333_(ASPLOS) |
| 2026-03-16 | Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation | Yanick Zengaffinen et.al. | 2603.15547 |
| 2026-03-15 | Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models | Ying Xie et.al. | 2603.14517 |
| 2026-03-15 | Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys | Xu Yang et.al. | 2603.14224 |
| 2026-03-14 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | Xueyi Chen et.al. | 2510.18269_(CVPR) |
| 2026-03-14 | DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs | Lizhuo Luo et.al. | 2602.05992 |
| 2026-03-13 | Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity | Donglin Yu et.al. | 2603.12707 |
| 2026-03-13 | Orla: A Library for Serving LLM-Based Multi-Agent Systems | Rana Shahout et.al. | 2603.13605 |
| 2026-03-13 | StatePlane: A Cognitive State Plane for Long-Horizon AI Systems Under Bounded Context | Sasank Annapureddy et.al. | 2603.13644 |
| 2026-03-12 | Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache | Xinhai Wang et.al. | 2603.13420 |
| 2026-03-11 | KV Cache Transform Coding for Compact Storage in LLM Inference | Konrad Staniszewski et.al. | 2511.01815_(ICLR) |
| 2026-03-11 | Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents | Kaiyu Zhou et.al. | 2601.10955 |
| 2026-03-10 | Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework | Kerui Huang et.al. | 2509.14093 |
| 2026-03-10 | ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System | Hao Kang et.al. | 2602.13692 |
| 2026-03-09 | FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference | Guangda Liu et.al. | 2505.13109 |
| 2026-03-09 | EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs | Chang Han et.al. | 2603.08088 |
| 2026-03-09 | LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing | Dongfang Li et.al. | 2603.08453 |
| 2026-03-09 | Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving | Zongze Li et.al. | 2603.13358 |
| 2026-03-08 | Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs | Raghavv Goel et.al. | 2603.07475_(ICLR) |
| 2026-03-06 | Good-Enough LLM Obfuscation (GELO) | Anatoly Belikov et.al. | 2603.05035 |
| 2026-03-06 | MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens | Yu Chen et.al. | 2603.23516 |
| 2026-03-05 | Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator | Cong Li et.al. | 2603.04797 |
| 2026-03-05 | InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context | Xin Teng et.al. | 2603.05353 |
| 2026-03-03 | xLLM Technical Report | Tongxuan Liu et.al. | 2510.14686 |
| 2026-03-03 | Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving | Rui Li et.al. | 2512.22420 |
| 2026-03-02 | AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size | Guanxi Lu et.al. | 2509.26432_(ICLR) |
| 2026-03-02 | OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration | Xinyue Ma et.al. | 2601.10729_(VLDB) |
| 2026-03-02 | Multi-Layer Scheduling for MoE-Based LLM Reasoning | Yifan Sun et.al. | 2602.21626 |
| 2026-03-02 | FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving | Shouwei Gao et.al. | 2602.22593_(ICS) |
| 2026-03-02 | Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics | Samhruth Ananthanarayanan et.al. | 2603.01426 |
| 2026-03-01 | Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs | Ngoc Bui et.al. | 2512.03324 |
| 2026-03-01 | Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention | Mengqi Liao et.al. | 2603.08743 |
| 2026-02-28 | FASA: Frequency-aware Sparse Attention | Yifei Wang et.al. | 2602.03152_(ICLR) |
| 2026-02-28 | RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse | Yingsheng Geng et.al. | 2603.13289 |
| 2026-02-27 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning | Sanjay Kariyappa et.al. | 2602.22603 |
| 2026-02-27 | ICaRus: Identical Cache Reuse for Efficient Multi Model Inference | Sunghyeon Woo et.al. | 2603.13281 |
| 2026-02-26 | DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference | Yongtong Wu et.al. | 2602.21548 |
| 2026-02-23 | ContextPilot: Fast Long-Context Inference via Context Reuse | Yinsicheng Jiang et.al. | 2511.03475 |
| 2026-02-21 | Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs | Subham Sekhar Sahoo et.al. | 2506.01928 |
| 2026-02-20 | Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning | Lexiang Tang et.al. | 2602.18232 |
| 2026-02-19 | ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs | Jianlong Lei et.al. | 2603.08727 |
| 2026-02-17 | Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs | HaoYuan Hu et.al. | 2408.00539 |
| 2026-02-14 | KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider | Jiahao Wang et.al. | 2506.02634_(ATC) |
| 2026-02-13 | Doc-to-LoRA: Learning to Instantly Internalize Contexts | Rujikorn Charakorn et.al. | 2602.15902 |
| 2026-02-12 | Efficient Remote Prefix Fetching with GPU-native Media ASICs | Liang Mi et.al. | 2602.09725 |
| 2026-02-12 | SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining | Yifan Zhang et.al. | 2602.10718 |
| 2026-02-12 | PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving | Sunghyeon Woo et.al. | 2602.12029 |
| 2026-02-12 | GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing | Alessio Ricci Toniolo et.al. | 2602.11688 |
| 2026-02-12 | PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System | Lian Liu et.al. | 2602.11521 |
| 2026-02-10 | LLM Serving Optimization with Variable Prefill and Decode Lengths | Meixuan Wang et.al. | 2508.06133 |
| 2026-02-10 | ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs | Yanlin Qi et.al. | 2602.07721 |
| 2026-02-10 | Learning to Evict from Key-Value Cache | Luca Moschella et.al. | 2602.10238 |
| 2026-02-09 | Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction | Jang-Hyun Kim et.al. | 2601.17668_(FAST) |
| 2026-02-09 | Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference | Yifei Gao et.al. | 2602.08329 |
| 2026-02-09 | Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference | Kexin Chu et.al. | 2508.08438 |
| 2026-02-08 | Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model | Tianyi Wang et.al. | 2602.07878 |
| 2026-02-08 | DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity | Jitai Hao et.al. | 2602.08005 |
| 2026-02-06 | DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving | Ying Yuan et.al. | 2602.06502 |
| 2026-02-03 | Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning | Zhicheng Yang et.al. | 2602.03249 |
| 2026-02-03 | ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution | Zican Dong et.al. | 2602.03203 |
| 2026-02-03 | PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference | Rui Ning et.al. | 2602.06072 |
| 2026-02-02 | RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse | Mingrui Liu et.al. | 2602.01795 |
| 2026-02-02 | CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling | Runsong Zhao et.al. | 2602.01766 |
| 2026-02-02 | You Need an Encoder for Native Position-Independent Caching | Shiju Zhao et.al. | 2602.01519 |
| 2026-02-02 | State Rank Dynamics in Linear Attention LLMs | Ao Sun et.al. | 2602.02195 |
| 2026-01-31 | FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning | Hao Mark Chen et.al. | 2509.00195_(ASPLOS) |
| 2026-01-30 | Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models | Elias Hossain et.al. | 2510.17098 |
| 2026-01-30 | Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live | Hanchen Li et.al. | 2511.02230 |
| 2026-01-30 | CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control | Qiaoling Chen et.al. | 2601.22705 |
| 2026-01-30 | Towards Resiliency in Large Language Model Serving with KevlarFlow | Shangshu Qian et.al. | 2601.22438 |
| 2026-01-30 | Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference | Yiding Feng et.al. | 2601.22996 |
| 2026-01-30 | Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference | Nikhil Gopal et.al. | 2602.00328 |
| 2026-01-29 | Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving | Chendong Song et.al. | 2601.21351_(ICML) |
| 2026-01-29 | Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis | Qingyue Yang et.al. | 2601.21709_(ICLR) |
| 2026-01-28 | SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips | Jiahuan Yu et.al. | 2601.20309 |
| 2026-01-28 | Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning | Zeyu Xing et.al. | 2601.20326_(ICLR) |
| 2026-01-28 | ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference | Ketan Thakkar et.al. | 2601.21109 |
| 2026-01-27 | Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding | Zhongyu Xiao et.al. | 2601.17917 |
| 2026-01-26 | Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective | Fangzhou Wu et.al. | 2601.18999_(ICLR) |
| 2026-01-21 | QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design | Nilesh Prasad Pandey et.al. | 2601.14549 |
| 2026-01-20 | KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments | Junyoung Park et.al. | 2504.15364_(NeurIPS) |
| 2026-01-20 | ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management | Jing Zou et.al. | 2601.13631 |
| 2026-01-20 | HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference | Zhiyuan Shi et.al. | 2601.13684 |
| 2026-01-20 | LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems | Badri N. Patro et.al. | 2601.14053 |
| 2026-01-19 | Cache Your Prompt When It’s Green: Carbon-Aware Caching for Large Language Model Serving | Yuyang Tian et.al. | 2505.23970 |
| 2026-01-19 | SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences | Jungyoub Cha et.al. | 2505.20776 |
| 2026-01-19 | Batch Query Processing and Optimization for Agentic Workflows | Junyi Shen et.al. | 2509.02121 |
| 2026-01-19 | Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference | Anish Biswas et.al. | 2601.12967 |
| 2026-01-19 | From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation | Jiahao Wang et.al. | 2601.12904 |
| 2026-01-18 | Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline | Jiawei Xu et.al. | 2601.12307 |
| 2026-01-16 | RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation | Amna Masood et.al. | 2601.11822 |
| 2026-01-15 | Online Scheduling for LLM Inference with KV Cache Constraints | Patrick Jaillet et.al. | 2502.07115 |
| 2026-01-15 | AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving | Shaoting Feng et.al. | 2509.00105_(SOSP) |
| 2026-01-15 | Hardware Acceleration for Neural Networks: A Comprehensive Survey | Bin Xu et.al. | 2512.23914 |
| 2026-01-14 | APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs | Jiakun Fan et.al. | 2506.03296 |
| 2026-01-14 | CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation | Hasan Akgul et.al. | 2510.19670 |
| 2026-01-14 | Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling | Zhixiang Liang et.al. | 2601.09093 |
| 2026-01-13 | TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL | Jinbo Su et.al. | 2601.08743 |
| 2026-01-12 | Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference | Rei Taniguchi et.al. | 2601.07667_(CHI) |
| 2026-01-11 | Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models | Haoyu Wang et.al. | 2506.07334 |
| 2026-01-11 | Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention | Zhen Yang et.al. | 2510.13940 |
| 2026-01-09 | AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | Tianhao Xu et.al. | 2601.06288 |
| 2026-01-07 | MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing | Zhaoyuan Su et.al. | 2506.02006 |
| 2026-01-07 | InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing | Shuaiyi Li et.al. | 2505.22156 |
| 2026-01-06 | Making MoE-based LLM Inference Resilient with Tarragon | Songyu Zhang et.al. | 2601.01310 |
| 2026-01-06 | Joint Encoding of KV-Cache Blocks for Scalable LLM Serving | Joseph Kampeas et.al. | 2601.03067 |
| 2026-01-05 | Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints | Ruicheng Ao et.al. | 2504.11320 |
| 2026-01-05 | LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference | Hossein Rajabzadeh et.al. | 2601.02569 |
| 2026-01-05 | Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle | Zihan Wang et.al. | 2601.16986 |
| 2026-01-03 | Warp-Cortex: An Asynchronous, Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware | Jorge L. Ruiz Williams et.al. | 2601.01298 |
| 2026-01-03 | KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs | Yixuan Tang et.al. | 2601.01046 |
| 2025-12-28 | Attention Is All You Need for KV Cache in Diffusion LLMs | Quan Nguyen-Tri et.al. | 2510.14973 |
| 2025-12-28 | WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference | Aiwei Liu et.al. | 2512.22737 |
| 2025-12-25 | PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System | Hyucksung Kwon et.al. | 2412.20166_(CHI) |
| 2025-12-25 | Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths | Ahilan Ayyachamy Nadar Ponnusamy et.al. | 2601.11564 |
| 2025-12-24 | V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | Donghyuk Kim et.al. | 2512.12284_(HPCA) |
| 2025-12-22 | MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning | Tao Zhang et.al. | 2512.19206 |
| 2025-12-20 | TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale | Dongha Yoon et.al. | 2512.18194 |
| 2025-12-20 | MatKV: Trading Compute for Flash Storage in LLM Inference | Kun-Woo Shin et.al. | 2512.22195_(ICDE) |
| 2025-12-19 | xGR: Efficient Generative Recommendation Serving at Scale | Qingxiao Sun et.al. | 2512.11529 |
| 2025-12-18 | MEPIC: Memory Efficient Position Independent Caching for LLM Serving | Qian Wang et.al. | 2512.16822 |
| 2025-12-17 | CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing | Kuan Lu et.al. | 2512.15550 |
| 2025-12-17 | Dynamic Rebatching for Efficient Early-Exit Inference with DREX | Xuting Liu et.al. | 2512.15705 |
| 2025-12-16 | SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching | Xinye Zhao et.al. | 2509.24832 |
| 2025-12-16 | Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents | Hongqiu Ni et.al. | 2512.14142 |
| 2025-12-16 | EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving | Shaoting Feng et.al. | 2512.14946 |
| 2025-12-16 | Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading | William Meng et.al. | 2601.19910 |
| 2025-12-15 | Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing | Zewen Qiang et.al. | 2512.13109 |
| 2025-12-14 | Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving | Hui Zeng et.al. | 2511.06029_(AAAI) |
| 2025-12-12 | Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference | Adilet Metinov et.al. | 2512.11221 |
| 2025-12-12 | Hold Onto That Thought: Assessing KV Cache Compression On Reasoning | Minghui Liu et.al. | 2512.12008 |
| 2025-12-11 | Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders | Qingsen Ma et.al. | 2512.10547 |
| 2025-12-11 | CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving | Dong Liu et.al. | 2512.11920_(FPGA) |
| 2025-12-10 | SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators | Jonathan Li et.al. | 2511.03092 |
| 2025-12-10 | PerCache: Predictive Hierarchical Cache for RAG Applications on Mobile Devices | Kaiwei Liu et.al. | 2601.11553 |
| 2025-12-08 | H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference | Zizhuo Fu et.al. | 2508.16653_(ICC) |
| 2025-12-08 | Leveraging KV Similarity for Online Structured Pruning in LLMs | Jungmin Lee et.al. | 2512.07090 |
| 2025-12-07 | KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models | Sourjya Roy et.al. | 2512.06727 |
| 2025-12-07 | ELANA: A Simple Energy and Latency Analyzer for LLMs | Hung-Yueh Chiang et.al. | 2512.09946 |
| 2025-12-05 | LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference | Yuhan Liu et.al. | 2510.09665 |
| 2025-12-04 | KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs | Prashant Pandey et.al. | 2512.11851 |
| 2025-12-02 | SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification | Zhendong Tan et.al. | 2512.02337 |
| 2025-12-01 | Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity | Wenbin Zhu et.al. | 2512.01357 |
| 2025-12-01 | KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction | Aomufei Yuan et.al. | 2512.17917 |
| 2025-11-30 | SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving | Bohan Zhao et.al. | 2512.00719 |
| 2025-11-30 | SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs | Jiaming Xu et.al. | 2512.00722_(ASPLOS) |
| 2025-11-29 | G-KV: Decoding-Time KV Cache Eviction with Global Attention | Mengqi Liao et.al. | 2512.00504 |
| 2025-11-27 | Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression | Boris Kriuk et.al. | 2512.17914 |
| 2025-11-26 | No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha | Amey Agrawal et.al. | 2409.17264 |
| 2025-11-26 | Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA | Allison Li et.al. | 2512.17910 |
| 2025-11-25 | On 10x Better Scalability: KV Stores Scale Up KV Cache | Weiping Yu et.al. | 2511.16138 |
| 2025-11-25 | Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation | Inferix Team et.al. | 2511.20714 |
| 2025-11-24 | SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression | Santhosh G S et.al. | 2511.18936 |
| 2025-11-24 | ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models | Long Lian et.al. | 2512.07843 |
| 2025-11-23 | Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost | Haojun Xia et.al. | 2511.18643 |
| 2025-11-20 | KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference | Xing Li et.al. | 2502.04420_(ICML) |
| 2025-11-18 | Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models | Rui Zhu et.al. | 2511.14694 |
| 2025-11-17 | Hogwild! Inference: Parallel LLM Generation via Concurrent Attention | Gleb Rodionov et.al. | 2504.06261_(NeurIPS) |
| 2025-11-14 | Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications | Jiaxi Li et.al. | 2601.08833 |
| 2025-11-13 | LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components | Yaru Li et.al. | 2511.10394_(ISS) |
| 2025-11-13 | A³: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving | Yuechi Zhou et.al. | 2511.17560 |
| 2025-11-11 | Glia: A Human-Inspired AI for Automated Systems Design and Optimization | Pouya Hamadanian et.al. | 2510.27176 |
| 2025-11-10 | KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse | Jingbo Yang et.al. | 2502.16002 |
| 2025-11-10 | StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | Yilong Chen et.al. | 2511.07278 |
| 2025-11-09 | SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention | Bohan Yu et.al. | 2511.06446_(AAAI) |
| 2025-11-08 | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | Yanhao Dong et.al. | 2504.06319 |
| 2025-11-07 | Inference-Time Hyper-Scaling with KV Cache Compression | Adrian Łańcucki et.al. | 2506.05345_(NeurIPS) |
| 2025-11-07 | Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations | Maurizio Diaz et.al. | 2508.17032 |
| 2025-11-06 | SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference | Tian Xia et.al. | 2505.24095 |
| 2025-11-06 | HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts | Neil He et.al. | 2505.24722 |
| 2025-11-06 | Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing | Mingyu Sung et.al. | 2511.04002 |
| 2025-11-06 | DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing | Lei Gao et.al. | 2511.04791 |
| 2025-11-05 | FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs | Xuan He et.al. | 2511.00807_(ISS) |
| 2025-11-05 | ALAS: Transactional and Dynamic Multi-Agent LLM Planning | Longling Geng et.al. | 2511.03094 |
| 2025-11-05 | AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism | Wendong Xu et.al. | 2511.11617_(DATE) |
| 2025-11-04 | LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context | Yudong Li et.al. | 2511.02366 |
| 2025-11-04 | Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration | Jingbo Wang et.al. | 2511.02200 |
| 2025-11-03 | Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving | Chengying Huan et.al. | 2511.01633 |
| 2025-11-03 | TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks | Hanwen Xu et.al. | 2511.01527 |
| 2025-11-02 | HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL | You Peng et.al. | 2505.05286 |
| 2025-11-02 | FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management | Nazmul Takbir et.al. | 2511.00868 |
| 2025-11-01 | KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems | Hancheng Ye et.al. | 2510.12872_(FAST) |
| 2025-11-01 | A CPU-Centric Perspective on Agentic AI | Ritik Raj et.al. | 2511.00739 |
| 2025-11-01 | EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory | Wenzhe Fan et.al. | 2511.01912 |
| 2025-11-01 | Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization | Massinissa Merouani et.al. | 2511.00592_(CHI) |
| 2025-10-31 | Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications | Zhuohang Bian et.al. | 2510.18586 |
| 2025-10-31 | Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits | Dowon Kim et.al. | 2511.00321 |
| 2025-10-30 | Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling | Reda El Makroum et.al. | 2510.26603 |
| 2025-10-29 | Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving | Ruihao Li et.al. | 2507.11507 |
| 2025-10-29 | Serve Programs, Not Prompts | In Gim et.al. | 2510.25412_(HotOS) |
| 2025-10-29 | KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA | Zhuo Chen et.al. | 2510.25101 |
| 2025-10-28 | Pie: A Programmable Serving System for Emerging LLM Applications | In Gim et.al. | 2510.24051_(SOSP) |
| 2025-10-28 | Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion | Xianjun Gao et.al. | 2510.24390 |
| 2025-10-28 | From Narrative to Action: A Hierarchical LLM-Agent Framework for Human Mobility Generation | Qiumeng Li et.al. | 2510.24802 |
| 2025-10-26 | Batch Speculative Decoding Done Right | Ranran Haoran Zhang et.al. | 2510.22876 |
| 2025-10-26 | SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size | Jinhan Chen et.al. | 2510.22556 |
| 2025-10-26 | SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming | Adhyayan Veer Singh et.al. | 2510.22626 |
| 2025-10-24 | Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning | Jiwon Song et.al. | 2505.13866 |
| 2025-10-23 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Yu Fu et.al. | 2410.19258_(ICLR) |
| 2025-10-23 | Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution | Zhiyang Chen et.al. | 2510.15312 |
| 2025-10-23 | HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement | Danying Ge et.al. | 2510.20878 |
| 2025-10-22 | PTFA: An LLM-based Agent that Facilitates Online Consensus Building through Parallel Thinking | Wen Gu et.al. | 2503.12499 |
| 2025-10-21 | The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems | Linke Song et.al. | 2409.20002_(ICS) |
| 2025-10-21 | Reasoning Language Model Inference Serving Unveiled: An Empirical Study | Qi Li et.al. | 2510.18672 |
| 2025-10-21 | The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability | Zijie Xu et.al. | 2510.18563 |
| 2025-10-19 | STARK: Strategic Team of Agents for Refining Kernels | Juncheng Dong et.al. | 2510.16996 |
| 2025-10-18 | Ripple Effect Protocol: Coordinating Agent Populations | Ayush Chopra et.al. | 2510.16572 |
| 2025-10-16 | Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies | Mason Nakamura et.al. | 2510.14312 |
| 2025-10-16 | Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing | Tianhua Xia et.al. | 2510.16040 |
| 2025-10-15 | LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning | Haoyue Zhang et.al. | 2506.15969 |
| 2025-10-15 | BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure | Yiyuan He et.al. | 2510.13223 |
| 2025-10-15 | Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving | Nikos Pagonas et.al. | 2510.14126 |
| 2025-10-14 | Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models | Rabimba Karanjai et.al. | 2510.12080 |
| 2025-10-13 | MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE | Soheil Zibakhsh et.al. | 2509.17238 |
| 2025-10-13 | Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models | Junhyuck Kim et.al. | 2510.10964 |
| 2025-10-13 | Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony | Han Lu et.al. | 2510.11345 |
| 2025-10-11 | CacheClip: Accelerating RAG with Effective KV Cache Reuse | Bin Yang et.al. | 2510.10129 |
| 2025-10-11 | Agentic Troubleshooting Guide Automation for Incident Management | Jiayi Mao et.al. | 2510.10074 |
| 2025-10-10 | OrcaLoca: An LLM Agent Framework for Software Issue Localization | Zhongming Yu et.al. | 2502.00350 |
| 2025-10-09 | Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval | Wenhao Li et.al. | 2508.19740 |
| 2025-10-08 | AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs | Peize He et.al. | 2510.07293 |
| 2025-10-08 | AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding | Shuqing Luo et.al. | 2510.07486 |
| 2025-10-08 | FLEET: Formal Language-Grounded Scheduling for Heterogeneous Robot Teams | Corban Rivera et.al. | 2510.07417 |
| 2025-10-07 | VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization | Dingyu Yao et.al. | 2510.06175 |
| 2025-10-07 | H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference | Harshil Vejendla et.al. | 2510.05529 |
| 2025-10-06 | ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering | Yuki Imajuku et.al. | 2506.09050_(NeurIPS) |
| 2025-10-06 | Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning | Edward Y. Chang et.al. | 2510.04488 |
| 2025-10-03 | TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling | Junyi Chen et.al. | 2510.02758_(EuroSys) |
| 2025-10-03 | Automatic Building Code Review: A Case Study | Hanlong Wan et.al. | 2510.02634 |
| 2025-10-02 | QSpec: Speculative Decoding with Complementary Quantization Schemes | Juntao Zhao et.al. | 2410.11305 |
| 2025-10-02 | KaVa: Latent Reasoning via Compressed KV-Cache Distillation | Anna Kuzina et.al. | 2510.02312 |
| 2025-10-02 | ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models | Gursimran Singh et.al. | 2510.02613_(ISS) |
| 2025-10-02 | KVComm: Enabling Efficient LLM Communication through Selective KV Sharing | Xiangyu Shi et.al. | 2510.03346 |
| 2025-10-01 | Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies | Kyoungmin Kim et.al. | 2411.07447 |
| 2025-09-30 | KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction | Jang-Hyun Kim et.al. | 2505.23416_(NeurIPS) |
| 2025-09-30 | AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges | Ranjan Sapkota et.al. | 2505.10468 |
| 2025-09-30 | Towards Agentic OS: An LLM Agent Framework for Linux Schedulers | Yusheng Zheng et.al. | 2509.01245 |
| 2025-09-29 | Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models | Keda Tao et.al. | 2503.16257 |
| 2025-09-29 | KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation | Ching Han Chen et.al. | 2505.07618 |
| 2025-09-29 | SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching | Yuxuan Zhu et.al. | 2504.00970 |
| 2025-09-29 | SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving | Qihui Zhou et.al. | 2509.24626 |
| 2025-09-29 | SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents | Gyuhyeon Seo et.al. | 2509.24282 |
| 2025-09-28 | From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle | Kaustubh Vyas et.al. | 2412.12839_(ICLR) |
| 2025-09-28 | HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models | Zhinan Xie et.al. | 2509.23928 |
| 2025-09-27 | Runtime Adaptive Pruning for LLM Inference | Huanrong Liu et.al. | 2505.17138 |
| 2025-09-27 | ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration | Xianglong Yan et.al. | 2505.24357 |
| 2025-09-27 | READER: Retrieval-Assisted Drafter for Efficient LLM Inference | Maxim Divilkovskiy et.al. | 2508.09072 |
| 2025-09-26 | KV Cache Steering for Controlling Frozen LLMs | Max Belitsky et.al. | 2507.08799 |
| 2025-09-26 | ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration | Gaole Dai et.al. | 2509.21823 |
| 2025-09-26 | Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents | Heyang Gao et.al. | 2510.03253 |
| 2025-09-26 | LLM Assisted Alpha Fairness for 6 GHz WiFi and NR-U Coexistence: An Agentic Orchestrator for Throughput, Energy, and SLA | Qun Wang et.al. | 2510.17814 |
| 2025-09-25 | HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling | Zahra Yousefijamarani et.al. | 2508.15919 |
| 2025-09-25 | Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization | Yuhang Xu et.al. | 2509.21301 |
| 2025-09-24 | UNComp: Can Matrix Entropy Uncover Sparsity? – A Compressor Design from an Uncertainty-Aware Perspective | Jing Xiong et.al. | 2410.03090_(EMNLP) |
| 2025-09-24 | Gyges: Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference | Haoyu Chen et.al. | 2509.19729 |
| 2025-09-24 | CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks | Jiewei Chen et.al. | 2509.19855 |
| 2025-09-22 | A Large Language Model-based multi-agent manufacturing system for intelligent shopfloor | Zhen Zhao et.al. | 2405.16887 |
| 2025-09-22 | Attention Sinks: A ‘Catch, Tag, Release’ Mechanism for Embeddings | Stephen Zhang et.al. | 2502.00919 |
| 2025-09-22 | Asteria: Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access | Chaoyi Ruan et.al. | 2509.17360 |
| 2025-09-21 | ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching | Xingyu Xiang et.al. | 2509.16857 |
| 2025-09-20 | EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs | Zhengge Cai et.al. | 2509.16686 |
| 2025-09-20 | Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games | Niv Eckhaus et.al. | 2506.05309 |
| 2025-09-19 | Overhearing LLM Agents: A Survey, Taxonomy, and Roadmap | Andrew Zhu et.al. | 2509.16325 |
| 2025-09-17 | CrowdAgent: Multi-Agent Managed Multi-Source Annotation System | Maosheng Qin et.al. | 2509.14030 |
| 2025-09-16 | FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference | Dongwei Wang et.al. | 2508.08256_(EMNLP) |
| 2025-09-15 | Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System | Yunhua Fang et.al. | 2508.13231_(CHI) |
| 2025-09-15 | FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving | Kyungmin Bin et.al. | 2509.06261 |
| 2025-09-08 | Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing | Rui Xie et.al. | 2509.03377 |
| 2025-09-08 | Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models | Yinjie Wang et.al. | 2509.06949 |
| 2025-09-03 | ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving | Yifan Qiao et.al. | 2410.01228 |
| 2025-09-01 | TRACE-CS: A Hybrid Logic-LLM System for Explainable Course Scheduling | Stylianos Loukas Vasileiou et.al. | 2409.03671 |
| 2025-09-01 | LLMs cannot spot math errors, even when allowed to peek into the solution | KV Aditya Srivatsa et.al. | 2509.01395_(EMNLP) |
| 2025-08-30 | DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction | Yanqi Zhang et.al. | 2412.03131_(SOSP) |
| 2025-08-30 | LLM-Assisted Iterative Evolution with Swarm Intelligence Toward SuperBrain | Li Weigang et.al. | 2509.00510 |
| 2025-08-29 | Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward | Yong Deng et.al. | 2508.12800 |
| 2025-08-28 | TinyServe: Query-Aware Cache Selection for Efficient LLM Serving | Dong Liu et.al. | 2509.12211_(ACM MM) |
| 2025-08-26 | Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing | Junyi Wen et.al. | 2507.08045 |
| 2025-08-26 | Strata: Hierarchical Context Caching for Long Context Language Model Serving | Zhiqiang Xie et.al. | 2508.18572 |
| 2025-08-24 | PRISM: Efficient Long-Range Reasoning With Short-Context LLMs | Dulhan Jayalath et.al. | 2412.18914_(EMNLP) |
| 2025-08-21 | Efficient Mixed-Precision Large Language Model Inference with TurboMind | Li Zhang et.al. | 2508.15601 |
| 2025-08-20 | Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration | Peilin Ji et.al. | 2508.14654 |
| 2025-08-18 | Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis | Ayoub Ben Chaliah et.al. | 2508.13382 |
| 2025-08-17 | ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads | Zhuorui Liu et.al. | 2508.12407 |
| 2025-08-15 | UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs? | Mukund Choudhary et.al. | 2508.11260 |
| 2025-08-14 | SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression | Mengjie Li et.al. | 2508.15806 |
| 2025-08-14 | ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs | Keyu Chen et.al. | 2508.08895 |
| 2025-08-12 | Retrospective Sparse Attention for Efficient Long-Context Generation | Seonghwan Choi et.al. | 2508.09001 |
| 2025-08-12 | Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation | Jiongchi Yu et.al. | 2508.07745 |
| 2025-08-12 | AIOS: LLM Agent Operating System | Kai Mei et.al. | 2403.16971 |
| 2025-08-11 | Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories | Ming-Yen Lee et.al. | 2508.08457 |
| 2025-08-11 | From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework | Yunkai Hu et.al. | 2508.08147 |
| 2025-08-09 | Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud | Jinyuan Chen et.al. | 2508.06948 |
| 2025-08-06 | p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay | Jun Zhang et.al. | 2412.04449_(ICC) |
| 2025-08-06 | StackPilot: Autonomous Function Agents for Scalable and Environment-Free Code Execution | Xinkui Zhao et.al. | 2508.11665 |
| 2025-08-06 | AquaChat++: LLM-Assisted Multi-ROV Inspection for Aquaculture Net Pens with Integrated Battery Management and Thruster Fault Tolerance | Abdelhaleem Saad et.al. | 2508.06554 |
| 2025-08-05 | REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks | Longling Geng et.al. | 2502.18836 |
| 2025-08-04 | CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation | Xiaolin Lin et.al. | 2508.02401 |
| 2025-08-01 | CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization | Yuning Jiang et.al. | 2508.00478 |
| 2025-07-30 | A Survey on Large Language Model Acceleration based on KV Cache Management | Haoyang Li et.al. | 2412.19442 |
| 2025-07-29 | Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling | Rajeev Patwari et.al. | 2508.00904 |
| 2025-07-29 | StaffPro: an LLM Agent for Joint Staffing and Profiling | Alessio Maritan et.al. | 2507.21636 |
| 2025-07-26 | FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression | Runchao Li et.al. | 2507.20030 |
| 2025-07-25 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | StepFun et.al. | 2507.19427 |
| 2025-07-24 | NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database | Weizhi Fei et.al. | 2507.18028 |
| 2025-07-22 | Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning | Hongyin Luo et.al. | 2507.16784 |
| 2025-07-21 | LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra | Seth Karten et.al. | 2507.15815 |
| 2025-07-18 | DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation | Ziqi Wang et.al. | 2507.14267 |
| 2025-07-18 | CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education | Jianing Zhao et.al. | 2507.13814 |
| 2025-07-14 | InstCache: A Predictive Cache for LLM Serving | Longwei Zou et.al. | 2411.13820 |
| 2025-07-14 | DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving | Yuhan Liu et.al. | 2411.02820 |
| 2025-07-10 | Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores | Vivek Chari et.al. | 2507.08143 |
| 2025-07-10 | KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows | Zaifeng Pan et.al. | 2507.07400 |
| 2025-07-09 | Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration | Xinyuan Song et.al. | 2507.06520 |
| 2025-07-08 | OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety | Sanidhya Vijayvargiya et.al. | 2507.06134 |
| 2025-07-07 | StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | Meng Wei et.al. | 2507.05240 |
| 2025-07-04 | Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought | Tencent Hunyuan Team et.al. | 2505.15431 |
| 2025-07-01 | VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator | Zhican Wang et.al. | 2507.00797_(DAC) |
| 2025-07-01 | EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens | Chaoqun Yang et.al. | 2507.00715_(KDD) |
| 2025-06-30 | Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC | Xinming Wei et.al. | 2506.24045 |
| 2025-06-30 | RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference | Yaoqi Chen et.al. | 2505.02922 |
| 2025-06-28 | Efficiently Serving Large Multimodal Models Using EPD Disaggregation | Gursimran Singh et.al. | 2501.05460 |
| 2025-06-28 | FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets | Shrenik Jadhav et.al. | 2506.22708 |
| 2025-06-27 | Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference | Yaohua Tang et.al. | 2502.15294 |
| 2025-06-26 | CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation | Nicolas Bougie et.al. | 2506.21805 |
| 2025-06-26 | MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models | Yifan Liu et.al. | 2506.21784 |
| 2025-06-25 | MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation | Gurusha Juneja et.al. | 2506.20737 |
| 2025-06-23 | RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding | Guanzheng Chen et.al. | 2502.20330_(ICML) |
| 2025-06-18 | eLLM: Elastic Memory Management Framework for Efficient LLM Serving | Jiale Xu et.al. | 2506.15155 |
| 2025-06-18 | Moment Sampling in Video LLMs for Long-Form Video QA | Mustafa Chasmai et.al. | 2507.00033_(CVPR) |
| 2025-06-17 | LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification | Penghui Yang et.al. | 2502.17421 |
| 2025-06-16 | AlphaEvolve: A coding agent for scientific and algorithmic discovery | Alexander Novikov et.al. | 2506.13131 |
| 2025-06-14 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Guangda Liu et.al. | 2412.03213 |
| 2025-06-13 | FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference | Runheng Liu et.al. | 2405.04065_(ACL) |
| 2025-06-12 | SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding | Ziyi Zhang et.al. | 2506.11309 |
| 2025-06-12 | SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models | Kaushal Kumar Maurya et.al. | 2408.08545 |
| 2025-06-11 | SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems | Peiran Li et.al. | 2506.07564 |
| 2025-06-09 | DeepServe: Serverless Large Language Model Serving at Scale | Junhao Hu et.al. | 2501.14417 |
| 2025-06-08 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Akshat Sharma et.al. | 2411.18077 |
| 2025-06-07 | EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments | Sara Fish et.al. | 2503.18825 |
| 2025-06-05 | Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback | Junior Cedric Tonga et.al. | 2506.04920_(ISS) |
| 2025-06-04 | KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation | Chaoyi Jiang et.al. | 2411.17089_(ACL) |
| 2025-06-04 | HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | Minghui Liu et.al. | 2412.16187 |
| 2025-06-04 | AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Dhaval Patel et.al. | 2506.03828 |
| 2025-06-03 | A²ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization | Junhui He et.al. | 2502.12665 |
| 2025-06-03 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | Jialong Wu et.al. | 2412.13649_(ACL) |
| 2025-06-02 | SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation | Aurick Qiao et.al. | 2410.03960 |
| 2025-06-01 | A Survey of LLM × DATA | Xuanhe Zhou et.al. | 2505.18458 |
| 2025-05-30 | RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning | Junhao Hu et.al. | 2502.11147 |
| 2025-05-30 | Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding | Feiyu Yao et.al. | 2506.15704 |
| 2025-05-29 | EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse | Tianyu Guo et.al. | 2505.21889_(DIS) |
| 2025-05-28 | gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling | Tianyu Guo et.al. | 2504.14775 |
| 2025-05-28 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Coleman Hooper et.al. | 2401.18079_(NeurIPS) |
| 2025-05-28 | Design and testing of an agent chatbot supporting decision making with public transport data | Luca Fantin et.al. | 2505.22698 |
| 2025-05-27 | Hardware-Efficient Attention for Fast Decoding | Ted Zadouri et.al. | 2505.21487 |
| 2025-05-27 | TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization | Dingyu Yao et.al. | 2505.19586 |
| 2025-05-27 | EPIC: Efficient Position-Independent Caching for Serving Large Language Models | Junhao Hu et.al. | 2410.15332 |
| 2025-05-26 | PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving | Ahmet Caner Yüzügüler et.al. | 2501.08192 |
| 2025-05-26 | BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems | Yuxin Wang et.al. | 2401.17644 |
| 2025-05-26 | Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents | Ye Ye et.al. | 2505.19436 |
| 2025-05-24 | PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs | Tengxuan Liu et.al. | 2505.18610 |
| 2025-05-23 | Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence | Amirhosein Ghasemabadi et.al. | 2505.20325 |
| 2025-05-23 | ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy | Gengyang Li et.al. | 2505.15684 |
| 2025-05-23 | Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation | Yuelyu Ji et.al. | 2505.17391 |
| 2025-05-23 | Boosting Long-Context Management via Query-Guided Activation Refilling | Hongjin Qian et.al. | 2412.12486_(ACL) |
| 2025-05-23 | Mitigate Position Bias in Large Language Models via Scaling a Single Dimension | Yijiong Yu et.al. | 2406.02536_(ACL) |
| 2025-05-21 | Can LLMs Maintain Fundamental Abilities under KV Cache Compression? | Xiang Liu et.al. | 2502.01941 |
| 2025-05-21 | LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | Zhenyu Ning et.al. | 2505.15269 |
| 2025-05-20 | CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration | Pengyan Zhu et.al. | 2505.14085 |
| 2025-05-20 | Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation | Peter Baile Chen et.al. | 2505.14398 |
| 2025-05-20 | Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding | Sakhinana Sagar Srinivas et.al. | 2504.01281 |
| 2025-05-19 | SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache | Qiuyu Zhu et.al. | 2505.10951 |
| 2025-05-19 | Learning Virtual Machine Scheduling in Cloud Computing through Language Agents | JieHao Wu et.al. | 2505.10117 |
| 2025-05-18 | ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning | Edward Y. Chang et.al. | 2505.12501 |
| 2025-05-17 | Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents | Tiannuo Yang et.al. | 2505.12065 |
| 2025-05-17 | OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents | Raghav Thind et.al. | 2504.16918 |
| 2025-05-16 | KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse | Huan Yang et.al. | 2503.16525 |
| 2025-05-14 | Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization | Minsu Kim et.al. | 2503.18599 |
| 2025-05-13 | Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs | Lucas Maisonnave et.al. | 2504.13989 |
| 2025-05-12 | SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models | Hang Wu et.al. | 2505.07680 |
| 2025-05-12 | PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications | Kuntai Du et.al. | 2505.07203 |
| 2025-05-09 | Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM | Zehao Fan et.al. | 2505.05772 |
| 2025-05-01 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Yujun Lin et.al. | 2405.04532 |
| 2025-04-28 | semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage | Ke Hong et.al. | 2504.19867 |
| 2025-04-25 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Hanshi Sun et.al. | 2410.21465 |
| 2025-04-24 | L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference | Qingyuan Liu et.al. | 2504.17584 |
| 2025-04-24 | Tempo: Application-aware LLM Serving with Mixed SLO Requirements | Wei Zhang et.al. | 2504.20068 |
| 2025-04-24 | Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents | Yueying Li et.al. | 2504.07347 |
| 2025-04-22 | Optimizing SLO-oriented LLM Serving with PD-Multiplexing | Weihao Cui et.al. | 2504.14489 |
| 2025-04-21 | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | Shang Yang et.al. | 2502.14866 |
| 2025-04-21 | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | Zihao Ye et.al. | 2501.01005 |
| 2025-04-21 | PLANET: A Collection of Benchmarks for Evaluating LLMs’ Planning Capabilities | Haoming Li et.al. | 2504.14773 |
| 2025-04-19 | Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management | Hang Zhang et.al. | 2505.03756 |
| 2025-04-16 | Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading | Kihyun Kim et.al. | 2504.11816 |
| 2025-04-16 | Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs | Hyungwoo Lee et.al. | 2504.11765 |
| 2025-04-14 | AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference | Yangshen Deng et.al. | 2504.10326 |
| 2025-04-13 | Efficient LLM Serving on Hybrid Real-time and Best-effort Requests | Wan Borui et.al. | 2504.09590 |
| 2025-04-13 | Block-Attention for Efficient Prefilling | Dongyang Ma et.al. | 2409.15355_(ICLR) |
| 2025-04-10 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | Shihong Gao et.al. | 2504.07494 |
| 2025-04-09 | Optimizing LLM Queries in Relational Data Analytics Workloads | Shu Liu et.al. | 2403.05821 |
| 2025-04-09 | MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation | Hongjin Qian et.al. | 2409.05591_(TheWebConf) |
| 2025-04-03 | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | Jiayi Yao et.al. | 2405.16444 |
| 2025-04-03 | HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse | Yuwei An et.al. | 2504.02921 |
| 2025-04-02 | MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding | Ranajoy Sadhukhan et.al. | 2408.11049 |
| 2025-04-01 | Personality-Driven Decision-Making in LLM-Based Autonomous Agents | Lewis Newsham et.al. | 2504.00727 |
| 2025-04-01 | HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents | Shiyi Liu et.al. | 2504.00434 |
| 2025-03-31 | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving | Wei Gao et.al. | 2503.24000 |
| 2025-03-31 | Training-Free Exponential Context Extension via Cascading KV Cache | Jeffrey Willette et.al. | 2406.17808 |
| 2025-03-25 | Agent-Initiated Interaction in Phone UI Automation | Noam Kahlon et.al. | 2503.19537 |
| 2025-03-24 | Mitigating KV Cache Competition to Enhance User Experience in LLM Inference | Haiying Shen et.al. | 2503.13773 |
| 2025-03-19 | Exploring Large Language Models for Word Games: Who is the Spy? | Chentian Wei et.al. | 2503.15235 |
| 2025-03-12 | COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation | Di Zhao et.al. | 2503.09263 |
| 2025-03-11 | FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework | Jianian Zhu et.al. | 2503.08461 |
| 2025-03-11 | LLM4MAC: An LLM-Driven Reinforcement Learning Framework for MAC Protocol Emergence | Renxuan Tan et.al. | 2503.08123 |
| 2025-03-11 | Agent-Oriented Planning in Multi-Agent Systems | Ao Li et.al. | 2410.02189_(ICLR) |
| 2025-03-11 | SCBench: A KV Cache-Centric Analysis of Long-Context Methods | Yucheng Li et.al. | 2412.10319_(ICLR) |
| 2025-03-10 | Queueing, Predictions, and LLMs: Challenges and Open Problems | Michael Mitzenmacher et.al. | 2503.07545 |
| 2025-03-10 | DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems | Junwei Yu et.al. | 2503.07675 |
| 2025-03-10 | TokenButler: Token Importance is Predictable | Yash Akhauri et.al. | 2503.07518 |
| 2025-03-09 | Seesaw: High-throughput LLM Inference via Model Re-sharding | Qidong Su et.al. | 2503.06433 |
| 2025-03-07 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Jinwei Yao et.al. | 2404.00242_(DATE) |
| 2025-03-06 | LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Souvik Kundu et.al. | 2503.04982_(ACL) |
| 2025-03-06 | Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning | Giulio Corallo et.al. | 2503.04973 |
| 2025-03-06 | Markov Chain of Thought for Efficient Mathematical Reasoning | Wen Yang et.al. | 2410.17635_(ACL) |
| 2025-03-06 | DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) | Zongxin Yang et.al. | 2401.08392 |
| 2025-03-05 | Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line | Muhammad Waseem et.al. | 2503.03889 |
| 2025-03-04 | Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression | Nathan Godey et.al. | 2503.02812 |
| 2025-03-03 | WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models | Jian Yuan et.al. | 2503.01330_(ICASSP) |
| 2025-03-01 | Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving | Qihui Zhou et.al. | 2503.00392 |
| 2025-03-01 | Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | Shangzhe Di et.al. | 2503.00540_(ICLR) |
| 2025-02-28 | ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments | Pedro Gimenes et.al. | 2502.21208 |
| 2025-02-27 | ThinK: Thinner Key Cache by Query-Driven Pruning | Yuhui Xu et.al. | 2407.21018_(ICLR) |
| 2025-02-27 | TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning | Soumyabrata Chaudhuri et.al. | 2502.20508 |
| 2025-02-27 | EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance | Yingxin Li et.al. | 2412.08521 |
| 2025-02-24 | ELMo-Tune-V2: LLM-Assisted Full-Cycle Auto-Tuning to Optimize LSM-Based Key-Value Stores | Viraj Thakkar et.al. | 2502.17606 |
| 2025-02-24 | The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? | Zhenheng Tang et.al. | 2502.17535 |
| 2025-02-22 | AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure | The AIBrix Team et.al. | 2504.03648 |
| 2025-02-20 | Compute Or Load KV Cache? Why Not Both? | Shuowei Jin et.al. | 2410.03065 |
| 2025-02-20 | SpinQuant: LLM quantization with learned rotations | Zechun Liu et.al. | 2405.16406_(ICLR) |
| 2025-02-20 | Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents | Axel Backlund et.al. | 2502.15840 |
| 2025-02-20 | Plan-over-Graph: Towards Parallelable LLM Agent Schedule | Shiqi Zhang et.al. | 2502.14563 |
| 2025-02-20 | EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts | Subhajit Chaudhury et.al. | 2502.14280 |
| 2025-02-20 | More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression | Jiebin Zhang et.al. | 2412.12706 |
| 2025-02-19 | Autellix: An Efficient Serving Engine for LLM Agents as General Programs | Michael Luo et.al. | 2502.13965 |
| 2025-02-19 | Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference | Qingfa Xiao et.al. | 2502.13542 |
| 2025-02-17 | Does RAG Really Perform Bad For Long-Context Processing? | Kun Luo et.al. | 2502.11444 |
| 2025-02-16 | An Intelligent Agentic System for Complex Image Restoration Problems | Kaiwen Zhu et.al. | 2410.17809_(ICLR) |
| 2025-02-16 | CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation | Kun-Hui Lee et.al. | 2502.11101 |
| 2025-02-11 | HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment | Youhe Jiang et.al. | 2502.07903_(ICLR) |
| 2025-02-06 | Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents | Chenyang Shao et.al. | 2502.04392 |
| 2025-02-05 | QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring | Dongyoung Lee et.al. | 2501.13331 |
| 2025-02-05 | Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation | Shubham Agarwal et.al. | 2502.15734_(SIGMOD) |
| 2025-02-04 | LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation | Xuan Zhang et.al. | 2410.13846 |
| 2025-02-02 | RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations | Zunhai Su et.al. | 2501.16383 |
| 2025-01-29 | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | Ramya Prabhu et.al. | 2405.04437_(ASPLOS) |
| 2025-01-29 | MACI: Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning | Edward Y. Chang et.al. | 2501.16689 |
| 2025-01-27 | PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization | Mengzhao Chen et.al. | 2410.05265 |
| 2025-01-27 | LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System | Tianfu Wang et.al. | 2501.15749_(WWW) |
| 2025-01-25 | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads | Xingyang He et.al. | 2501.15113 |
| 2025-01-23 | A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention | Heejun Lee et.al. | 2406.09827 |
| 2025-01-22 | Yi-Lightning Technical Report | Alan Wake et.al. | 2412.01253 |
| 2025-01-17 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Zhen Zheng et.al. | 2412.03594 |
| 2025-01-14 | CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning | Guoliang He et.al. | 2501.08071_(CGO) |
| 2025-01-12 | Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management | Liu Qianli et.al. | 2501.06709 |
| 2025-01-06 | The Power of Negative Zero: Datatype Customization for Quantized Large Language Models | Yuzong Chen et.al. | 2501.04052_(ISS) |
| 2024-12-31 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Di Liu et.al. | 2409.10516 |
| 2024-12-24 | TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications | Neiwen Ling et.al. | 2412.18695 |
| 2024-12-23 | Deliberation in Latent Space via Differentiable Cache Augmentation | Luyang Liu et.al. | 2412.17747 |
| 2024-12-22 | VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things | Yaoyao Zhong et.al. | 2312.00401_(AAAI) |
| 2024-12-21 | MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Cunchen Hu et.al. | 2406.17565 |
| 2024-12-21 | SYMPHONY: Improving Memory Management for LLM Inference Workloads | Saurabh Agarwal et.al. | 2412.16434 |
| 2024-12-18 | MagicPIG: LSH Sampling for Efficient LLM Generation | Zhuoming Chen et.al. | 2410.16179 |
| 2024-12-17 | A System for Microserving of LLMs | Hongyi Jin et.al. | 2412.12488 |
| 2024-12-16 | CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation | Hongxuan Zhang et.al. | 2412.11741 |
| 2024-12-16 | Steering Language Models with Game-Theoretic Solvers | Ian Gemp et.al. | 2402.01704 |
| 2024-12-15 | LAW: Legal Agentic Workflows for Custody and Fund Services Contracts | William Watson et.al. | 2412.11063_(COLING) |
| 2024-12-13 | KVDirect: Distributed Disaggregated LLM Inference | Shiyang Chen et.al. | 2501.14743 |
| 2024-12-06 | Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern | Hongyin Tang et.al. | 2412.04757 |
| 2024-12-05 | A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts | Suyu Ge et.al. | 2410.01485 |
| 2024-11-27 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Ao Shen et.al. | 2411.18424 |
| 2024-11-22 | Rapid Integration of LLMs in Healthcare Raises Ethical Concerns: An Investigation into Deceptive Patterns in Social Robots | Robert Ranisch et.al. | 2410.00434 |
| 2024-11-14 | Large Language Models for Power Scheduling: A User-Centric Approach | Thomas Mongaillard et.al. | 2407.00476 |
| 2024-11-08 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Utkarsh Saxena et.al. | 2408.05646 |
| 2024-11-05 | AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution | Zhiqiang Xie et.al. | 2411.03519 |
| 2024-11-05 | SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction | Shlomo Neuberger et.al. | 2411.03397 |
| 2024-11-03 | A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression | Alessio Devoto et.al. | 2406.11430_(EMNLP) |
| 2024-11-02 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Xuanlin Jiang et.al. | 2411.01142 |
| 2024-11-01 | Understanding Communication Preferences of Information Workers in Engagement with Text-Based Conversational Agents | Ananya Bhattacharjee et.al. | 2410.20468 |
| 2024-10-31 | ALISE: Accelerating Large Language Model Serving with Speculative Scheduling | Youpeng Zhao et.al. | 2410.23537_(ICC) |
| 2024-10-25 | Fast Inference for Augmented Large Language Models | Rana Shahout et.al. | 2410.18248 |
| 2024-10-21 | Do Large Language Models Need a Content Delivery Network? | Yihua Cheng et.al. | 2409.13761 |
| 2024-10-17 | LLoCO: Learning Long Contexts Offline | Sijun Tan et.al. | 2404.07979_(EMNLP) |
| 2024-10-16 | COMET: Towards Practical W4A4KV4 LLMs Serving | Lian Liu et.al. | 2410.12168 |
| 2024-10-14 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Guangxuan Xiao et.al. | 2410.10819 |
| 2024-10-11 | OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents | Yuwei Yan et.al. | 2410.21286 |
| 2024-10-09 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Yi Xiong et.al. | 2410.00428 |
| 2024-10-08 | KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | Jiayi Yuan et.al. | 2407.01527 |
| 2024-10-07 | Fast State Restoration in LLM Serving with HCache | Shiwei Gao et.al. | 2410.05004_(EuroSys) |
| 2024-10-07 | KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head | Isaac Rehg et.al. | 2410.00161 |
| 2024-10-06 | SafeLLM: Domain-Specific Safety Monitoring for Large Language Models: A Case Study of Offshore Wind Maintenance | Connor Walker et.al. | 2410.10852 |
| 2024-10-04 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Rongzhi Zhang et.al. | 2410.03111 |
| 2024-10-03 | Preble: Efficient Distributed Prompt Scheduling for LLM Serving | Vikranth Srivatsa et.al. | 2407.00023 |
| 2024-10-03 | Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1 | Karthik Valmeekam et.al. | 2410.02162 |
| 2024-09-23 | BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models | Bodun Hu et.al. | 2404.18322 |
| 2024-09-16 | Scalable Differential Privacy Mechanisms for Real-Time Machine Learning Applications | Jessica Smith et.al. | 2410.02462 |
| 2024-09-11 | Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU | Zhenyu Ning et.al. | 2409.09086 |
| 2024-08-05 | SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving | Andreas Kosmas Kakolyris et.al. | 2408.05235 |
| 2024-08-04 | TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding | Hanshi Sun et.al. | 2404.11912 |
| 2024-08-01 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Lu Ye et.al. | 2402.15220_(ACL) |
| 2024-07-26 | Collaborative Evolving Strategy for Automatic Data-Centric Development | Xu Yang et.al. | 2407.18690 |
| 2024-07-25 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Zirui Liu et.al. | 2402.02750_(ICML) |
| 2024-07-23 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Piotr Nawrot et.al. | 2403.09636 |
| 2024-07-22 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Jiale Xu et.al. | 2407.15309 |
| 2024-07-22 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Hanlin Tang et.al. | 2407.15891 |
| 2024-07-21 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Zheng Wang et.al. | 2407.08454 |
| 2024-07-19 | CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | Yuhan Liu et.al. | 2310.07240_(SIGCOMM) |
| 2024-07-18 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Amir Zandieh et.al. | 2406.03482 |
| 2024-07-11 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | Ben Athiwaratkun et.al. | 2403.08845 |
| 2024-06-30 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Bin Gao et.al. | 2403.19708_(ATC) |
| 2024-06-28 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Wonbeom Lee et.al. | 2406.19707_(OSDI) |
| 2024-06-16 | EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism | Yanxi Chen et.al. | 2312.04916_(ICML) |
| 2024-06-08 | QCQA: Quality and Capacity-aware grouped Query Attention | Vinay Joshi et.al. | 2406.10247 |
| 2024-06-06 | SGLang: Efficient Execution of Structured Language Model Programs | Lianmin Zheng et.al. | 2312.07104 |
| 2024-05-13 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Jordan Juravsky et.al. | 2402.05099 |
| 2024-05-06 | Federated Reinforcement Learning with Constraint Heterogeneity | Hao Jin et.al. | 2405.03236 |
| 2024-05-01 | Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing | KV Aditya Srivatsa et.al. | 2405.00467_(ACL) |
| 2024-04-15 | Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models | Siyan Zhao et.al. | 2404.09529 |
| 2024-04-06 | The Case for Developing a Foundation Model for Planning-like Tasks from Scratch | Biplav Srivastava et.al. | 2404.04540 |
| 2024-03-26 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Youpeng Zhao et.al. | 2403.17312_(ISCA) |
| 2024-03-18 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Jiaao He et.al. | 2403.11421 |
| 2024-03-04 | DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving | Foteini Strati et.al. | 2403.01876 |
| 2024-03-04 | LLM-based Smart Reply (LSR): Enhancing Collaborative Performance with ChatGPT-mediated Smart Reply System | Ashish Bastola et.al. | 2306.11980 |
| 2024-02-04 | Conversational Crowdsensing: A Parallel Intelligence Powered Novel Sensing Approach | Zhengqiu Zhu et.al. | 2402.06654 |
| 2024-01-20 | On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS) | Vishal Pallagani et.al. | 2401.02500 |
| 2023-12-26 | Natural Language based Context Modeling and Reasoning for Ubiquitous Computing with Large Language Models: A Tutorial | Haoyi Xiong et.al. | 2309.15074 |
| 2023-11-09 | Towards A Natural Language Interface for Flexible Multi-Agent Task Assignment | Jake Brawer et.al. | 2311.00153 |
| 2023-10-30 | SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models | Hongxin Li et.al. | 2305.19308_(NeurIPS) |
| 2023-09-19 | MindAgent: Emergent Gaming Interaction | Ran Gong et.al. | 2309.09971 |
| 2023-09-12 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Woosuk Kwon et.al. | 2309.06180_(SOSP) |
| 2023-06-09 | S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput | Yunho Jin et.al. | 2306.06000 |