LLM Papers

Updated on 2026.04.03

Publish Date Title Authors PDF
2026-04-01 Universal YOCO for Efficient Depth Scaling Yutao Sun et.al. 2604.01220
2026-03-30 Low-Latency Edge LLM Handover via Joint KV Cache Transfer and Token Prefill Seunghun Lee et.al. 2603.28018
2026-03-29 Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts Sijia Luo et.al. 2601.10079
2026-03-28 CoDec: Prefix-Shared Decoding Kernel for LLMs Zhibin Wang et.al. 2505.17694
2026-03-27 LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference Jiawei Yi et.al. 2511.14510
2026-03-26 Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration Akhiad Bercovich et.al. 2602.11937
2026-03-25 Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning Adnan Oomerjee et.al. 2505.16950
2026-03-25 ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators Guoqiang Zou et.al. 2512.09427
2026-03-25 LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling Dingyan Zhang et.al. 2603.15202
2026-03-24 PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving Wenfeng Wang et.al. 2603.23049
2026-03-24 StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving Azam Nouri et.al. 2603.28795
2026-03-22 The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project Huamin Chen et.al. 2603.21354
2026-03-20 Understanding and Optimizing Multi-Stage AI Inference Pipelines Abhimanyu Rajeshkumar Bambhaniya et.al. 2504.09775
2026-03-20 KV Cache Optimization Strategies for Scalable and Efficient LLM Inference Yichun Xu et.al. 2603.20397
2026-03-20 Trained Persistent Memory for Frozen Decoder-Only LLMs Hong Jeong et.al. 2603.22329
2026-03-19 StreamingThinker: Large Language Models Can Think While Reading Junlong Tong et.al. 2510.17238_(ICLR)
2026-03-18 Multi-stage Flow Scheduling for LLM Serving Yijun Sun et.al. 2603.17456
2026-03-18 The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency Huamin Chen et.al. 2603.17280
2026-03-18 Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs Tuowei Wang et.al. 2603.17803
2026-03-18 Learning When to Attend: Conditional Memory Access for Long-Context LLMs Sakshi Choudhary et.al. 2603.17484
2026-03-18 IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems Hongze Liu et.al. 2603.17302
2026-03-17 EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval Zebin Yang et.al. 2510.18546_(NeurIPS)
2026-03-17 Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective Noppanat Wadlom et.al. 2603.16104
2026-03-17 Efficient Reasoning on the Edge Yelysei Bondarenko et.al. 2603.16867
2026-03-16 PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel Jinjun Yi et.al. 2511.22333_(ASPLOS)
2026-03-16 Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation Yanick Zengaffinen et.al. 2603.15547
2026-03-15 Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models Ying Xie et.al. 2603.14517
2026-03-15 Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys Xu Yang et.al. 2603.14224
2026-03-14 StreamingTOM: Streaming Token Compression for Efficient Video Understanding Xueyi Chen et.al. 2510.18269_(CVPR)
2026-03-14 DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs Lizhuo Luo et.al. 2602.05992
2026-03-13 Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity Donglin Yu et.al. 2603.12707
2026-03-13 Orla: A Library for Serving LLM-Based Multi-Agent Systems Rana Shahout et.al. 2603.13605
2026-03-13 StatePlane: A Cognitive State Plane for Long-Horizon AI Systems Under Bounded Context Sasank Annapureddy et.al. 2603.13644
2026-03-12 Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache Xinhai Wang et.al. 2603.13420
2026-03-11 KV Cache Transform Coding for Compact Storage in LLM Inference Konrad Staniszewski et.al. 2511.01815_(ICLR)
2026-03-11 Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents Kaiyu Zhou et.al. 2601.10955
2026-03-10 Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework Kerui Huang et.al. 2509.14093
2026-03-10 ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System Hao Kang et.al. 2602.13692
2026-03-09 FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference Guangda Liu et.al. 2505.13109
2026-03-09 EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs Chang Han et.al. 2603.08088
2026-03-09 LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing Dongfang Li et.al. 2603.08453
2026-03-09 Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving Zongze Li et.al. 2603.13358
2026-03-08 Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs Raghavv Goel et.al. 2603.07475_(ICLR)
2026-03-06 Good-Enough LLM Obfuscation (GELO) Anatoly Belikov et.al. 2603.05035
2026-03-06 MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens Yu Chen et.al. 2603.23516
2026-03-05 Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator Cong Li et.al. 2603.04797
2026-03-05 InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context Xin Teng et.al. 2603.05353
2026-03-03 xLLM Technical Report Tongxuan Liu et.al. 2510.14686
2026-03-03 Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving Rui Li et.al. 2512.22420
2026-03-02 AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size Guanxi Lu et.al. 2509.26432_(ICLR)
2026-03-02 OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration Xinyue Ma et.al. 2601.10729_(VLDB)
2026-03-02 Multi-Layer Scheduling for MoE-Based LLM Reasoning Yifan Sun et.al. 2602.21626
2026-03-02 FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving Shouwei Gao et.al. 2602.22593_(ICS)
2026-03-02 Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics Samhruth Ananthanarayanan et.al. 2603.01426
2026-03-01 Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs Ngoc Bui et.al. 2512.03324
2026-03-01 Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention Mengqi Liao et.al. 2603.08743
2026-02-28 FASA: Frequency-aware Sparse Attention Yifei Wang et.al. 2602.03152_(ICLR)
2026-02-28 RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse Yingsheng Geng et.al. 2603.13289
2026-02-27 SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning Sanjay Kariyappa et.al. 2602.22603
2026-02-27 ICaRus: Identical Cache Reuse for Efficient Multi Model Inference Sunghyeon Woo et.al. 2603.13281
2026-02-26 DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference Yongtong Wu et.al. 2602.21548
2026-02-23 ContextPilot: Fast Long-Context Inference via Context Reuse Yinsicheng Jiang et.al. 2511.03475
2026-02-21 Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs Subham Sekhar Sahoo et.al. 2506.01928
2026-02-20 Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning Lexiang Tang et.al. 2602.18232
2026-02-19 ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs Jianlong Lei et.al. 2603.08727
2026-02-17 Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs HaoYuan Hu et.al. 2408.00539
2026-02-14 KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider Jiahao Wang et.al. 2506.02634_(ATC)
2026-02-13 Doc-to-LoRA: Learning to Instantly Internalize Contexts Rujikorn Charakorn et.al. 2602.15902
2026-02-12 Efficient Remote Prefix Fetching with GPU-native Media ASICs Liang Mi et.al. 2602.09725
2026-02-12 SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining Yifan Zhang et.al. 2602.10718
2026-02-12 PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving Sunghyeon Woo et.al. 2602.12029
2026-02-12 GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing Alessio Ricci Toniolo et.al. 2602.11688
2026-02-12 PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System Lian Liu et.al. 2602.11521
2026-02-10 LLM Serving Optimization with Variable Prefill and Decode Lengths Meixuan Wang et.al. 2508.06133
2026-02-10 ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs Yanlin Qi et.al. 2602.07721
2026-02-10 Learning to Evict from Key-Value Cache Luca Moschella et.al. 2602.10238
2026-02-09 Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction Jang-Hyun Kim et.al. 2601.17668_(FAST)
2026-02-09 Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference Yifei Gao et.al. 2602.08329
2026-02-09 Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference Kexin Chu et.al. 2508.08438
2026-02-08 Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model Tianyi Wang et.al. 2602.07878
2026-02-08 DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity Jitai Hao et.al. 2602.08005
2026-02-06 DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving Ying Yuan et.al. 2602.06502
2026-02-03 Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning Zhicheng Yang et.al. 2602.03249
2026-02-03 ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution Zican Dong et.al. 2602.03203
2026-02-03 PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference Rui Ning et.al. 2602.06072
2026-02-02 RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse Mingrui Liu et.al. 2602.01795
2026-02-02 CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling Runsong Zhao et.al. 2602.01766
2026-02-02 You Need an Encoder for Native Position-Independent Caching Shiju Zhao et.al. 2602.01519
2026-02-02 State Rank Dynamics in Linear Attention LLMs Ao Sun et.al. 2602.02195
2026-01-31 FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning Hao Mark Chen et.al. 2509.00195_(ASPLOS)
2026-01-30 Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models Elias Hossain et.al. 2510.17098
2026-01-30 Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live Hanchen Li et.al. 2511.02230
2026-01-30 CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control Qiaoling Chen et.al. 2601.22705
2026-01-30 Towards Resiliency in Large Language Model Serving with KevlarFlow Shangshu Qian et.al. 2601.22438
2026-01-30 Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference Yiding Feng et.al. 2601.22996
2026-01-30 Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference Nikhil Gopal et.al. 2602.00328
2026-01-29 Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving Chendong Song et.al. 2601.21351_(ICML)
2026-01-29 Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis Qingyue Yang et.al. 2601.21709_(ICLR)
2026-01-28 SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips Jiahuan Yu et.al. 2601.20309
2026-01-28 Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning Zeyu Xing et.al. 2601.20326_(ICLR)
2026-01-28 ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference Ketan Thakkar et.al. 2601.21109
2026-01-27 Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding Zhongyu Xiao et.al. 2601.17917
2026-01-26 Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective Fangzhou Wu et.al. 2601.18999_(ICLR)
2026-01-21 QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design Nilesh Prasad Pandey et.al. 2601.14549
2026-01-20 KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments Junyoung Park et.al. 2504.15364_(NeurIPS)
2026-01-20 ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management Jing Zou et.al. 2601.13631
2026-01-20 HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference Zhiyuan Shi et.al. 2601.13684
2026-01-20 LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems Badri N. Patro et.al. 2601.14053
2026-01-19 Cache Your Prompt When It’s Green: Carbon-Aware Caching for Large Language Model Serving Yuyang Tian et.al. 2505.23970
2026-01-19 SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences Jungyoub Cha et.al. 2505.20776
2026-01-19 Batch Query Processing and Optimization for Agentic Workflows Junyi Shen et.al. 2509.02121
2026-01-19 Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference Anish Biswas et.al. 2601.12967
2026-01-19 From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation Jiahao Wang et.al. 2601.12904
2026-01-18 Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline Jiawei Xu et.al. 2601.12307
2026-01-16 RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation Amna Masood et.al. 2601.11822
2026-01-15 Online Scheduling for LLM Inference with KV Cache Constraints Patrick Jaillet et.al. 2502.07115
2026-01-15 AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving Shaoting Feng et.al. 2509.00105_(SOSP)
2026-01-15 Hardware Acceleration for Neural Networks: A Comprehensive Survey Bin Xu et.al. 2512.23914
2026-01-14 APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs Jiakun Fan et.al. 2506.03296
2026-01-14 CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation Hasan Akgul et.al. 2510.19670
2026-01-14 Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling Zhixiang Liang et.al. 2601.09093
2026-01-13 TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL Jinbo Su et.al. 2601.08743
2026-01-12 Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference Rei Taniguchi et.al. 2601.07667_(CHI)
2026-01-11 Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models Haoyu Wang et.al. 2506.07334
2026-01-11 Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention Zhen Yang et.al. 2510.13940
2026-01-09 AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving Tianhao Xu et.al. 2601.06288
2026-01-07 MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing Zhaoyuan Su et.al. 2506.02006
2026-01-07 InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing Shuaiyi Li et.al. 2505.22156
2026-01-06 Making MoE-based LLM Inference Resilient with Tarragon Songyu Zhang et.al. 2601.01310
2026-01-06 Joint Encoding of KV-Cache Blocks for Scalable LLM Serving Joseph Kampeas et.al. 2601.03067
2026-01-05 Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints Ruicheng Ao et.al. 2504.11320
2026-01-05 LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference Hossein Rajabzadeh et.al. 2601.02569
2026-01-05 Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle Zihan Wang et.al. 2601.16986
2026-01-03 Warp-Cortex: An Asynchronous, Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware Jorge L. Ruiz Williams et.al. 2601.01298
2026-01-03 KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs Yixuan Tang et.al. 2601.01046
2025-12-28 Attention Is All You Need for KV Cache in Diffusion LLMs Quan Nguyen-Tri et.al. 2510.14973
2025-12-28 WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference Aiwei Liu et.al. 2512.22737
2025-12-25 PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System Hyucksung Kwon et.al. 2412.20166_(CHI)
2025-12-25 Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths Ahilan Ayyachamy Nadar Ponnusamy et.al. 2601.11564
2025-12-24 V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval Donghyuk Kim et.al. 2512.12284_(HPCA)
2025-12-22 MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning Tao Zhang et.al. 2512.19206
2025-12-20 TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale Dongha Yoon et.al. 2512.18194
2025-12-20 MatKV: Trading Compute for Flash Storage in LLM Inference Kun-Woo Shin et.al. 2512.22195_(ICDE)
2025-12-19 xGR: Efficient Generative Recommendation Serving at Scale Qingxiao Sun et.al. 2512.11529
2025-12-18 MEPIC: Memory Efficient Position Independent Caching for LLM Serving Qian Wang et.al. 2512.16822
2025-12-17 CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing Kuan Lu et.al. 2512.15550
2025-12-17 Dynamic Rebatching for Efficient Early-Exit Inference with DREX Xuting Liu et.al. 2512.15705
2025-12-16 SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching Xinye Zhao et.al. 2509.24832
2025-12-16 Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents Hongqiu Ni et.al. 2512.14142
2025-12-16 EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving Shaoting Feng et.al. 2512.14946
2025-12-16 Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading William Meng et.al. 2601.19910
2025-12-15 Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing Zewen Qiang et.al. 2512.13109
2025-12-14 Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving Hui Zeng et.al. 2511.06029_(AAAI)
2025-12-12 Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference Adilet Metinov et.al. 2512.11221
2025-12-12 Hold Onto That Thought: Assessing KV Cache Compression On Reasoning Minghui Liu et.al. 2512.12008
2025-12-11 Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders Qingsen Ma et.al. 2512.10547
2025-12-11 CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving Dong Liu et.al. 2512.11920_(FPGA)
2025-12-10 SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators Jonathan Li et.al. 2511.03092
2025-12-10 PerCache: Predictive Hierarchical Cache for RAG Applications on Mobile Devices Kaiwei Liu et.al. 2601.11553
2025-12-08 H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference Zizhuo Fu et.al. 2508.16653_(ICC)
2025-12-08 Leveraging KV Similarity for Online Structured Pruning in LLMs Jungmin Lee et.al. 2512.07090
2025-12-07 KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models Sourjya Roy et.al. 2512.06727
2025-12-07 ELANA: A Simple Energy and Latency Analyzer for LLMs Hung-Yueh Chiang et.al. 2512.09946
2025-12-05 LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference Yuhan Liu et.al. 2510.09665
2025-12-04 KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs Prashant Pandey et.al. 2512.11851
2025-12-02 SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification Zhendong Tan et.al. 2512.02337
2025-12-01 Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity Wenbin Zhu et.al. 2512.01357
2025-12-01 KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction Aomufei Yuan et.al. 2512.17917
2025-11-30 SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving Bohan Zhao et.al. 2512.00719
2025-11-30 SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs Jiaming Xu et.al. 2512.00722_(ASPLOS)
2025-11-29 G-KV: Decoding-Time KV Cache Eviction with Global Attention Mengqi Liao et.al. 2512.00504
2025-11-27 Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression Boris Kriuk et.al. 2512.17914
2025-11-26 No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha Amey Agrawal et.al. 2409.17264
2025-11-26 Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA Allison Li et.al. 2512.17910
2025-11-25 On 10x Better Scalability: KV Stores Scale Up KV Cache Weiping Yu et.al. 2511.16138
2025-11-25 Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation Inferix Team et.al. 2511.20714
2025-11-24 SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression Santhosh G S et.al. 2511.18936
2025-11-24 ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models Long Lian et.al. 2512.07843
2025-11-23 Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost Haojun Xia et.al. 2511.18643
2025-11-20 KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference Xing Li et.al. 2502.04420_(ICML)
2025-11-18 Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models Rui Zhu et.al. 2511.14694
2025-11-17 Hogwild! Inference: Parallel LLM Generation via Concurrent Attention Gleb Rodionov et.al. 2504.06261_(NeurIPS)
2025-11-14 Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications Jiaxi Li et.al. 2601.08833
2025-11-13 LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components Yaru Li et.al. 2511.10394_(ISS)
2025-11-13 $A^3$: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving Yuechi Zhou et.al. 2511.17560
2025-11-11 Glia: A Human-Inspired AI for Automated Systems Design and Optimization Pouya Hamadanian et.al. 2510.27176
2025-11-10 KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse Jingbo Yang et.al. 2502.16002
2025-11-10 StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression Yilong Chen et.al. 2511.07278
2025-11-09 SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention Bohan Yu et.al. 2511.06446_(AAAI)
2025-11-08 Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching Yanhao Dong et.al. 2504.06319
2025-11-07 Inference-Time Hyper-Scaling with KV Cache Compression Adrian Łańcucki et.al. 2506.05345_(NeurIPS)
2025-11-07 Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations Maurizio Diaz et.al. 2508.17032
2025-11-06 SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference Tian Xia et.al. 2505.24095
2025-11-06 HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts Neil He et.al. 2505.24722
2025-11-06 Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing Mingyu Sung et.al. 2511.04002
2025-11-06 DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing Lei Gao et.al. 2511.04791
2025-11-05 FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs Xuan He et.al. 2511.00807_(ISS)
2025-11-05 ALAS: Transactional and Dynamic Multi-Agent LLM Planning Longling Geng et.al. 2511.03094
2025-11-05 AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism Wendong Xu et.al. 2511.11617_(DATE)
2025-11-04 LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context Yudong Li et.al. 2511.02366
2025-11-04 Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration Jingbo Wang et.al. 2511.02200
2025-11-03 Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving Chengying Huan et.al. 2511.01633
2025-11-03 TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks Hanwen Xu et.al. 2511.01527
2025-11-02 HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL You Peng et.al. 2505.05286
2025-11-02 FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management Nazmul Takbir et.al. 2511.00868
2025-11-01 KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems Hancheng Ye et.al. 2510.12872_(FAST)
2025-11-01 A CPU-Centric Perspective on Agentic AI Ritik Raj et.al. 2511.00739
2025-11-01 EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory Wenzhe Fan et.al. 2511.01912
2025-11-01 Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization Massinissa Merouani et.al. 2511.00592_(CHI)
2025-10-31 Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications Zhuohang Bian et.al. 2510.18586
2025-10-31 Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits Dowon Kim et.al. 2511.00321
2025-10-30 Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling Reda El Makroum et.al. 2510.26603
2025-10-29 Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving Ruihao Li et.al. 2507.11507
2025-10-29 Serve Programs, Not Prompts In Gim et.al. 2510.25412_(HotOS)
2025-10-29 KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA Zhuo Chen et.al. 2510.25101
2025-10-28 Pie: A Programmable Serving System for Emerging LLM Applications In Gim et.al. 2510.24051_(SOSP)
2025-10-28 Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion Xianjun Gao et.al. 2510.24390
2025-10-28 From Narrative to Action: A Hierarchical LLM-Agent Framework for Human Mobility Generation Qiumeng Li et.al. 2510.24802
2025-10-26 Batch Speculative Decoding Done Right Ranran Haoran Zhang et.al. 2510.22876
2025-10-26 SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size Jinhan Chen et.al. 2510.22556
2025-10-26 SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming Adhyayan Veer Singh et.al. 2510.22626
2025-10-24 Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning Jiwon Song et.al. 2505.13866
2025-10-23 Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning Yu Fu et.al. 2410.19258_(ICLR)
2025-10-23 Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution Zhiyang Chen et.al. 2510.15312
2025-10-23 HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement Danying Ge et.al. 2510.20878
2025-10-22 PTFA: An LLM-based Agent that Facilitates Online Consensus Building through Parallel Thinking Wen Gu et.al. 2503.12499
2025-10-21 The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems Linke Song et.al. 2409.20002_(ICS)
2025-10-21 Reasoning Language Model Inference Serving Unveiled: An Empirical Study Qi Li et.al. 2510.18672
2025-10-21 The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability Zijie Xu et.al. 2510.18563
2025-10-19 STARK: Strategic Team of Agents for Refining Kernels Juncheng Dong et.al. 2510.16996
2025-10-18 Ripple Effect Protocol: Coordinating Agent Populations Ayush Chopra et.al. 2510.16572
2025-10-16 Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies Mason Nakamura et.al. 2510.14312
2025-10-16 Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing Tianhua Xia et.al. 2510.16040
2025-10-15 LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning Haoyue Zhang et.al. 2506.15969
2025-10-15 BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure Yiyuan He et.al. 2510.13223
2025-10-15 Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving Nikos Pagonas et.al. 2510.14126
2025-10-14 Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models Rabimba Karanjai et.al. 2510.12080
2025-10-13 MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE Soheil Zibakhsh et.al. 2509.17238
2025-10-13 Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models Junhyuck Kim et.al. 2510.10964
2025-10-13 Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony Han Lu et.al. 2510.11345
2025-10-11 CacheClip: Accelerating RAG with Effective KV Cache Reuse Bin Yang et.al. 2510.10129
2025-10-11 Agentic Troubleshooting Guide Automation for Incident Management Jiayi Mao et.al. 2510.10074
2025-10-10 OrcaLoca: An LLM Agent Framework for Software Issue Localization Zhongming Yu et.al. 2502.00350
2025-10-09 Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval Wenhao Li et.al. 2508.19740
2025-10-08 AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs Peize He et.al. 2510.07293
2025-10-08 AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding Shuqing Luo et.al. 2510.07486
2025-10-08 FLEET: Formal Language-Grounded Scheduling for Heterogeneous Robot Teams Corban Rivera et.al. 2510.07417
2025-10-07 VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization Dingyu Yao et.al. 2510.06175
2025-10-07 H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference Harshil Vejendla et.al. 2510.05529
2025-10-06 ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering Yuki Imajuku et.al. 2506.09050_(NeurIPS)
2025-10-06 Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning Edward Y. Chang et.al. 2510.04488
2025-10-03 TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling Junyi Chen et.al. 2510.02758_(EuroSys)
2025-10-03 Automatic Building Code Review: A Case Study Hanlong Wan et.al. 2510.02634
2025-10-02 QSpec: Speculative Decoding with Complementary Quantization Schemes Juntao Zhao et.al. 2410.11305
2025-10-02 KaVa: Latent Reasoning via Compressed KV-Cache Distillation Anna Kuzina et.al. 2510.02312
2025-10-02 ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models Gursimran Singh et.al. 2510.02613_(ISS)
2025-10-02 KVComm: Enabling Efficient LLM Communication through Selective KV Sharing Xiangyu Shi et.al. 2510.03346
2025-10-01 Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies Kyoungmin Kim et.al. 2411.07447
2025-09-30 KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction Jang-Hyun Kim et.al. 2505.23416_(NeurIPS)
2025-09-30 AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges Ranjan Sapkota et.al. 2505.10468
2025-09-30 Towards Agentic OS: An LLM Agent Framework for Linux Schedulers Yusheng Zheng et.al. 2509.01245
2025-09-29 Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models Keda Tao et.al. 2503.16257
2025-09-29 KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation Ching Han Chen et.al. 2505.07618
2025-09-29 SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching Yuxuan Zhu et.al. 2504.00970
2025-09-29 SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving Qihui Zhou et.al. 2509.24626
2025-09-29 SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents Gyuhyeon Seo et.al. 2509.24282
2025-09-28 From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle Kaustubh Vyas et.al. 2412.12839_(ICLR)
2025-09-28 HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models Zhinan Xie et.al. 2509.23928
2025-09-27 Runtime Adaptive Pruning for LLM Inference Huanrong Liu et.al. 2505.17138
2025-09-27 ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration Xianglong Yan et.al. 2505.24357
2025-09-27 READER: Retrieval-Assisted Drafter for Efficient LLM Inference Maxim Divilkovskiy et.al. 2508.09072
2025-09-26 KV Cache Steering for Controlling Frozen LLMs Max Belitsky et.al. 2507.08799
2025-09-26 ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration Gaole Dai et.al. 2509.21823
2025-09-26 Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents Heyang Gao et.al. 2510.03253
2025-09-26 LLM Assisted Alpha Fairness for 6 GHz WiFi and NR-U Coexistence: An Agentic Orchestrator for Throughput, Energy, and SLA Qun Wang et.al. 2510.17814
2025-09-25 HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling Zahra Yousefijamarani et.al. 2508.15919
2025-09-25 Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization Yuhang Xu et.al. 2509.21301
2025-09-24 UNComp: Can Matrix Entropy Uncover Sparsity? – A Compressor Design from an Uncertainty-Aware Perspective Jing Xiong et.al. 2410.03090_(EMNLP)
2025-09-24 Gyges: Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference Haoyu Chen et.al. 2509.19729
2025-09-24 CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks Jiewei Chen et.al. 2509.19855
2025-09-22 A Large Language Model-based multi-agent manufacturing system for intelligent shopfloor Zhen Zhao et.al. 2405.16887
2025-09-22 Attention Sinks: A ‘Catch, Tag, Release’ Mechanism for Embeddings Stephen Zhang et.al. 2502.00919
2025-09-22 Asteria: Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access Chaoyi Ruan et.al. 2509.17360
2025-09-21 ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching Xingyu Xiang et.al. 2509.16857
2025-09-20 EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs Zhengge Cai et.al. 2509.16686
2025-09-20 Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games Niv Eckhaus et.al. 2506.05309
2025-09-19 Overhearing LLM Agents: A Survey, Taxonomy, and Roadmap Andrew Zhu et.al. 2509.16325
2025-09-17 CrowdAgent: Multi-Agent Managed Multi-Source Annotation System Maosheng Qin et.al. 2509.14030
2025-09-16 FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference Dongwei Wang et.al. 2508.08256_(EMNLP)
2025-09-15 Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System Yunhua Fang et.al. 2508.13231_(CHI)
2025-09-15 FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving Kyungmin Bin et.al. 2509.06261
2025-09-08 Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing Rui Xie et.al. 2509.03377
2025-09-08 Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models Yinjie Wang et.al. 2509.06949
2025-09-03 ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving Yifan Qiao et.al. 2410.01228
2025-09-01 TRACE-CS: A Hybrid Logic-LLM System for Explainable Course Scheduling Stylianos Loukas Vasileiou et.al. 2409.03671
2025-09-01 LLMs cannot spot math errors, even when allowed to peek into the solution KV Aditya Srivatsa et.al. 2509.01395_(EMNLP)
2025-08-30 DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction Yanqi Zhang et.al. 2412.03131_(SOSP)
2025-08-30 LLM-Assisted Iterative Evolution with Swarm Intelligence Toward SuperBrain Li Weigang et.al. 2509.00510
2025-08-29 Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward Yong Deng et.al. 2508.12800
2025-08-28 TinyServe: Query-Aware Cache Selection for Efficient LLM Serving Dong Liu et.al. 2509.12211_(ACM MM)
2025-08-26 Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing Junyi Wen et.al. 2507.08045
2025-08-26 Strata: Hierarchical Context Caching for Long Context Language Model Serving Zhiqiang Xie et.al. 2508.18572
2025-08-24 PRISM: Efficient Long-Range Reasoning With Short-Context LLMs Dulhan Jayalath et.al. 2412.18914_(EMNLP)
2025-08-21 Efficient Mixed-Precision Large Language Model Inference with TurboMind Li Zhang et.al. 2508.15601
2025-08-20 Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration Peilin Ji et.al. 2508.14654
2025-08-18 Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis Ayoub Ben Chaliah et.al. 2508.13382
2025-08-17 ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads Zhuorui Liu et.al. 2508.12407
2025-08-15 UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs? Mukund Choudhary et.al. 2508.11260
2025-08-14 SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression Mengjie Li et.al. 2508.15806
2025-08-14 ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs Keyu Chen et.al. 2508.08895
2025-08-12 Retrospective Sparse Attention for Efficient Long-Context Generation Seonghwan Choi et.al. 2508.09001
2025-08-12 Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation Jiongchi Yu et.al. 2508.07745
2025-08-12 AIOS: LLM Agent Operating System Kai Mei et.al. 2403.16971
2025-08-11 Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories Ming-Yen Lee et.al. 2508.08457
2025-08-11 From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework Yunkai Hu et.al. 2508.08147
2025-08-09 Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud Jinyuan Chen et.al. 2508.06948
2025-08-06 p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay Jun Zhang et.al. 2412.04449_(ICC)
2025-08-06 StackPilot: Autonomous Function Agents for Scalable and Environment-Free Code Execution Xinkui Zhao et.al. 2508.11665
2025-08-06 AquaChat++: LLM-Assisted Multi-ROV Inspection for Aquaculture Net Pens with Integrated Battery Management and Thruster Fault Tolerance Abdelhaleem Saad et.al. 2508.06554
2025-08-05 REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks Longling Geng et.al. 2502.18836
2025-08-04 CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation Xiaolin Lin et.al. 2508.02401
2025-08-01 CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization Yuning Jiang et.al. 2508.00478
2025-07-30 A Survey on Large Language Model Acceleration based on KV Cache Management Haoyang Li et.al. 2412.19442
2025-07-29 Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling Rajeev Patwari et.al. 2508.00904
2025-07-29 StaffPro: an LLM Agent for Joint Staffing and Profiling Alessio Maritan et.al. 2507.21636
2025-07-26 FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression Runchao Li et.al. 2507.20030
2025-07-25 Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding StepFun et.al. 2507.19427
2025-07-24 NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database Weizhi Fei et.al. 2507.18028
2025-07-22 Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning Hongyin Luo et.al. 2507.16784
2025-07-21 LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra Seth Karten et.al. 2507.15815
2025-07-18 DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation Ziqi Wang et.al. 2507.14267
2025-07-18 CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education Jianing Zhao et.al. 2507.13814
2025-07-14 InstCache: A Predictive Cache for LLM Serving Longwei Zou et.al. 2411.13820
2025-07-14 DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving Yuhan Liu et.al. 2411.02820
2025-07-10 Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores Vivek Chari et.al. 2507.08143
2025-07-10 KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows Zaifeng Pan et.al. 2507.07400
2025-07-09 Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration Xinyuan Song et.al. 2507.06520
2025-07-08 OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety Sanidhya Vijayvargiya et.al. 2507.06134
2025-07-07 StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling Meng Wei et.al. 2507.05240
2025-07-04 Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought Tencent Hunyuan Team et.al. 2505.15431
2025-07-01 VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator Zhican Wang et.al. 2507.00797_(DAC)
2025-07-01 EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens Chaoqun Yang et.al. 2507.00715_(KDD)
2025-06-30 Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC Xinming Wei et.al. 2506.24045
2025-06-30 RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference Yaoqi Chen et.al. 2505.02922
2025-06-28 Efficiently Serving Large Multimodal Models Using EPD Disaggregation Gursimran Singh et.al. 2501.05460
2025-06-28 FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets Shrenik Jadhav et.al. 2506.22708
2025-06-27 Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference Yaohua Tang et.al. 2502.15294
2025-06-26 CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation Nicolas Bougie et.al. 2506.21805
2025-06-26 MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models Yifan Liu et.al. 2506.21784
2025-06-25 MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation Gurusha Juneja et.al. 2506.20737
2025-06-23 RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding Guanzheng Chen et.al. 2502.20330_(ICML)
2025-06-18 eLLM: Elastic Memory Management Framework for Efficient LLM Serving Jiale Xu et.al. 2506.15155
2025-06-18 Moment Sampling in Video LLMs for Long-Form Video QA Mustafa Chasmai et.al. 2507.00033_(CVPR)
2025-06-17 LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification Penghui Yang et.al. 2502.17421
2025-06-16 AlphaEvolve: A coding agent for scientific and algorithmic discovery Alexander Novikov et.al. 2506.13131
2025-06-14 ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression Guangda Liu et.al. 2412.03213
2025-06-13 FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference Runheng Liu et.al. 2405.04065_(ACL)
2025-06-12 SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding Ziyi Zhang et.al. 2506.11309
2025-06-12 SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models Kaushal Kumar Maurya et.al. 2408.08545
2025-06-11 SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems Peiran Li et.al. 2506.07564
2025-06-09 DeepServe: Serverless Large Language Model Serving at Scale Junhao Hu et.al. 2501.14417
2025-06-08 MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache Akshat Sharma et.al. 2411.18077
2025-06-07 EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments Sara Fish et.al. 2503.18825
2025-06-05 Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback Junior Cedric Tonga et.al. 2506.04920_(ISS)
2025-06-04 KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation Chaoyi Jiang et.al. 2411.17089_(ACL)
2025-06-04 HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing Minghui Liu et.al. 2412.16187
2025-06-04 AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance Dhaval Patel et.al. 2506.03828
2025-06-03 A$^2$ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization Junhui He et.al. 2502.12665
2025-06-03 SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation Jialong Wu et.al. 2412.13649_(ACL)
2025-06-02 SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation Aurick Qiao et.al. 2410.03960
2025-06-01 A Survey of LLM $\times$ DATA Xuanhe Zhou et.al. 2505.18458
2025-05-30 RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning Junhao Hu et.al. 2502.11147
2025-05-30 Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding Feiyu Yao et.al. 2506.15704
2025-05-29 EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse Tianyu Guo et.al. 2505.21889_(DIS)
2025-05-28 gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling Tianyu Guo et.al. 2504.14775
2025-05-28 KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization Coleman Hooper et.al. 2401.18079_(NeurIPS)
2025-05-28 Design and testing of an agent chatbot supporting decision making with public transport data Luca Fantin et.al. 2505.22698
2025-05-27 Hardware-Efficient Attention for Fast Decoding Ted Zadouri et.al. 2505.21487
2025-05-27 TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization Dingyu Yao et.al. 2505.19586
2025-05-27 EPIC: Efficient Position-Independent Caching for Serving Large Language Models Junhao Hu et.al. 2410.15332
2025-05-26 PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving Ahmet Caner Yüzügüler et.al. 2501.08192
2025-05-26 BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems Yuxin Wang et.al. 2401.17644
2025-05-26 Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents Ye Ye et.al. 2505.19436
2025-05-24 PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs Tengxuan Liu et.al. 2505.18610
2025-05-23 Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence Amirhosein Ghasemabadi et.al. 2505.20325
2025-05-23 ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy Gengyang Li et.al. 2505.15684
2025-05-23 Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation Yuelyu Ji et.al. 2505.17391
2025-05-23 Boosting Long-Context Management via Query-Guided Activation Refilling Hongjin Qian et.al. 2412.12486_(ACL)
2025-05-23 Mitigate Position Bias in Large Language Models via Scaling a Single Dimension Yijiong Yu et.al. 2406.02536_(ACL)
2025-05-21 Can LLMs Maintain Fundamental Abilities under KV Cache Compression? Xiang Liu et.al. 2502.01941
2025-05-21 LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval Zhenyu Ning et.al. 2505.15269
2025-05-20 CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration Pengyan Zhu et.al. 2505.14085
2025-05-20 Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation Peter Baile Chen et.al. 2505.14398
2025-05-20 Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding Sakhinana Sagar Srinivas et.al. 2504.01281
2025-05-19 SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache Qiuyu Zhu et.al. 2505.10951
2025-05-19 Learning Virtual Machine Scheduling in Cloud Computing through Language Agents JieHao Wu et.al. 2505.10117
2025-05-18 ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning Edward Y. Chang et.al. 2505.12501
2025-05-17 Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents Tiannuo Yang et.al. 2505.12065
2025-05-17 OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents Raghav Thind et.al. 2504.16918
2025-05-16 KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse Huan Yang et.al. 2503.16525
2025-05-14 Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization Minsu Kim et.al. 2503.18599
2025-05-13 Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs Lucas Maisonnave et.al. 2504.13989
2025-05-12 SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models Hang Wu et.al. 2505.07680
2025-05-12 PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications Kuntai Du et.al. 2505.07203
2025-05-09 Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM Zehao Fan et.al. 2505.05772
2025-05-01 QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Yujun Lin et.al. 2405.04532
2025-04-28 semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage Ke Hong et.al. 2504.19867
2025-04-25 ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference Hanshi Sun et.al. 2410.21465
2025-04-24 L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference Qingyuan Liu et.al. 2504.17584
2025-04-24 Tempo: Application-aware LLM Serving with Mixed SLO Requirements Wei Zhang et.al. 2504.20068
2025-04-24 Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents Yueying Li et.al. 2504.07347
2025-04-22 Optimizing SLO-oriented LLM Serving with PD-Multiplexing Weihao Cui et.al. 2504.14489
2025-04-21 LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Shang Yang et.al. 2502.14866
2025-04-21 FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Zihao Ye et.al. 2501.01005
2025-04-21 PLANET: A Collection of Benchmarks for Evaluating LLMs’ Planning Capabilities Haoming Li et.al. 2504.14773
2025-04-19 Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management Hang Zhang et.al. 2505.03756
2025-04-16 Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading Kihyun Kim et.al. 2504.11816
2025-04-16 Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs Hyungwoo Lee et.al. 2504.11765
2025-04-14 AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference Yangshen Deng et.al. 2504.10326
2025-04-13 Efficient LLM Serving on Hybrid Real-time and Best-effort Requests Borui Wan et.al. 2504.09590
2025-04-13 Block-Attention for Efficient Prefilling Dongyang Ma et.al. 2409.15355_(ICLR)
2025-04-10 Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving Shihong Gao et.al. 2504.07494
2025-04-09 Optimizing LLM Queries in Relational Data Analytics Workloads Shu Liu et.al. 2403.05821
2025-04-09 MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation Hongjin Qian et.al. 2409.05591_(TheWebConf)
2025-04-03 CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion Jiayi Yao et.al. 2405.16444
2025-04-03 HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse Yuwei An et.al. 2504.02921
2025-04-02 MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding Ranajoy Sadhukhan et.al. 2408.11049
2025-04-01 Personality-Driven Decision-Making in LLM-Based Autonomous Agents Lewis Newsham et.al. 2504.00727
2025-04-01 HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents Shiyi Liu et.al. 2504.00434
2025-03-31 Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving Wei Gao et.al. 2503.24000
2025-03-31 Training-Free Exponential Context Extension via Cascading KV Cache Jeffrey Willette et.al. 2406.17808
2025-03-25 Agent-Initiated Interaction in Phone UI Automation Noam Kahlon et.al. 2503.19537
2025-03-24 Mitigating KV Cache Competition to Enhance User Experience in LLM Inference Haiying Shen et.al. 2503.13773
2025-03-19 Exploring Large Language Models for Word Games: Who is the Spy? Chentian Wei et.al. 2503.15235
2025-03-12 COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation Di Zhao et.al. 2503.09263
2025-03-11 FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework Jianian Zhu et.al. 2503.08461
2025-03-11 LLM4MAC: An LLM-Driven Reinforcement Learning Framework for MAC Protocol Emergence Renxuan Tan et.al. 2503.08123
2025-03-11 Agent-Oriented Planning in Multi-Agent Systems Ao Li et.al. 2410.02189_(ICLR)
2025-03-11 SCBench: A KV Cache-Centric Analysis of Long-Context Methods Yucheng Li et.al. 2412.10319_(ICLR)
2025-03-10 Queueing, Predictions, and LLMs: Challenges and Open Problems Michael Mitzenmacher et.al. 2503.07545
2025-03-10 DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems Junwei Yu et.al. 2503.07675
2025-03-10 TokenButler: Token Importance is Predictable Yash Akhauri et.al. 2503.07518
2025-03-09 Seesaw: High-throughput LLM Inference via Model Re-sharding Qidong Su et.al. 2503.06433
2025-03-07 DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference Jinwei Yao et.al. 2404.00242_(DATE)
2025-03-06 LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression Souvik Kundu et.al. 2503.04982_(ACL)
2025-03-06 Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning Giulio Corallo et.al. 2503.04973
2025-03-06 Markov Chain of Thought for Efficient Mathematical Reasoning Wen Yang et.al. 2410.17635_(ACL)
2025-03-06 DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) Zongxin Yang et.al. 2401.08392
2025-03-05 Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line Muhammad Waseem et.al. 2503.03889
2025-03-04 Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression Nathan Godey et.al. 2503.02812
2025-03-03 WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models Jian Yuan et.al. 2503.01330_(ICASSP)
2025-03-01 Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving Qihui Zhou et.al. 2503.00392
2025-03-01 Streaming Video Question-Answering with In-context Video KV-Cache Retrieval Shangzhe Di et.al. 2503.00540_(ICLR)
2025-02-28 ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments Pedro Gimenes et.al. 2502.21208
2025-02-27 ThinK: Thinner Key Cache by Query-Driven Pruning Yuhui Xu et.al. 2407.21018_(ICLR)
2025-02-27 TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning Soumyabrata Chaudhuri et.al. 2502.20508
2025-02-27 EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance Yingxin Li et.al. 2412.08521
2025-02-24 ELMo-Tune-V2: LLM-Assisted Full-Cycle Auto-Tuning to Optimize LSM-Based Key-Value Stores Viraj Thakkar et.al. 2502.17606
2025-02-24 The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? Zhenheng Tang et.al. 2502.17535
2025-02-22 AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure The AIBrix Team et.al. 2504.03648
2025-02-20 Compute Or Load KV Cache? Why Not Both? Shuowei Jin et.al. 2410.03065
2025-02-20 SpinQuant: LLM quantization with learned rotations Zechun Liu et.al. 2405.16406_(ICLR)
2025-02-20 Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents Axel Backlund et.al. 2502.15840
2025-02-20 Plan-over-Graph: Towards Parallelable LLM Agent Schedule Shiqi Zhang et.al. 2502.14563
2025-02-20 EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts Subhajit Chaudhury et.al. 2502.14280
2025-02-20 More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression Jiebin Zhang et.al. 2412.12706
2025-02-19 Autellix: An Efficient Serving Engine for LLM Agents as General Programs Michael Luo et.al. 2502.13965
2025-02-19 Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference Qingfa Xiao et.al. 2502.13542
2025-02-17 Does RAG Really Perform Bad For Long-Context Processing? Kun Luo et.al. 2502.11444
2025-02-16 An Intelligent Agentic System for Complex Image Restoration Problems Kaiwen Zhu et.al. 2410.17809_(ICLR)
2025-02-16 CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation Kun-Hui Lee et.al. 2502.11101
2025-02-11 HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment Youhe Jiang et.al. 2502.07903_(ICLR)
2025-02-06 Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents Chenyang Shao et.al. 2502.04392
2025-02-05 Qrazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring Dongyoung Lee et.al. 2501.13331
2025-02-05 Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation Shubham Agarwal et.al. 2502.15734_(SIGMOD)
2025-02-04 LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation Xuan Zhang et.al. 2410.13846
2025-02-02 RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations Zunhai Su et.al. 2501.16383
2025-01-29 vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention Ramya Prabhu et.al. 2405.04437_(ASPLOS)
2025-01-29 MACI: Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning Edward Y. Chang et.al. 2501.16689
2025-01-27 PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization Mengzhao Chen et.al. 2410.05265
2025-01-27 LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System Tianfu Wang et.al. 2501.15749_(WWW)
2025-01-25 Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads Xingyang He et.al. 2501.15113
2025-01-23 A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention Heejun Lee et.al. 2406.09827
2025-01-22 Yi-Lightning Technical Report Alan Wake et.al. 2412.01253
2025-01-17 BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching Zhen Zheng et.al. 2412.03594
2025-01-14 CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning Guoliang He et.al. 2501.08071_(CGO)
2025-01-12 Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management Qianli Liu et.al. 2501.06709
2025-01-06 The Power of Negative Zero: Datatype Customization for Quantized Large Language Models Yuzong Chen et.al. 2501.04052_(ISS)
2024-12-31 RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval Di Liu et.al. 2409.10516
2024-12-24 TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications Neiwen Ling et.al. 2412.18695
2024-12-23 Deliberation in Latent Space via Differentiable Cache Augmentation Luyang Liu et.al. 2412.17747
2024-12-22 VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things Yaoyao Zhong et.al. 2312.00401_(AAAI)
2024-12-21 MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool Cunchen Hu et.al. 2406.17565
2024-12-21 SYMPHONY: Improving Memory Management for LLM Inference Workloads Saurabh Agarwal et.al. 2412.16434
2024-12-18 MagicPIG: LSH Sampling for Efficient LLM Generation Zhuoming Chen et.al. 2410.16179
2024-12-17 A System for Microserving of LLMs Hongyi Jin et.al. 2412.12488
2024-12-16 CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation Hongxuan Zhang et.al. 2412.11741
2024-12-16 Steering Language Models with Game-Theoretic Solvers Ian Gemp et.al. 2402.01704
2024-12-15 LAW: Legal Agentic Workflows for Custody and Fund Services Contracts William Watson et.al. 2412.11063_(COLING)
2024-12-13 KVDirect: Distributed Disaggregated LLM Inference Shiyang Chen et.al. 2501.14743
2024-12-06 Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern Hongyin Tang et.al. 2412.04757
2024-12-05 A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts Suyu Ge et.al. 2410.01485
2024-11-27 FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving Ao Shen et.al. 2411.18424
2024-11-22 Rapid Integration of LLMs in Healthcare Raises Ethical Concerns: An Investigation into Deceptive Patterns in Social Robots Robert Ranisch et.al. 2410.00434
2024-11-14 Large Language Models for Power Scheduling: A User-Centric Approach Thomas Mongaillard et.al. 2407.00476
2024-11-08 Eigen Attention: Attention in Low-Rank Space for KV Cache Compression Utkarsh Saxena et.al. 2408.05646
2024-11-05 AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution Zhiqiang Xie et.al. 2411.03519
2024-11-05 SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction Shlomo Neuberger et.al. 2411.03397
2024-11-03 A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression Alessio Devoto et.al. 2406.11430_(EMNLP)
2024-11-02 NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference Xuanlin Jiang et.al. 2411.01142
2024-11-01 Understanding Communication Preferences of Information Workers in Engagement with Text-Based Conversational Agents Ananya Bhattacharjee et.al. 2410.20468
2024-10-31 ALISE: Accelerating Large Language Model Serving with Speculative Scheduling Youpeng Zhao et.al. 2410.23537_(ICC)
2024-10-25 Fast Inference for Augmented Large Language Models Rana Shahout et.al. 2410.18248
2024-10-21 Do Large Language Models Need a Content Delivery Network? Yihua Cheng et.al. 2409.13761
2024-10-17 LLoCO: Learning Long Contexts Offline Sijun Tan et.al. 2404.07979_(EMNLP)
2024-10-16 COMET: Towards Practical W4A4KV4 LLMs Serving Lian Liu et.al. 2410.12168
2024-10-14 DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads Guangxuan Xiao et.al. 2410.10819
2024-10-11 OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents Yuwei Yan et.al. 2410.21286
2024-10-09 LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management Yi Xiong et.al. 2410.00428
2024-10-08 KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches Jiayi Yuan et.al. 2407.01527
2024-10-07 Fast State Restoration in LLM Serving with HCache Shiwei Gao et.al. 2410.05004_(EuroSys)
2024-10-07 KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head Isaac Rehg et.al. 2410.00161
2024-10-06 SafeLLM: Domain-Specific Safety Monitoring for Large Language Models: A Case Study of Offshore Wind Maintenance Connor Walker et.al. 2410.10852
2024-10-04 LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy Rongzhi Zhang et.al. 2410.03111
2024-10-03 Preble: Efficient Distributed Prompt Scheduling for LLM Serving Vikranth Srivatsa et.al. 2407.00023
2024-10-03 Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1 Karthik Valmeekam et.al. 2410.02162
2024-09-23 BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models Bodun Hu et.al. 2404.18322
2024-09-16 Scalable Differential Privacy Mechanisms for Real-Time Machine Learning Applications Jessica Smith et.al. 2410.02462
2024-09-11 Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU Zhenyu Ning et.al. 2409.09086
2024-08-05 SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving Andreas Kosmas Kakolyris et.al. 2408.05235
2024-08-04 TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding Hanshi Sun et.al. 2404.11912
2024-08-01 ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition Lu Ye et.al. 2402.15220_(ACL)
2024-07-26 Collaborative Evolving Strategy for Automatic Data-Centric Development Xu Yang et.al. 2407.18690
2024-07-25 KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache Zirui Liu et.al. 2402.02750_(ICML)
2024-07-23 Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference Piotr Nawrot et.al. 2403.09636
2024-07-22 vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving Jiale Xu et.al. 2407.15309
2024-07-22 RazorAttention: Efficient KV Cache Compression Through Retrieval Heads Hanlin Tang et.al. 2407.15891
2024-07-21 Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks Zheng Wang et.al. 2407.08454
2024-07-19 CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving Yuhan Liu et.al. 2310.07240_(SIGCOMM)
2024-07-18 QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead Amir Zandieh et.al. 2406.03482
2024-07-11 Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs Ben Athiwaratkun et.al. 2403.08845
2024-06-30 Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention Bin Gao et.al. 2403.19708_(ATC)
2024-06-28 InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management Wonbeom Lee et.al. 2406.19707_(OSDI)
2024-06-16 EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism Yanxi Chen et.al. 2312.04916_(ICML)
2024-06-08 QCQA: Quality and Capacity-aware grouped Query Attention Vinay Joshi et.al. 2406.10247
2024-06-06 SGLang: Efficient Execution of Structured Language Model Programs Lianmin Zheng et.al. 2312.07104
2024-05-13 Hydragen: High-Throughput LLM Inference with Shared Prefixes Jordan Juravsky et.al. 2402.05099
2024-05-06 Federated Reinforcement Learning with Constraint Heterogeneity Hao Jin et.al. 2405.03236
2024-05-01 Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing KV Aditya Srivatsa et.al. 2405.00467_(ACL)
2024-04-15 Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models Siyan Zhao et.al. 2404.09529
2024-04-06 The Case for Developing a Foundation Model for Planning-like Tasks from Scratch Biplav Srivastava et.al. 2404.04540
2024-03-26 ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching Youpeng Zhao et.al. 2403.17312_(ISCA)
2024-03-18 FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines Jiaao He et.al. 2403.11421
2024-03-04 DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving Foteini Strati et.al. 2403.01876
2024-03-04 LLM-based Smart Reply (LSR): Enhancing Collaborative Performance with ChatGPT-mediated Smart Reply System Ashish Bastola et.al. 2306.11980
2024-02-04 Conversational Crowdsensing: A Parallel Intelligence Powered Novel Sensing Approach Zhengqiu Zhu et.al. 2402.06654
2024-01-20 On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS) Vishal Pallagani et.al. 2401.02500
2023-12-26 Natural Language based Context Modeling and Reasoning for Ubiquitous Computing with Large Language Models: A Tutorial Haoyi Xiong et.al. 2309.15074
2023-11-09 Towards A Natural Language Interface for Flexible Multi-Agent Task Assignment Jake Brawer et.al. 2311.00153
2023-10-30 SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models Hongxin Li et.al. 2305.19308_(NeurIPS)
2023-09-19 MindAgent: Emergent Gaming Interaction Ran Gong et.al. 2309.09971
2023-09-12 Efficient Memory Management for Large Language Model Serving with PagedAttention Woosuk Kwon et.al. 2309.06180_(SOSP)
2023-06-09 S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput Yunho Jin et.al. 2306.06000