# LLM Papers

Updated on 2026-04-03
| Publish Date | Title | Authors | arXiv ID |
|---|---|---|---|
| 2026-04-01 | Universal YOCO for Efficient Depth Scaling | Yutao Sun et.al. | 2604.01220 |
| 2026-03-30 | Low-Latency Edge LLM Handover via Joint KV Cache Transfer and Token Prefill | Seunghun Lee et.al. | 2603.28018 |
| 2026-03-29 | Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts | Sijia Luo et.al. | 2601.10079 |
| 2026-03-28 | CoDec: Prefix-Shared Decoding Kernel for LLMs | Zhibin Wang et.al. | 2505.17694 |
| 2026-03-27 | LiteCache: A Query Similarity-Driven, GPU-Centric KVCache Subsystem for Efficient LLM Inference | Jiawei Yi et.al. | 2511.14510 |
| 2026-03-26 | Extending Puzzle for Mixture-of-Experts Reasoning Models with Application to GPT-OSS Acceleration | Akhiad Bercovich et.al. | 2602.11937 |
| 2026-03-25 | Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning | Adnan Oomerjee et.al. | 2505.16950 |
| 2026-03-25 | ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators | Guoqiang Zou et.al. | 2512.09427 |
| 2026-03-25 | LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling | Dingyan Zhang et.al. | 2603.15202 |
| 2026-03-24 | PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving | Wenfeng Wang et.al. | 2603.23049 |
| 2026-03-24 | StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving | Azam Nouri et.al. | 2603.28795 |
| 2026-03-22 | The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project | Huamin Chen et.al. | 2603.21354 |
| 2026-03-20 | Understanding and Optimizing Multi-Stage AI Inference Pipelines | Abhimanyu Rajeshkumar Bambhaniya et.al. | 2504.09775 |
| 2026-03-20 | KV Cache Optimization Strategies for Scalable and Efficient LLM Inference | Yichun Xu et.al. | 2603.20397 |
| 2026-03-20 | Trained Persistent Memory for Frozen Decoder-Only LLMs | Hong Jeong et.al. | 2603.22329 |
| 2026-03-19 | StreamingThinker: Large Language Models Can Think While Reading | Junlong Tong et.al. | 2510.17238_(ICLR) |
| 2026-03-18 | Multi-stage Flow Scheduling for LLM Serving | Yijun Sun et.al. | 2603.17456 |
| 2026-03-18 | The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency | Huamin Chen et.al. | 2603.17280 |
| 2026-03-18 | Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs | Tuowei Wang et.al. | 2603.17803 |
| 2026-03-18 | Learning When to Attend: Conditional Memory Access for Long-Context LLMs | Sakshi Choudhary et.al. | 2603.17484 |
| 2026-03-18 | IEMAS: An Incentive-Efficiency Routing Framework for Open Agentic Web Ecosystems | Hongze Liu et.al. | 2603.17302 |
| 2026-03-17 | EfficientNav: Towards On-Device Object-Goal Navigation with Navigation Map Caching and Retrieval | Zebin Yang et.al. | 2510.18546_(NeurIPS) |
| 2026-03-17 | Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective | Noppanat Wadlom et.al. | 2603.16104 |
| 2026-03-17 | Efficient Reasoning on the Edge | Yelysei Bondarenko et.al. | 2603.16867 |
| 2026-03-16 | PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel | Jinjun Yi et.al. | 2511.22333_(ASPLOS) |
| 2026-03-16 | Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation | Yanick Zengaffinen et.al. | 2603.15547 |
| 2026-03-15 | Learning to Forget: Sleep-Inspired Memory Consolidation for Resolving Proactive Interference in Large Language Models | Ying Xie et.al. | 2603.14517 |
| 2026-03-15 | Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys | Xu Yang et.al. | 2603.14224 |
| 2026-03-14 | StreamingTOM: Streaming Token Compression for Efficient Video Understanding | Xueyi Chen et.al. | 2510.18269_(CVPR) |
| 2026-03-14 | DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs | Lizhuo Luo et.al. | 2602.05992 |
| 2026-03-13 | Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity | Donglin Yu et.al. | 2603.12707 |
| 2026-03-13 | Orla: A Library for Serving LLM-Based Multi-Agent Systems | Rana Shahout et.al. | 2603.13605 |
| 2026-03-13 | StatePlane: A Cognitive State Plane for Long-Horizon AI Systems Under Bounded Context | Sasank Annapureddy et.al. | 2603.13644 |
| 2026-03-12 | Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache | Xinhai Wang et.al. | 2603.13420 |
| 2026-03-11 | KV Cache Transform Coding for Compact Storage in LLM Inference | Konrad Staniszewski et.al. | 2511.01815_(ICLR) |
| 2026-03-11 | Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents | Kaiyu Zhou et.al. | 2601.10955 |
| 2026-03-10 | Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework | Kerui Huang et.al. | 2509.14093 |
| 2026-03-10 | ThunderAgent: A Simple, Fast and Program-Aware Agentic Inference System | Hao Kang et.al. | 2602.13692 |
| 2026-03-09 | FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference | Guangda Liu et.al. | 2505.13109 |
| 2026-03-09 | EAGLE-Pangu: Accelerator-Safe Tree Speculative Decoding on Ascend NPUs | Chang Han et.al. | 2603.08088 |
| 2026-03-09 | LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing | Dongfang Li et.al. | 2603.08453 |
| 2026-03-09 | Not All Prefills Are Equal: PPD Disaggregation for Multi-turn LLM Serving | Zongze Li et.al. | 2603.13358 |
| 2026-03-08 | Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs | Raghavv Goel et.al. | 2603.07475_(ICLR) |
| 2026-03-06 | Good-Enough LLM Obfuscation (GELO) | Anatoly Belikov et.al. | 2603.05035 |
| 2026-03-06 | MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens | Yu Chen et.al. | 2603.23516 |
| 2026-03-05 | Hardware-Software Co-design for 3D-DRAM-based LLM Serving Accelerator | Cong Li et.al. | 2603.04797 |
| 2026-03-05 | InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context | Xin Teng et.al. | 2603.05353 |
| 2026-03-03 | xLLM Technical Report | Tongxuan Liu et.al. | 2510.14686 |
| 2026-03-03 | Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving | Rui Li et.al. | 2512.22420 |
| 2026-03-02 | AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size | Guanxi Lu et.al. | 2509.26432_(ICLR) |
| 2026-03-02 | OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration | Xinyue Ma et.al. | 2601.10729_(VLDB) |
| 2026-03-02 | Multi-Layer Scheduling for MoE-Based LLM Reasoning | Yifan Sun et.al. | 2602.21626 |
| 2026-03-02 | FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving | Shouwei Gao et.al. | 2602.22593_(ICS) |
| 2026-03-02 | Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics | Samhruth Ananthanarayanan et.al. | 2603.01426 |
| 2026-03-01 | Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs | Ngoc Bui et.al. | 2512.03324 |
| 2026-03-01 | Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention | Mengqi Liao et.al. | 2603.08743 |
| 2026-02-28 | FASA: Frequency-aware Sparse Attention | Yifei Wang et.al. | 2602.03152_(ICLR) |
| 2026-02-28 | RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse | Yingsheng Geng et.al. | 2603.13289 |
| 2026-02-27 | SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning | Sanjay Kariyappa et.al. | 2602.22603 |
| 2026-02-27 | ICaRus: Identical Cache Reuse for Efficient Multi Model Inference | Sunghyeon Woo et.al. | 2603.13281 |
| 2026-02-26 | DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference | Yongtong Wu et.al. | 2602.21548 |
| 2026-02-23 | ContextPilot: Fast Long-Context Inference via Context Reuse | Yinsicheng Jiang et.al. | 2511.03475 |
| 2026-02-21 | Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs | Subham Sekhar Sahoo et.al. | 2506.01928 |
| 2026-02-20 | Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning | Lexiang Tang et.al. | 2602.18232 |
| 2026-02-19 | ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs | Jianlong Lei et.al. | 2603.08727 |
| 2026-02-17 | Intermittent Semi-Working Mask: A New Masking Paradigm for LLMs | HaoYuan Hu et.al. | 2408.00539 |
| 2026-02-14 | KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider | Jiahao Wang et.al. | 2506.02634_(ATC) |
| 2026-02-13 | Doc-to-LoRA: Learning to Instantly Internalize Contexts | Rujikorn Charakorn et.al. | 2602.15902 |
| 2026-02-12 | Efficient Remote Prefix Fetching with GPU-native Media ASICs | Liang Mi et.al. | 2602.09725 |
| 2026-02-12 | SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining | Yifan Zhang et.al. | 2602.10718 |
| 2026-02-12 | PrefillShare: A Shared Prefill Module for KV Reuse in Multi-LLM Disaggregated Serving | Sunghyeon Woo et.al. | 2602.12029 |
| 2026-02-12 | GORGO: Maximizing KV-Cache Reuse While Minimizing Network Latency in Cross-Region LLM Load Balancing | Alessio Ricci Toniolo et.al. | 2602.11688 |
| 2026-02-12 | PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System | Lian Liu et.al. | 2602.11521 |
| 2026-02-10 | LLM Serving Optimization with Variable Prefill and Decode Lengths | Meixuan Wang et.al. | 2508.06133 |
| 2026-02-10 | ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs | Yanlin Qi et.al. | 2602.07721 |
| 2026-02-10 | Learning to Evict from Key-Value Cache | Luca Moschella et.al. | 2602.10238 |
| 2026-02-09 | Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction | Jang-Hyun Kim et.al. | 2601.17668_(FAST) |
| 2026-02-09 | Near-Oracle KV Selection via Pre-hoc Sparsity for Long-Context Inference | Yifei Gao et.al. | 2602.08329 |
| 2026-02-09 | Selective KV-Cache Sharing to Mitigate Timing Side-Channels in LLM Inference | Kexin Chu et.al. | 2508.08438 |
| 2026-02-08 | Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model | Tianyi Wang et.al. | 2602.07878 |
| 2026-02-08 | DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity | Jitai Hao et.al. | 2602.08005 |
| 2026-02-06 | DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving | Ying Yuan et.al. | 2602.06502 |
| 2026-02-03 | Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning | Zhicheng Yang et.al. | 2602.03249 |
| 2026-02-03 | ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution | Zican Dong et.al. | 2602.03203 |
| 2026-02-03 | PackInfer: Compute- and I/O-Efficient Attention for Batched LLM Inference | Rui Ning et.al. | 2602.06072 |
| 2026-02-02 | RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse | Mingrui Liu et.al. | 2602.01795 |
| 2026-02-02 | CoMeT: Collaborative Memory Transformer for Efficient Long Context Modeling | Runsong Zhao et.al. | 2602.01766 |
| 2026-02-02 | You Need an Encoder for Native Position-Independent Caching | Shiju Zhao et.al. | 2602.01519 |
| 2026-02-02 | State Rank Dynamics in Linear Attention LLMs | Ao Sun et.al. | 2602.02195 |
| 2026-01-31 | FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning | Hao Mark Chen et.al. | 2509.00195_(ASPLOS) |
| 2026-01-30 | Can Transformer Memory Be Corrupted? Investigating Cache-Side Vulnerabilities in Large Language Models | Elias Hossain et.al. | 2510.17098 |
| 2026-01-30 | Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live | Hanchen Li et.al. | 2511.02230 |
| 2026-01-30 | CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control | Qiaoling Chen et.al. | 2601.22705 |
| 2026-01-30 | Towards Resiliency in Large Language Model Serving with KevlarFlow | Shangshu Qian et.al. | 2601.22438 |
| 2026-01-30 | Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference | Yiding Feng et.al. | 2601.22996 |
| 2026-01-30 | Harvest: Opportunistic Peer-to-Peer GPU Caching for LLM Inference | Nikhil Gopal et.al. | 2602.00328 |
| 2026-01-29 | Theoretically Optimal Attention/FFN Ratios in Disaggregated LLM Serving | Chendong Song et.al. | 2601.21351_(ICML) |
| 2026-01-29 | Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis | Qingyue Yang et.al. | 2601.21709_(ICLR) |
| 2026-01-28 | SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips | Jiahuan Yu et.al. | 2601.20309 |
| 2026-01-28 | Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning | Zeyu Xing et.al. | 2601.20326_(ICLR) |
| 2026-01-28 | ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference | Ketan Thakkar et.al. | 2601.21109 |
| 2026-01-27 | Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding | Zhongyu Xiao et.al. | 2601.17917 |
| 2026-01-26 | Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective | Fangzhou Wu et.al. | 2601.18999_(ICLR) |
| 2026-01-21 | QMC: Efficient SLM Edge Inference via Outlier-Aware Quantization and Emergent Memories Co-Design | Nilesh Prasad Pandey et.al. | 2601.14549 |
| 2026-01-20 | KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments | Junyoung Park et.al. | 2504.15364_(NeurIPS) |
| 2026-01-20 | ContiguousKV: Accelerating LLM Prefill with Granularity-Aligned KV Cache Management | Jing Zou et.al. | 2601.13631 |
| 2026-01-20 | HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference | Zhiyuan Shi et.al. | 2601.13684 |
| 2026-01-20 | LLMOrbit: A Circular Taxonomy of Large Language Models - From Scaling Walls to Agentic AI Systems | Badri N. Patro et.al. | 2601.14053 |
| 2026-01-19 | Cache Your Prompt When It’s Green: Carbon-Aware Caching for Large Language Model Serving | Yuyang Tian et.al. | 2505.23970 |
| 2026-01-19 | SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences | Jungyoub Cha et.al. | 2505.20776 |
| 2026-01-19 | Batch Query Processing and Optimization for Agentic Workflows | Junyi Shen et.al. | 2509.02121 |
| 2026-01-19 | Sutradhara: An Intelligent Orchestrator-Engine Co-design for Tool-based Agentic Inference | Anish Biswas et.al. | 2601.12967 |
| 2026-01-19 | From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation | Jiahao Wang et.al. | 2601.12904 |
| 2026-01-18 | Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline | Jiawei Xu et.al. | 2601.12307 |
| 2026-01-16 | RAPID-Serve: Resource-efficient and Accelerated P/D Intra-GPU Disaggregation | Amna Masood et.al. | 2601.11822 |
| 2026-01-15 | Online Scheduling for LLM Inference with KV Cache Constraints | Patrick Jaillet et.al. | 2502.07115 |
| 2026-01-15 | AdaptCache: KV Cache Native Storage Hierarchy for Low-Delay and High-Quality Language Model Serving | Shaoting Feng et.al. | 2509.00105_(SOSP) |
| 2026-01-15 | Hardware Acceleration for Neural Networks: A Comprehensive Survey | Bin Xu et.al. | 2512.23914 |
| 2026-01-14 | APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs | Jiakun Fan et.al. | 2506.03296 |
| 2026-01-14 | CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation | Hasan Akgul et.al. | 2510.19670 |
| 2026-01-14 | Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling | Zhixiang Liang et.al. | 2601.09093 |
| 2026-01-13 | TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL | Jinbo Su et.al. | 2601.08743 |
| 2026-01-12 | Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference | Rei Taniguchi et.al. | 2601.07667_(CHI) |
| 2026-01-11 | Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models | Haoyu Wang et.al. | 2506.07334 |
| 2026-01-11 | Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention | Zhen Yang et.al. | 2510.13940 |
| 2026-01-09 | AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving | Tianhao Xu et.al. | 2601.06288 |
| 2026-01-07 | MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing | Zhaoyuan Su et.al. | 2506.02006 |
| 2026-01-07 | InComeS: Integrating Compression and Selection Mechanisms into LLMs for Efficient Model Editing | Shuaiyi Li et.al. | 2505.22156 |
| 2026-01-06 | Making MoE-based LLM Inference Resilient with Tarragon | Songyu Zhang et.al. | 2601.01310 |
| 2026-01-06 | Joint Encoding of KV-Cache Blocks for Scalable LLM Serving | Joseph Kampeas et.al. | 2601.03067 |
| 2026-01-05 | Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints | Ruicheng Ao et.al. | 2504.11320 |
| 2026-01-05 | LoRA-Drop: Temporal LoRA Decoding for Efficient LLM Inference | Hossein Rajabzadeh et.al. | 2601.02569 |
| 2026-01-05 | Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle | Zihan Wang et.al. | 2601.16986 |
| 2026-01-03 | Warp-Cortex: An Asynchronous, Memory-Efficient Architecture for Million-Agent Cognitive Scaling on Consumer Hardware | Jorge L. Ruiz Williams et.al. | 2601.01298 |
| 2026-01-03 | KV-Embedding: Training-free Text Embedding via Internal KV Re-routing in Decoder-only LLMs | Yixuan Tang et.al. | 2601.01046 |
| 2025-12-28 | Attention Is All You Need for KV Cache in Diffusion LLMs | Quan Nguyen-Tri et.al. | 2510.14973 |
| 2025-12-28 | WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference | Aiwei Liu et.al. | 2512.22737 |
| 2025-12-25 | PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System | Hyucksung Kwon et.al. | 2412.20166_(CHI) |
| 2025-12-25 | Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths | Ahilan Ayyachamy Nadar Ponnusamy et.al. | 2601.11564 |
| 2025-12-24 | V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval | Donghyuk Kim et.al. | 2512.12284_(HPCA) |
| 2025-12-22 | MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning | Tao Zhang et.al. | 2512.19206 |
| 2025-12-20 | TraCT: Disaggregated LLM Serving with CXL Shared Memory KV Cache at Rack-Scale | Dongha Yoon et.al. | 2512.18194 |
| 2025-12-20 | MatKV: Trading Compute for Flash Storage in LLM Inference | Kun-Woo Shin et.al. | 2512.22195_(ICDE) |
| 2025-12-19 | xGR: Efficient Generative Recommendation Serving at Scale | Qingxiao Sun et.al. | 2512.11529 |
| 2025-12-18 | MEPIC: Memory Efficient Position Independent Caching for LLM Serving | Qian Wang et.al. | 2512.16822 |
| 2025-12-17 | CTkvr: KV Cache Retrieval for Long-Context LLMs via Centroid then Token Indexing | Kuan Lu et.al. | 2512.15550 |
| 2025-12-17 | Dynamic Rebatching for Efficient Early-Exit Inference with DREX | Xuting Liu et.al. | 2512.15705 |
| 2025-12-16 | SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching | Xinye Zhao et.al. | 2509.24832 |
| 2025-12-16 | Astraea: A State-Aware Scheduling Engine for LLM-Powered Agents | Hongqiu Ni et.al. | 2512.14142 |
| 2025-12-16 | EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving | Shaoting Feng et.al. | 2512.14946 |
| 2025-12-16 | Understanding Bottlenecks for Efficiently Serving LLM Inference With KV Offloading | William Meng et.al. | 2601.19910 |
| 2025-12-15 | Uncovering the Role of Initial Saliency in U-Shaped Attention Bias: Scaling Initial Token Weight for Enhanced Long-Text Processing | Zewen Qiang et.al. | 2512.13109 |
| 2025-12-14 | Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving | Hui Zeng et.al. | 2511.06029_(AAAI) |
| 2025-12-12 | Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference | Adilet Metinov et.al. | 2512.11221 |
| 2025-12-12 | Hold Onto That Thought: Assessing KV Cache Compression On Reasoning | Minghui Liu et.al. | 2512.12008 |
| 2025-12-11 | Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders | Qingsen Ma et.al. | 2512.10547 |
| 2025-12-11 | CXL-SpecKV: A Disaggregated FPGA Speculative KV-Cache for Datacenter LLM Serving | Dong Liu et.al. | 2512.11920_(FPGA) |
| 2025-12-10 | SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators | Jonathan Li et.al. | 2511.03092 |
| 2025-12-10 | PerCache: Predictive Hierarchical Cache for RAG Applications on Mobile Devices | Kaiwei Liu et.al. | 2601.11553 |
| 2025-12-08 | H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference | Zizhuo Fu et.al. | 2508.16653_(ICC) |
| 2025-12-08 | Leveraging KV Similarity for Online Structured Pruning in LLMs | Jungmin Lee et.al. | 2512.07090 |
| 2025-12-07 | KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models | Sourjya Roy et.al. | 2512.06727 |
| 2025-12-07 | ELANA: A Simple Energy and Latency Analyzer for LLMs | Hung-Yueh Chiang et.al. | 2512.09946 |
| 2025-12-05 | LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference | Yuhan Liu et.al. | 2510.09665 |
| 2025-12-04 | KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs | Prashant Pandey et.al. | 2512.11851 |
| 2025-12-02 | SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification | Zhendong Tan et.al. | 2512.02337 |
| 2025-12-01 | Tangram: Accelerating Serverless LLM Loading through GPU Memory Reuse and Affinity | Wenbin Zhu et.al. | 2512.01357 |
| 2025-12-01 | KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction | Aomufei Yuan et.al. | 2512.17917 |
| 2025-11-30 | SIMPLE: Disaggregating Sampling from GPU Inference into a Decision Plane for Faster Distributed LLM Serving | Bohan Zhao et.al. | 2512.00719 |
| 2025-11-30 | SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs | Jiaming Xu et.al. | 2512.00722_(ASPLOS) |
| 2025-11-29 | G-KV: Decoding-Time KV Cache Eviction with Global Attention | Mengqi Liao et.al. | 2512.00504 |
| 2025-11-27 | Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression | Boris Kriuk et.al. | 2512.17914 |
| 2025-11-26 | No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha | Amey Agrawal et.al. | 2409.17264 |
| 2025-11-26 | Efficient Multi-Adapter LLM Serving via Cross-Model KV-Cache Reuse with Activated LoRA | Allison Li et.al. | 2512.17910 |
| 2025-11-25 | On 10x Better Scalability: KV Stores Scale Up KV Cache | Weiping Yu et.al. | 2511.16138 |
| 2025-11-25 | Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation | Inferix Team et.al. | 2511.20714 |
| 2025-11-24 | SWAN: Sparse Winnowed Attention for Reduced Inference Memory via Decompression-Free KV-Cache Compression | Santhosh G S et.al. | 2511.18936 |
| 2025-11-24 | ThreadWeaver: Adaptive Threading for Efficient Parallel Reasoning in Language Models | Long Lian et.al. | 2512.07843 |
| 2025-11-23 | Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost | Haojun Xia et.al. | 2511.18643 |
| 2025-11-20 | KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference | Xing Li et.al. | 2502.04420_(ICML) |
| 2025-11-18 | Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models | Rui Zhu et.al. | 2511.14694 |
| 2025-11-17 | Hogwild! Inference: Parallel LLM Generation via Concurrent Attention | Gleb Rodionov et.al. | 2504.06261_(NeurIPS) |
| 2025-11-14 | Revisiting Disaggregated Large Language Model Serving for Performance and Energy Implications | Jiaxi Li et.al. | 2601.08833 |
| 2025-11-13 | LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components | Yaru Li et.al. | 2511.10394_(ISS) |
| 2025-11-13 | A³: Attention-Aware Accurate KV Cache Fusion for Fast Large Language Model Serving | Yuechi Zhou et.al. | 2511.17560 |
| 2025-11-11 | Glia: A Human-Inspired AI for Automated Systems Design and Optimization | Pouya Hamadanian et.al. | 2510.27176 |
| 2025-11-10 | KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse | Jingbo Yang et.al. | 2502.16002 |
| 2025-11-10 | StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression | Yilong Chen et.al. | 2511.07278 |
| 2025-11-09 | SR-KI: Scalable and Real-Time Knowledge Integration into LLMs via Supervised Attention | Bohan Yu et.al. | 2511.06446_(AAAI) |
| 2025-11-08 | Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching | Yanhao Dong et.al. | 2504.06319 |
| 2025-11-07 | Inference-Time Hyper-Scaling with KV Cache Compression | Adrian Łańcucki et.al. | 2506.05345_(NeurIPS) |
| 2025-11-07 | Learned Structure in Cartridges: Keys as Shareable Routers in Self-Studied Representations | Maurizio Diaz et.al. | 2508.17032 |
| 2025-11-06 | SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference | Tian Xia et.al. | 2505.24095 |
| 2025-11-06 | HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts | Neil He et.al. | 2505.24722 |
| 2025-11-06 | Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing | Mingyu Sung et.al. | 2511.04002 |
| 2025-11-06 | DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing | Lei Gao et.al. | 2511.04791 |
| 2025-11-05 | FREESH: Fair, Resource- and Energy-Efficient Scheduling for LLM Serving on Heterogeneous GPUs | Xuan He et.al. | 2511.00807_(ISS) |
| 2025-11-05 | ALAS: Transactional and Dynamic Multi-Agent LLM Planning | Longling Geng et.al. | 2511.03094 |
| 2025-11-05 | AnchorTP: Resilient LLM Inference with State-Preserving Elastic Tensor Parallelism | Wendong Xu et.al. | 2511.11617_(DATE) |
| 2025-11-04 | LiveSecBench: A Dynamic and Culturally-Relevant AI Safety Benchmark for LLMs in Chinese Context | Yudong Li et.al. | 2511.02366 |
| 2025-11-04 | Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration | Jingbo Wang et.al. | 2511.02200 |
| 2025-11-03 | Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving | Chengying Huan et.al. | 2511.01633 |
| 2025-11-03 | TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks | Hanwen Xu et.al. | 2511.01527 |
| 2025-11-02 | HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL | You Peng et.al. | 2505.05286 |
| 2025-11-02 | FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management | Nazmul Takbir et.al. | 2511.00868 |
| 2025-11-01 | KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems | Hancheng Ye et.al. | 2510.12872_(FAST) |
| 2025-11-01 | A CPU-Centric Perspective on Agentic AI | Ritik Raj et.al. | 2511.00739 |
| 2025-11-01 | EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory | Wenzhe Fan et.al. | 2511.01912 |
| 2025-11-01 | Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization | Massinissa Merouani et.al. | 2511.00592_(CHI) |
| 2025-10-31 | Tokencake: A KV-Cache-centric Serving Framework for LLM-based Multi-Agent Applications | Zhuohang Bian et.al. | 2510.18586 |
| 2025-10-31 | Scalable Processing-Near-Memory for 1M-Token LLM Inference: CXL-Enabled KV-Cache Management Beyond GPU Limits | Dowon Kim et.al. | 2511.00321 |
| 2025-10-30 | Agentic AI Home Energy Management System: A Large Language Model Framework for Residential Load Scheduling | Reda El Makroum et.al. | 2510.26603 |
| 2025-10-29 | Oneiros: KV Cache Optimization through Parameter Remapping for Multi-tenant LLM Serving | Ruihao Li et.al. | 2507.11507 |
| 2025-10-29 | Serve Programs, Not Prompts | In Gim et.al. | 2510.25412_(HotOS) |
| 2025-10-29 | KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA | Zhuo Chen et.al. | 2510.25101 |
| 2025-10-28 | Pie: A Programmable Serving System for Emerging LLM Applications | In Gim et.al. | 2510.24051_(SOSP) |
| 2025-10-28 | Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion | Xianjun Gao et.al. | 2510.24390 |
| 2025-10-28 | From Narrative to Action: A Hierarchical LLM-Agent Framework for Human Mobility Generation | Qiumeng Li et.al. | 2510.24802 |
| 2025-10-26 | Batch Speculative Decoding Done Right | Ranran Haoran Zhang et.al. | 2510.22876 |
| 2025-10-26 | SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size | Jinhan Chen et.al. | 2510.22556 |
| 2025-10-26 | SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming | Adhyayan Veer Singh et.al. | 2510.22626 |
| 2025-10-24 | Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning | Jiwon Song et.al. | 2505.13866 |
| 2025-10-23 | Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning | Yu Fu et.al. | 2410.19258_(ICLR) |
| 2025-10-23 | Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution | Zhiyang Chen et.al. | 2510.15312 |
| 2025-10-23 | HA-RAG: Hotness-Aware RAG Acceleration via Mixed Precision and Data Placement | Danying Ge et.al. | 2510.20878 |
| 2025-10-22 | PTFA: An LLM-based Agent that Facilitates Online Consensus Building through Parallel Thinking | Wen Gu et.al. | 2503.12499 |
| 2025-10-21 | The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems | Linke Song et.al. | 2409.20002_(ICS) |
| 2025-10-21 | Reasoning Language Model Inference Serving Unveiled: An Empirical Study | Qi Li et.al. | 2510.18672 |
| 2025-10-21 | The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability | Zijie Xu et.al. | 2510.18563 |
| 2025-10-19 | STARK: Strategic Team of Agents for Refining Kernels | Juncheng Dong et.al. | 2510.16996 |
| 2025-10-18 | Ripple Effect Protocol: Coordinating Agent Populations | Ayush Chopra et.al. | 2510.16572 |
| 2025-10-16 | Terrarium: Revisiting the Blackboard for Multi-Agent Safety, Privacy, and Security Studies | Mason Nakamura et.al. | 2510.14312 |
| 2025-10-16 | Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing | Tianhua Xia et.al. | 2510.16040 |
| 2025-10-15 | LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning | Haoyue Zhang et.al. | 2506.15969 |
| 2025-10-15 | BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure | Yiyuan He et.al. | 2510.13223 |
| 2025-10-15 | Cortex: Workflow-Aware Resource Pooling and Scheduling for Agentic Serving | Nikos Pagonas et.al. | 2510.14126 |
| 2025-10-14 | Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models | Rabimba Karanjai et.al. | 2510.12080 |
| 2025-10-13 | MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE | Soheil Zibakhsh et.al. | 2509.17238 |
| 2025-10-13 | Not All Bits Are Equal: Scale-Dependent Memory Optimization Strategies for Reasoning Models | Junhyuck Kim et.al. | 2510.10964 |
| 2025-10-13 | Part II: ROLL Flash – Accelerating RLVR and Agentic Training with Asynchrony | Han Lu et.al. | 2510.11345 |
| 2025-10-11 | CacheClip: Accelerating RAG with Effective KV Cache Reuse | Bin Yang et.al. | 2510.10129 |
| 2025-10-11 | Agentic Troubleshooting Guide Automation for Incident Management | Jiayi Mao et.al. | 2510.10074 |
| 2025-10-10 | OrcaLoca: An LLM Agent Framework for Software Issue Localization | Zhongming Yu et.al. | 2502.00350 |
| 2025-10-09 | Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval | Wenhao Li et.al. | 2508.19740 |
| 2025-10-08 | AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs | Peize He et.al. | 2510.07293 |
| 2025-10-08 | AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding | Shuqing Luo et.al. | 2510.07486 |
| 2025-10-08 | FLEET: Formal Language-Grounded Scheduling for Heterogeneous Robot Teams | Corban Rivera et.al. | 2510.07417 |
| 2025-10-07 | VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization | Dingyu Yao et.al. | 2510.06175 |
| 2025-10-07 | H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference | Harshil Vejendla et.al. | 2510.05529 |
| 2025-10-06 | ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering | Yuki Imajuku et.al. | 2506.09050_(NeurIPS) |
| 2025-10-06 | Multi-Agent Collaborative Intelligence: Dual-Dial Control for Reliable LLM Reasoning | Edward Y. Chang et.al. | 2510.04488 |
| 2025-10-03 | TokenFlow: Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling | Junyi Chen et.al. | 2510.02758_(EuroSys) |
| 2025-10-03 | Automatic Building Code Review: A Case Study | Hanlong Wan et.al. | 2510.02634 |
| 2025-10-02 | QSpec: Speculative Decoding with Complementary Quantization Schemes | Juntao Zhao et.al. | 2410.11305 |
| 2025-10-02 | KaVa: Latent Reasoning via Compressed KV-Cache Distillation | Anna Kuzina et.al. | 2510.02312 |
| 2025-10-02 | ElasticMoE: An Efficient Auto Scaling Method for Mixture-of-Experts Models | Gursimran Singh et.al. | 2510.02613_(ISS) |
| 2025-10-02 | KVComm: Enabling Efficient LLM Communication through Selective KV Sharing | Xiangyu Shi et.al. | 2510.03346 |
| 2025-10-01 | Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies | Kyoungmin Kim et.al. | 2411.07447 |
| 2025-09-30 | KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction | Jang-Hyun Kim et.al. | 2505.23416_(NeurIPS) |
| 2025-09-30 | AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges | Ranjan Sapkota et.al. | 2505.10468 |
| 2025-09-30 | Towards Agentic OS: An LLM Agent Framework for Linux Schedulers | Yusheng Zheng et.al. | 2509.01245 |
| 2025-09-29 | Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models | Keda Tao et.al. | 2503.16257 |
| 2025-09-29 | KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation | Ching Han Chen et.al. | 2505.07618 |
| 2025-09-29 | SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching | Yuxuan Zhu et.al. | 2504.00970 |
| 2025-09-29 | SparseServe: Unlocking Parallelism for Dynamic Sparse Attention in Long-Context LLM Serving | Qihui Zhou et.al. | 2509.24626 |
| 2025-09-29 | SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents | Gyuhyeon Seo et.al. | 2509.24282 |
| 2025-09-28 | From An LLM Swarm To A PDDL-Empowered HIVE: Planning Self-Executed Instructions In A Multi-Modal Jungle | Kaustubh Vyas et.al. | 2412.12839_(ICLR) |
| 2025-09-28 | HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models | Zhinan Xie et.al. | 2509.23928 |
| 2025-09-27 | Runtime Adaptive Pruning for LLM Inference | Huanrong Liu et.al. | 2505.17138 |
| 2025-09-27 | ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration | Xianglong Yan et.al. | 2505.24357 |
| 2025-09-27 | READER: Retrieval-Assisted Drafter for Efficient LLM Inference | Maxim Divilkovskiy et.al. | 2508.09072 |
| 2025-09-26 | KV Cache Steering for Controlling Frozen LLMs | Max Belitsky et.al. | 2507.08799 |
| 2025-09-26 | ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration | Gaole Dai et.al. | 2509.21823 |
| 2025-09-26 | Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents | Heyang Gao et.al. | 2510.03253 |
| 2025-09-26 | LLM Assisted Alpha Fairness for 6 GHz WiFi and NR-U Coexistence: An Agentic Orchestrator for Throughput, Energy, and SLA | Qun Wang et.al. | 2510.17814 |
| 2025-09-25 | HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling | Zahra Yousefijamarani et.al. | 2508.15919 |
| 2025-09-25 | Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization | Yuhang Xu et.al. | 2509.21301 |
| 2025-09-24 | UNComp: Can Matrix Entropy Uncover Sparsity? – A Compressor Design from an Uncertainty-Aware Perspective | Jing Xiong et.al. | 2410.03090_(EMNLP) |
| 2025-09-24 | Gyges: Dynamic Cross-Instance Parallelism Transformation for Efficient LLM Inference | Haoyu Chen et.al. | 2509.19729 |
| 2025-09-24 | CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks | Jiewei Chen et.al. | 2509.19855 |
| 2025-09-22 | A Large Language Model-based multi-agent manufacturing system for intelligent shopfloor | Zhen Zhao et.al. | 2405.16887 |
| 2025-09-22 | Attention Sinks: A ‘Catch, Tag, Release’ Mechanism for Embeddings | Stephen Zhang et.al. | 2502.00919 |
| 2025-09-22 | Asteria: Semantic-Aware Cross-Region Caching for Agentic LLM Tool Access | Chaoyi Ruan et.al. | 2509.17360 |
| 2025-09-21 | ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching | Xingyu Xiang et.al. | 2509.16857 |
| 2025-09-20 | EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs | Zhengge Cai et.al. | 2509.16686 |
| 2025-09-20 | Time to Talk: LLM Agents for Asynchronous Group Communication in Mafia Games | Niv Eckhaus et.al. | 2506.05309 |
| 2025-09-19 | Overhearing LLM Agents: A Survey, Taxonomy, and Roadmap | Andrew Zhu et.al. | 2509.16325 |
| 2025-09-17 | CrowdAgent: Multi-Agent Managed Multi-Source Annotation System | Maosheng Qin et.al. | 2509.14030 |
| 2025-09-16 | FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference | Dongwei Wang et.al. | 2508.08256_(EMNLP) |
| 2025-09-15 | Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System | Yunhua Fang et.al. | 2508.13231_(CHI) |
| 2025-09-15 | FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving | Kyungmin Bin et.al. | 2509.06261 |
| 2025-09-08 | Amplifying Effective CXL Memory Bandwidth for LLM Inference via Transparent Near-Data Processing | Rui Xie et.al. | 2509.03377 |
| 2025-09-08 | Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models | Yinjie Wang et.al. | 2509.06949 |
| 2025-09-03 | ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving | Yifan Qiao et.al. | 2410.01228 |
| 2025-09-01 | TRACE-CS: A Hybrid Logic-LLM System for Explainable Course Scheduling | Stylianos Loukas Vasileiou et.al. | 2409.03671 |
| 2025-09-01 | LLMs cannot spot math errors, even when allowed to peek into the solution | KV Aditya Srivatsa et.al. | 2509.01395_(EMNLP) |
| 2025-08-30 | DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction | Yanqi Zhang et.al. | 2412.03131_(SOSP) |
| 2025-08-30 | LLM-Assisted Iterative Evolution with Swarm Intelligence Toward SuperBrain | Li Weigang et.al. | 2509.00510 |
| 2025-08-29 | Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward | Yong Deng et.al. | 2508.12800 |
| 2025-08-28 | TinyServe: Query-Aware Cache Selection for Efficient LLM Serving | Dong Liu et.al. | 2509.12211_(ACM MM) |
| 2025-08-26 | Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing | Junyi Wen et.al. | 2507.08045 |
| 2025-08-26 | Strata: Hierarchical Context Caching for Long Context Language Model Serving | Zhiqiang Xie et.al. | 2508.18572 |
| 2025-08-24 | PRISM: Efficient Long-Range Reasoning With Short-Context LLMs | Dulhan Jayalath et.al. | 2412.18914_(EMNLP) |
| 2025-08-21 | Efficient Mixed-Precision Large Language Model Inference with TurboMind | Li Zhang et.al. | 2508.15601 |
| 2025-08-20 | Entropy-Constrained Strategy Optimization in Urban Floods: A Multi-Agent Framework with LLM and Knowledge Graph Integration | Peilin Ji et.al. | 2508.14654 |
| 2025-08-18 | Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis | Ayoub Ben Chaliah et.al. | 2508.13382 |
| 2025-08-17 | ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads | Zhuorui Liu et.al. | 2508.12407 |
| 2025-08-15 | UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs? | Mukund Choudhary et.al. | 2508.11260 |
| 2025-08-14 | SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression | Mengjie Li et.al. | 2508.15806 |
| 2025-08-14 | ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs | Keyu Chen et.al. | 2508.08895 |
| 2025-08-12 | Retrospective Sparse Attention for Efficient Long-Context Generation | Seonghwan Choi et.al. | 2508.09001 |
| 2025-08-12 | Chimera: Harnessing Multi-Agent LLMs for Automatic Insider Threat Simulation | Jiongchi Yu et.al. | 2508.07745 |
| 2025-08-12 | AIOS: LLM Agent Operating System | Kai Mei et.al. | 2403.16971 |
| 2025-08-11 | Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories | Ming-Yen Lee et.al. | 2508.08457 |
| 2025-08-11 | From Natural Language to Solver-Ready Power System Optimization: An LLM-Assisted, Validation-in-the-Loop Framework | Yunkai Hu et.al. | 2508.08147 |
| 2025-08-09 | Kairos: Low-latency Multi-Agent Serving with Shared LLMs and Excessive Loads in the Public Cloud | Jinyuan Chen et.al. | 2508.06948 |
| 2025-08-06 | p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay | Jun Zhang et.al. | 2412.04449_(ICC) |
| 2025-08-06 | StackPilot: Autonomous Function Agents for Scalable and Environment-Free Code Execution | Xinkui Zhao et.al. | 2508.11665 |
| 2025-08-06 | AquaChat++: LLM-Assisted Multi-ROV Inspection for Aquaculture Net Pens with Integrated Battery Management and Thruster Fault Tolerance | Abdelhaleem Saad et.al. | 2508.06554 |
| 2025-08-05 | REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks | Longling Geng et.al. | 2502.18836 |
| 2025-08-04 | CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation | Xiaolin Lin et.al. | 2508.02401 |
| 2025-08-01 | CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization | Yuning Jiang et.al. | 2508.00478 |
| 2025-07-30 | A Survey on Large Language Model Acceleration based on KV Cache Management | Haoyang Li et.al. | 2412.19442 |
| 2025-07-29 | Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling | Rajeev Patwari et.al. | 2508.00904 |
| 2025-07-29 | StaffPro: an LLM Agent for Joint Staffing and Profiling | Alessio Maritan et.al. | 2507.21636 |
| 2025-07-26 | FAEDKV: Infinite-Window Fourier Transform for Unbiased KV Cache Compression | Runchao Li et.al. | 2507.20030 |
| 2025-07-25 | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | StepFun et.al. | 2507.19427 |
| 2025-07-24 | NeuralDB: Scaling Knowledge Editing in LLMs to 100,000 Facts with Neural KV Database | Weizhi Fei et.al. | 2507.18028 |
| 2025-07-22 | Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning | Hongyin Luo et.al. | 2507.16784 |
| 2025-07-21 | LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra | Seth Karten et.al. | 2507.15815 |
| 2025-07-18 | DREAMS: Density Functional Theory Based Research Engine for Agentic Materials Simulation | Ziqi Wang et.al. | 2507.14267 |
| 2025-07-18 | CodeEdu: A Multi-Agent Collaborative Platform for Personalized Coding Education | Jianing Zhao et.al. | 2507.13814 |
| 2025-07-14 | InstCache: A Predictive Cache for LLM Serving | Longwei Zou et.al. | 2411.13820 |
| 2025-07-14 | DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving | Yuhan Liu et.al. | 2411.02820 |
| 2025-07-10 | Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores | Vivek Chari et.al. | 2507.08143 |
| 2025-07-10 | KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows | Zaifeng Pan et.al. | 2507.07400 |
| 2025-07-09 | Gradientsys: A Multi-Agent LLM Scheduler with ReAct Orchestration | Xinyuan Song et.al. | 2507.06520 |
| 2025-07-08 | OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety | Sanidhya Vijayvargiya et.al. | 2507.06134 |
| 2025-07-07 | StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling | Meng Wei et.al. | 2507.05240 |
| 2025-07-04 | Hunyuan-TurboS: Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought | Tencent Hunyuan Team et.al. | 2505.15431 |
| 2025-07-01 | VEDA: Efficient LLM Generation Through Voting-based KV Cache Eviction and Dataflow-flexible Accelerator | Zhican Wang et.al. | 2507.00797_(DAC) |
| 2025-07-01 | EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens | Chaoqun Yang et.al. | 2507.00715_(KDD) |
| 2025-06-30 | Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC | Xinming Wei et.al. | 2506.24045 |
| 2025-06-30 | RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference | Yaoqi Chen et.al. | 2505.02922 |
| 2025-06-28 | Efficiently Serving Large Multimodal Models Using EPD Disaggregation | Gursimran Singh et.al. | 2501.05460 |
| 2025-06-28 | FairMarket-RL: LLM-Guided Fairness Shaping for Multi-Agent Reinforcement Learning in Peer-to-Peer Markets | Shrenik Jadhav et.al. | 2506.22708 |
| 2025-06-27 | Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference | Yaohua Tang et.al. | 2502.15294 |
| 2025-06-26 | CitySim: Modeling Urban Behaviors and City Dynamics with Large-Scale LLM-Driven Agent Simulation | Nicolas Bougie et.al. | 2506.21805 |
| 2025-06-26 | MobiVerse: Scaling Urban Mobility Simulation with Hybrid Lightweight Domain-Specific Generator and Large Language Models | Yifan Liu et.al. | 2506.21784 |
| 2025-06-25 | MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation | Gurusha Juneja et.al. | 2506.20737 |
| 2025-06-23 | RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding | Guanzheng Chen et.al. | 2502.20330_(ICML) |
| 2025-06-18 | eLLM: Elastic Memory Management Framework for Efficient LLM Serving | Jiale Xu et.al. | 2506.15155 |
| 2025-06-18 | Moment Sampling in Video LLMs for Long-Form Video QA | Mustafa Chasmai et.al. | 2507.00033_(CVPR) |
| 2025-06-17 | LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification | Penghui Yang et.al. | 2502.17421 |
| 2025-06-16 | AlphaEvolve: A coding agent for scientific and algorithmic discovery | Alexander Novikov et.al. | 2506.13131 |
| 2025-06-14 | ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression | Guangda Liu et.al. | 2412.03213 |
| 2025-06-13 | FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference | Runheng Liu et.al. | 2405.04065_(ACL) |
| 2025-06-12 | SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding | Ziyi Zhang et.al. | 2506.11309 |
| 2025-06-12 | SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models | Kaushal Kumar Maurya et.al. | 2408.08545 |
| 2025-06-11 | SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems | Peiran Li et.al. | 2506.07564 |
| 2025-06-09 | DeepServe: Serverless Large Language Model Serving at Scale | Junhao Hu et.al. | 2501.14417 |
| 2025-06-08 | MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache | Akshat Sharma et.al. | 2411.18077 |
| 2025-06-07 | EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments | Sara Fish et.al. | 2503.18825 |
| 2025-06-05 | Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback | Junior Cedric Tonga et.al. | 2506.04920_(ISS) |
| 2025-06-04 | KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation | Chaoyi Jiang et.al. | 2411.17089_(ACL) |
| 2025-06-04 | HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing | Minghui Liu et.al. | 2412.16187 |
| 2025-06-04 | AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance | Dhaval Patel et.al. | 2506.03828 |
| 2025-06-03 | A²ATS: Retrieval-Based KV Cache Reduction via Windowed Rotary Position Embedding and Query-Aware Vector Quantization | Junhui He et.al. | 2502.12665 |
| 2025-06-03 | SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation | Jialong Wu et.al. | 2412.13649_(ACL) |
| 2025-06-02 | SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation | Aurick Qiao et.al. | 2410.03960 |
| 2025-06-01 | A Survey of LLM × DATA | Xuanhe Zhou et.al. | 2505.18458 |
| 2025-05-30 | RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning | Junhao Hu et.al. | 2502.11147 |
| 2025-05-30 | Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding | Feiyu Yao et.al. | 2506.15704 |
| 2025-05-29 | EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse | Tianyu Guo et.al. | 2505.21889_(DIS) |
| 2025-05-28 | gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling | Tianyu Guo et.al. | 2504.14775 |
| 2025-05-28 | KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization | Coleman Hooper et.al. | 2401.18079_(NeurIPS) |
| 2025-05-28 | Design and testing of an agent chatbot supporting decision making with public transport data | Luca Fantin et.al. | 2505.22698 |
| 2025-05-27 | Hardware-Efficient Attention for Fast Decoding | Ted Zadouri et.al. | 2505.21487 |
| 2025-05-27 | TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization | Dingyu Yao et.al. | 2505.19586 |
| 2025-05-27 | EPIC: Efficient Position-Independent Caching for Serving Large Language Models | Junhao Hu et.al. | 2410.15332 |
| 2025-05-26 | PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving | Ahmet Caner Yüzügüler et.al. | 2501.08192 |
| 2025-05-26 | BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems | Yuxin Wang et.al. | 2401.17644 |
| 2025-05-26 | Task Memory Engine: Spatial Memory for Robust Multi-Step LLM Agents | Ye Ye et.al. | 2505.19436 |
| 2025-05-24 | PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs | Tengxuan Liu et.al. | 2505.18610 |
| 2025-05-23 | Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence | Amirhosein Ghasemabadi et.al. | 2505.20325 |
| 2025-05-23 | ThinkLess: A Training-Free Inference-Efficient Method for Reducing Reasoning Redundancy | Gengyang Li et.al. | 2505.15684 |
| 2025-05-23 | Curriculum Guided Reinforcement Learning for Efficient Multi Hop Retrieval Augmented Generation | Yuelyu Ji et.al. | 2505.17391 |
| 2025-05-23 | Boosting Long-Context Management via Query-Guided Activation Refilling | Hongjin Qian et.al. | 2412.12486_(ACL) |
| 2025-05-23 | Mitigate Position Bias in Large Language Models via Scaling a Single Dimension | Yijiong Yu et.al. | 2406.02536_(ACL) |
| 2025-05-21 | Can LLMs Maintain Fundamental Abilities under KV Cache Compression? | Xiang Liu et.al. | 2502.01941 |
| 2025-05-21 | LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval | Zhenyu Ning et.al. | 2505.15269 |
| 2025-05-20 | CE-LSLM: Efficient Large-Small Language Model Inference and Communication via Cloud-Edge Collaboration | Pengyan Zhu et.al. | 2505.14085 |
| 2025-05-20 | Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation | Peter Baile Chen et.al. | 2505.14398 |
| 2025-05-20 | Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding | Sakhinana Sagar Srinivas et.al. | 2504.01281 |
| 2025-05-19 | SubGCache: Accelerating Graph-based RAG with Subgraph-level KV Cache | Qiuyu Zhu et.al. | 2505.10951 |
| 2025-05-19 | Learning Virtual Machine Scheduling in Cloud Computing through Language Agents | JieHao Wu et.al. | 2505.10117 |
| 2025-05-18 | ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning | Edward Y. Chang et.al. | 2505.12501 |
| 2025-05-17 | Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents | Tiannuo Yang et.al. | 2505.12065 |
| 2025-05-17 | OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents | Raghav Thind et.al. | 2504.16918 |
| 2025-05-16 | KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse | Huan Yang et.al. | 2503.16525 |
| 2025-05-14 | Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization | Minsu Kim et.al. | 2503.18599 |
| 2025-05-13 | Gradual Binary Search and Dimension Expansion : A general method for activation quantization in LLMs | Lucas Maisonnave et.al. | 2504.13989 |
| 2025-05-12 | SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models | Hang Wu et.al. | 2505.07680 |
| 2025-05-12 | PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications | Kuntai Du et.al. | 2505.07203 |
| 2025-05-09 | Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM | Zehao Fan et.al. | 2505.05772 |
| 2025-05-01 | QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving | Yujun Lin et.al. | 2405.04532 |
| 2025-04-28 | semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage | Ke Hong et.al. | 2504.19867 |
| 2025-04-25 | ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference | Hanshi Sun et.al. | 2410.21465 |
| 2025-04-24 | L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference | Qingyuan Liu et.al. | 2504.17584 |
| 2025-04-24 | Tempo: Application-aware LLM Serving with Mixed SLO Requirements | Wei Zhang et.al. | 2504.20068 |
| 2025-04-24 | Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents | Yueying Li et.al. | 2504.07347 |
| 2025-04-22 | Optimizing SLO-oriented LLM Serving with PD-Multiplexing | Weihao Cui et.al. | 2504.14489 |
| 2025-04-21 | LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention | Shang Yang et.al. | 2502.14866 |
| 2025-04-21 | FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving | Zihao Ye et.al. | 2501.01005 |
| 2025-04-21 | PLANET: A Collection of Benchmarks for Evaluating LLMs’ Planning Capabilities | Haoming Li et.al. | 2504.14773 |
| 2025-04-19 | Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management | Hang Zhang et.al. | 2505.03756 |
| 2025-04-16 | Cost-Efficient LLM Serving in the Cloud: VM Selection with KV Cache Offloading | Kihyun Kim et.al. | 2504.11816 |
| 2025-04-16 | Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs | Hyungwoo Lee et.al. | 2504.11765 |
| 2025-04-14 | AlayaDB: The Data Foundation for Efficient and Effective Long-context LLM Inference | Yangshen Deng et.al. | 2504.10326 |
| 2025-04-13 | Efficient LLM Serving on Hybrid Real-time and Best-effort Requests | Wan Borui et.al. | 2504.09590 |
| 2025-04-13 | Block-Attention for Efficient Prefilling | Dongyang Ma et.al. | 2409.15355_(ICLR) |
| 2025-04-10 | Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | Shihong Gao et.al. | 2504.07494 |
| 2025-04-09 | Optimizing LLM Queries in Relational Data Analytics Workloads | Shu Liu et.al. | 2403.05821 |
| 2025-04-09 | MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation | Hongjin Qian et.al. | 2409.05591_(TheWebConf) |
| 2025-04-03 | CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | Jiayi Yao et.al. | 2405.16444 |
| 2025-04-03 | HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse | Yuwei An et.al. | 2504.02921 |
| 2025-04-02 | MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding | Ranajoy Sadhukhan et.al. | 2408.11049 |
| 2025-04-01 | Personality-Driven Decision-Making in LLM-Based Autonomous Agents | Lewis Newsham et.al. | 2504.00727 |
| 2025-04-01 | HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents | Shiyi Liu et.al. | 2504.00434 |
| 2025-03-31 | Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving | Wei Gao et.al. | 2503.24000 |
| 2025-03-31 | Training-Free Exponential Context Extension via Cascading KV Cache | Jeffrey Willette et.al. | 2406.17808 |
| 2025-03-25 | Agent-Initiated Interaction in Phone UI Automation | Noam Kahlon et.al. | 2503.19537 |
| 2025-03-24 | Mitigating KV Cache Competition to Enhance User Experience in LLM Inference | Haiying Shen et.al. | 2503.13773 |
| 2025-03-19 | Exploring Large Language Models for Word Games: Who is the Spy? | Chentian Wei et.al. | 2503.15235 |
| 2025-03-12 | COLA: A Scalable Multi-Agent Framework For Windows UI Task Automation | Di Zhao et.al. | 2503.09263 |
| 2025-03-11 | FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework | Jianian Zhu et.al. | 2503.08461 |
| 2025-03-11 | LLM4MAC: An LLM-Driven Reinforcement Learning Framework for MAC Protocol Emergence | Renxuan Tan et.al. | 2503.08123 |
| 2025-03-11 | Agent-Oriented Planning in Multi-Agent Systems | Ao Li et.al. | 2410.02189_(ICLR) |
| 2025-03-11 | SCBench: A KV Cache-Centric Analysis of Long-Context Methods | Yucheng Li et.al. | 2412.10319_(ICLR) |
| 2025-03-10 | Queueing, Predictions, and LLMs: Challenges and Open Problems | Michael Mitzenmacher et.al. | 2503.07545 |
| 2025-03-10 | DynTaskMAS: A Dynamic Task Graph-driven Framework for Asynchronous and Parallel LLM-based Multi-Agent Systems | Junwei Yu et.al. | 2503.07675 |
| 2025-03-10 | TokenButler: Token Importance is Predictable | Yash Akhauri et.al. | 2503.07518 |
| 2025-03-09 | Seesaw: High-throughput LLM Inference via Model Re-sharding | Qidong Su et.al. | 2503.06433 |
| 2025-03-07 | DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference | Jinwei Yao et.al. | 2404.00242_(DATE) |
| 2025-03-06 | LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression | Souvik Kundu et.al. | 2503.04982_(ACL) |
| 2025-03-06 | Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning | Giulio Corallo et.al. | 2503.04973 |
| 2025-03-06 | Markov Chain of Thought for Efficient Mathematical Reasoning | Wen Yang et.al. | 2410.17635_(ACL) |
| 2025-03-06 | DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) | Zongxin Yang et.al. | 2401.08392 |
| 2025-03-05 | Pretrained LLMs as Real-Time Controllers for Robot Operated Serial Production Line | Muhammad Waseem et.al. | 2503.03889 |
| 2025-03-04 | Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression | Nathan Godey et.al. | 2503.02812 |
| 2025-03-03 | WeightedKV: Attention Scores Weighted Key-Value Cache Merging for Large Language Models | Jian Yuan et.al. | 2503.01330_(ICASSP) |
| 2025-03-01 | Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving | Qihui Zhou et.al. | 2503.00392 |
| 2025-03-01 | Streaming Video Question-Answering with In-context Video KV-Cache Retrieval | Shangzhe Di et.al. | 2503.00540_(ICLR) |
| 2025-02-28 | ARIES: Autonomous Reasoning with LLMs on Interactive Thought Graph Environments | Pedro Gimenes et.al. | 2502.21208 |
| 2025-02-27 | ThinK: Thinner Key Cache by Query-Driven Pruning | Yuhui Xu et.al. | 2407.21018_(ICLR) |
| 2025-02-27 | TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning | Soumyabrata Chaudhuri et.al. | 2502.20508 |
| 2025-02-27 | EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance | Yingxin Li et.al. | 2412.08521 |
| 2025-02-24 | ELMo-Tune-V2: LLM-Assisted Full-Cycle Auto-Tuning to Optimize LSM-Based Key-Value Stores | Viraj Thakkar et.al. | 2502.17606 |
| 2025-02-24 | The Lottery LLM Hypothesis, Rethinking What Abilities Should LLM Compression Preserve? | Zhenheng Tang et.al. | 2502.17535 |
| 2025-02-22 | AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure | The AIBrix Team et.al. | 2504.03648 |
| 2025-02-20 | Compute Or Load KV Cache? Why Not Both? | Shuowei Jin et.al. | 2410.03065 |
| 2025-02-20 | SpinQuant: LLM quantization with learned rotations | Zechun Liu et.al. | 2405.16406_(ICLR) |
| 2025-02-20 | Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents | Axel Backlund et.al. | 2502.15840 |
| 2025-02-20 | Plan-over-Graph: Towards Parallelable LLM Agent Schedule | Shiqi Zhang et.al. | 2502.14563 |
| 2025-02-20 | EpMAN: Episodic Memory AttentioN for Generalizing to Longer Contexts | Subhajit Chaudhury et.al. | 2502.14280 |
| 2025-02-20 | More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression | Jiebin Zhang et.al. | 2412.12706 |
| 2025-02-19 | Autellix: An Efficient Serving Engine for LLM Agents as General Programs | Michael Luo et.al. | 2502.13965 |
| 2025-02-19 | Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference | Qingfa Xiao et.al. | 2502.13542 |
| 2025-02-17 | Does RAG Really Perform Bad For Long-Context Processing? | Kun Luo et.al. | 2502.11444 |
| 2025-02-16 | An Intelligent Agentic System for Complex Image Restoration Problems | Kaiwen Zhu et.al. | 2410.17809_(ICLR) |
| 2025-02-16 | CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation | Kun-Hui Lee et.al. | 2502.11101 |
| 2025-02-11 | HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment | Youhe Jiang et.al. | 2502.07903_(ICLR) |
| 2025-02-06 | Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents | Chenyang Shao et.al. | 2502.04392 |
| 2025-02-05 | QRazor: Reliable and Effortless 4-bit LLM Quantization by Significant Data Razoring | Dongyoung Lee et.al. | 2501.13331 |
| 2025-02-05 | Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation | Shubham Agarwal et.al. | 2502.15734_(SIGMOD) |
| 2025-02-04 | LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation | Xuan Zhang et.al. | 2410.13846 |
| 2025-02-02 | RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations | Zunhai Su et.al. | 2501.16383 |
| 2025-01-29 | vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | Ramya Prabhu et.al. | 2405.04437_(ASPLOS) |
| 2025-01-29 | MACI: Multi-Agent Collaborative Intelligence for Adaptive Reasoning and Temporal Planning | Edward Y. Chang et.al. | 2501.16689 |
| 2025-01-27 | PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization | Mengzhao Chen et.al. | 2410.05265 |
| 2025-01-27 | LLM-powered Multi-agent Framework for Goal-oriented Learning in Intelligent Tutoring System | Tianfu Wang et.al. | 2501.15749_(WWW) |
| 2025-01-25 | Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads | Xingyang He et.al. | 2501.15113 |
| 2025-01-23 | A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention | Heejun Lee et.al. | 2406.09827 |
| 2025-01-22 | Yi-Lightning Technical Report | Alan Wake et.al. | 2412.01253 |
| 2025-01-17 | BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching | Zhen Zheng et.al. | 2412.03594 |
| 2025-01-14 | CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning | Guoliang He et.al. | 2501.08071_(CGO) |
| 2025-01-12 | Mell: Memory-Efficient Large Language Model Serving via Multi-GPU KV Cache Management | Liu Qianli et.al. | 2501.06709 |
| 2025-01-06 | The Power of Negative Zero: Datatype Customization for Quantized Large Language Models | Yuzong Chen et.al. | 2501.04052_(ISS) |
| 2024-12-31 | RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval | Di Liu et.al. | 2409.10516 |
| 2024-12-24 | TimelyLLM: Segmented LLM Serving System for Time-sensitive Robotic Applications | Neiwen Ling et.al. | 2412.18695 |
| 2024-12-23 | Deliberation in Latent Space via Differentiable Cache Augmentation | Luyang Liu et.al. | 2412.17747 |
| 2024-12-22 | VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things | Yaoyao Zhong et.al. | 2312.00401_(AAAI) |
| 2024-12-21 | MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | Cunchen Hu et.al. | 2406.17565 |
| 2024-12-21 | SYMPHONY: Improving Memory Management for LLM Inference Workloads | Saurabh Agarwal et.al. | 2412.16434 |
| 2024-12-18 | MagicPIG: LSH Sampling for Efficient LLM Generation | Zhuoming Chen et.al. | 2410.16179 |
| 2024-12-17 | A System for Microserving of LLMs | Hongyi Jin et.al. | 2412.12488 |
| 2024-12-16 | CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation | Hongxuan Zhang et.al. | 2412.11741 |
| 2024-12-16 | Steering Language Models with Game-Theoretic Solvers | Ian Gemp et.al. | 2402.01704 |
| 2024-12-15 | LAW: Legal Agentic Workflows for Custody and Fund Services Contracts | William Watson et.al. | 2412.11063_(COLING) |
| 2024-12-13 | KVDirect: Distributed Disaggregated LLM Inference | Shiyang Chen et.al. | 2501.14743 |
| 2024-12-06 | Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free Dynamic Triangular Attention Pattern | Hongyin Tang et.al. | 2412.04757 |
| 2024-12-05 | A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts | Suyu Ge et.al. | 2410.01485 |
| 2024-11-27 | FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving | Ao Shen et.al. | 2411.18424 |
| 2024-11-22 | Rapid Integration of LLMs in Healthcare Raises Ethical Concerns: An Investigation into Deceptive Patterns in Social Robots | Robert Ranisch et.al. | 2410.00434 |
| 2024-11-14 | Large Language Models for Power Scheduling: A User-Centric Approach | Thomas Mongaillard et.al. | 2407.00476 |
| 2024-11-08 | Eigen Attention: Attention in Low-Rank Space for KV Cache Compression | Utkarsh Saxena et.al. | 2408.05646 |
| 2024-11-05 | AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution | Zhiqiang Xie et.al. | 2411.03519 |
| 2024-11-05 | SAUCE: Synchronous and Asynchronous User-Customizable Environment for Multi-Agent LLM Interaction | Shlomo Neuberger et.al. | 2411.03397 |
| 2024-11-03 | A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression | Alessio Devoto et.al. | 2406.11430_(EMNLP) |
| 2024-11-02 | NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | Xuanlin Jiang et.al. | 2411.01142 |
| 2024-11-01 | Understanding Communication Preferences of Information Workers in Engagement with Text-Based Conversational Agents | Ananya Bhattacharjee et.al. | 2410.20468 |
| 2024-10-31 | ALISE: Accelerating Large Language Model Serving with Speculative Scheduling | Youpeng Zhao et.al. | 2410.23537_(ICC) |
| 2024-10-25 | Fast Inference for Augmented Large Language Models | Rana Shahout et.al. | 2410.18248 |
| 2024-10-21 | Do Large Language Models Need a Content Delivery Network? | Yihua Cheng et.al. | 2409.13761 |
| 2024-10-17 | LLoCO: Learning Long Contexts Offline | Sijun Tan et.al. | 2404.07979_(EMNLP) |
| 2024-10-16 | COMET: Towards Practical W4A4KV4 LLMs Serving | Lian Liu et.al. | 2410.12168 |
| 2024-10-14 | DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads | Guangxuan Xiao et.al. | 2410.10819 |
| 2024-10-11 | OpenCity: A Scalable Platform to Simulate Urban Activities with Massive LLM Agents | Yuwei Yan et.al. | 2410.21286 |
| 2024-10-09 | LayerKV: Optimizing Large Language Model Serving with Layer-wise KV Cache Management | Yi Xiong et.al. | 2410.00428 |
| 2024-10-08 | KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches | Jiayi Yuan et.al. | 2407.01527 |
| 2024-10-07 | Fast State Restoration in LLM Serving with HCache | Shiwei Gao et.al. | 2410.05004_(EuroSys) |
| 2024-10-07 | KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head | Isaac Rehg et.al. | 2410.00161 |
| 2024-10-06 | SafeLLM: Domain-Specific Safety Monitoring for Large Language Models: A Case Study of Offshore Wind Maintenance | Connor Walker et.al. | 2410.10852 |
| 2024-10-04 | LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy | Rongzhi Zhang et.al. | 2410.03111 |
| 2024-10-03 | Preble: Efficient Distributed Prompt Scheduling for LLM Serving | Vikranth Srivatsa et.al. | 2407.00023 |
| 2024-10-03 | Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1 | Karthik Valmeekam et.al. | 2410.02162 |
| 2024-09-23 | BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models | Bodun Hu et.al. | 2404.18322 |
| 2024-09-16 | Scalable Differential Privacy Mechanisms for Real-Time Machine Learning Applications | Jessica Smith et.al. | 2410.02462 |
| 2024-09-11 | Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU | Zhenyu Ning et.al. | 2409.09086 |
| 2024-08-05 | SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving | Andreas Kosmas Kakolyris et.al. | 2408.05235 |
| 2024-08-04 | TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding | Hanshi Sun et.al. | 2404.11912 |
| 2024-08-01 | ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition | Lu Ye et.al. | 2402.15220_(ACL) |
| 2024-07-26 | Collaborative Evolving Strategy for Automatic Data-Centric Development | Xu Yang et.al. | 2407.18690 |
| 2024-07-25 | KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache | Zirui Liu et.al. | 2402.02750_(ICML) |
| 2024-07-23 | Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference | Piotr Nawrot et.al. | 2403.09636 |
| 2024-07-22 | vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving | Jiale Xu et.al. | 2407.15309 |
| 2024-07-22 | RazorAttention: Efficient KV Cache Compression Through Retrieval Heads | Hanlin Tang et.al. | 2407.15891 |
| 2024-07-21 | Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks | Zheng Wang et.al. | 2407.08454 |
| 2024-07-19 | CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving | Yuhan Liu et.al. | 2310.07240_(SIGCOMM) |
| 2024-07-18 | QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead | Amir Zandieh et.al. | 2406.03482 |
| 2024-07-11 | Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs | Ben Athiwaratkun et.al. | 2403.08845 |
| 2024-06-30 | Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention | Bin Gao et.al. | 2403.19708_(ATC) |
| 2024-06-28 | InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management | Wonbeom Lee et.al. | 2406.19707_(OSDI) |
| 2024-06-16 | EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism | Yanxi Chen et.al. | 2312.04916_(ICML) |
| 2024-06-08 | QCQA: Quality and Capacity-aware grouped Query Attention | Vinay Joshi et.al. | 2406.10247 |
| 2024-06-06 | SGLang: Efficient Execution of Structured Language Model Programs | Lianmin Zheng et.al. | 2312.07104 |
| 2024-05-13 | Hydragen: High-Throughput LLM Inference with Shared Prefixes | Jordan Juravsky et.al. | 2402.05099 |
| 2024-05-06 | Federated Reinforcement Learning with Constraint Heterogeneity | Hao Jin et.al. | 2405.03236 |
| 2024-05-01 | Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing | KV Aditya Srivatsa et.al. | 2405.00467_(ACL) |
| 2024-04-15 | Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models | Siyan Zhao et.al. | 2404.09529 |
| 2024-04-06 | The Case for Developing a Foundation Model for Planning-like Tasks from Scratch | Biplav Srivastava et.al. | 2404.04540 |
| 2024-03-26 | ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching | Youpeng Zhao et.al. | 2403.17312_(ISCA) |
| 2024-03-18 | FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Jiaao He et.al. | 2403.11421 |
| 2024-03-04 | DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving | Foteini Strati et.al. | 2403.01876 |
| 2024-03-04 | LLM-based Smart Reply (LSR): Enhancing Collaborative Performance with ChatGPT-mediated Smart Reply System | Ashish Bastola et.al. | 2306.11980 |
| 2024-02-04 | Conversational Crowdsensing: A Parallel Intelligence Powered Novel Sensing Approach | Zhengqiu Zhu et.al. | 2402.06654 |
| 2024-01-20 | On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS) | Vishal Pallagani et.al. | 2401.02500 |
| 2023-12-26 | Natural Language based Context Modeling and Reasoning for Ubiquitous Computing with Large Language Models: A Tutorial | Haoyi Xiong et.al. | 2309.15074 |
| 2023-11-09 | Towards A Natural Language Interface for Flexible Multi-Agent Task Assignment | Jake Brawer et.al. | 2311.00153 |
| 2023-10-30 | SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models | Hongxin Li et.al. | 2305.19308_(NeurIPS) |
| 2023-09-19 | MindAgent: Emergent Gaming Interaction | Ran Gong et.al. | 2309.09971 |
| 2023-09-12 | Efficient Memory Management for Large Language Model Serving with PagedAttention | Woosuk Kwon et.al. | 2309.06180_(SOSP) |
| 2023-06-09 | S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput | Yunho Jin et.al. | 2306.06000 |