Mini-vLLM
I wanted to really understand how production LLM serving works under the hood, so I built my own inference engine inspired by vLLM. The key insight is that naive inference wastes a large share of GPU memory on the KV cache: the attention keys and values grow with every generated token, and naive engines reserve a contiguous buffer per request sized for the maximum possible sequence length, so most of that memory sits unused or fragmented.

Mini-vLLM implements PagedAttention, which manages the KV cache the way an operating system manages virtual memory: in fixed-size blocks that can be allocated and freed dynamically. This eliminates fragmentation and lets you serve way more concurrent requests. On top of that I added continuous batching (requests join and leave the running batch at every decoding step, keeping the GPU saturated), prefix caching (the KV blocks for a shared system prompt are computed once and reused across requests), and speculative decoding for faster generation.

The paged attention operations run on custom Triton GPU kernels I wrote. The engine achieves 2-3x the throughput of naive HuggingFace inference and exposes an OpenAI-compatible API, so it's a drop-in replacement.
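To make the block-based KV cache concrete, here is a minimal sketch of the kind of allocator and per-request block table this design rests on. The names (`BlockAllocator`, `Sequence`, `block_size=16`) are illustrative, not the engine's actual API.

```python
class BlockAllocator:
    """Hands out fixed-size KV cache blocks, the way an OS hands out pages (sketch)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens stored per block
        self.free_blocks = list(range(num_blocks))   # physical block ids carved out of GPU memory

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("KV cache exhausted: request must wait or be preempted")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """One request's logical-to-physical mapping: its block table."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # entry i holds tokens [i*block_size, (i+1)*block_size)
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the current one fills up, so no
        # memory is reserved up front for a maximum sequence length.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
```

The attention kernel then gathers keys and values through the block table instead of assuming each sequence owns one contiguous slab, which is essentially what the paged attention kernels have to do.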
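Continuous batching is easiest to see as a scheduler loop that re-forms the batch at every decoding step instead of waiting for a whole batch to finish. The sketch below is a toy model of that loop, not the real scheduler; `FakeSequence`, `serve`, and the batch size are made up for illustration.

```python
import random
from collections import deque


class FakeSequence:
    """Stand-in for a real request: generates tokens until a random stop (illustrative)."""

    def __init__(self, prompt_len: int):
        self.num_tokens = prompt_len
        self.done = False

    def step(self) -> None:
        self.num_tokens += 1
        self.done = random.random() < 0.05   # pretend an EOS token was sampled

    def is_finished(self) -> bool:
        return self.done


def serve(waiting: deque, max_batch_size: int = 8) -> None:
    """Continuous batching: admit and retire requests at every decoding step (sketch)."""
    running: list[FakeSequence] = []
    while waiting or running:
        # Admit queued requests whenever a slot opens, instead of waiting
        # for the whole batch to drain as static batching does.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decoding step for every running request; in the real engine this
        # is a single batched forward pass, not a Python loop.
        for seq in running:
            seq.step()

        # Retire finished requests immediately so their KV blocks can be
        # reused by the next request in the queue.
        running = [seq for seq in running if not seq.is_finished()]


serve(deque(FakeSequence(prompt_len=32) for _ in range(20)))
```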
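Prefix caching falls out of the block structure: if two prompts share a prefix that fills whole blocks, their block tables can point at the same physical blocks. Here is a rough sketch of the idea, assuming an allocator like the one above; a real implementation also needs reference counting and eviction so shared blocks aren't freed while still in use.

```python
import hashlib

# Maps a hash of "everything up to the end of this block" to the physical
# block id holding that block's keys/values.
prefix_cache: dict[str, int] = {}


def prompt_block_ids(prompt_tokens: list[int], allocator, block_size: int = 16) -> list[int]:
    """Return KV block ids for a prompt, reusing blocks for previously seen prefixes (sketch)."""
    block_table = []
    full_blocks_end = len(prompt_tokens) - len(prompt_tokens) % block_size
    for start in range(0, full_blocks_end, block_size):
        # The key covers the whole prefix up to this block's end, so two prompts
        # share a block only if they agree on everything before it.
        key = hashlib.sha256(repr(prompt_tokens[:start + block_size]).encode()).hexdigest()
        if key in prefix_cache:
            block_table.append(prefix_cache[key])    # cache hit: prefill for this block is skipped
        else:
            block_id = allocator.allocate()           # cache miss: prefill and remember the block
            prefix_cache[key] = block_id
            block_table.append(block_id)
    # The trailing partial block (if any) always gets a fresh block; not shown here.
    return block_table
```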
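Speculative decoding uses a small draft model to propose several tokens and the main model to check them. The toy below shows only a greedy propose-and-verify loop with stand-in models; production schemes verify with rejection sampling so the output distribution exactly matches the target model, and they score all proposed positions in a single batched forward pass.

```python
import random


def greedy_speculative_step(draft_next, target_next, tokens: list[int], k: int = 4) -> list[int]:
    """One speculative step: draft proposes k tokens, target keeps the agreeing prefix (toy sketch).

    draft_next / target_next are stand-ins for the small and large models:
    each maps a token list to the next token id.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target model verifies position by position; the first disagreement
    #    replaces the draft token and discards the rest of the proposal.
    out = list(tokens)
    for t in proposed:
        verified = target_next(out)
        out.append(verified)
        if verified != t:
            break
    return out


# Toy models that agree most of the time, so several draft tokens are usually accepted.
target = lambda ctx: (len(ctx) * 7) % 50
draft = lambda ctx: target(ctx) if random.random() < 0.8 else 0
print(greedy_speculative_step(draft, target, tokens=[1, 2, 3]))
```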
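Because the server speaks the OpenAI API, the official Python client can talk to it directly; only the base URL changes from a stock OpenAI setup. The port, route, and model name below are assumptions for illustration.

```python
from openai import OpenAI

# Point the official client at the local Mini-vLLM server instead of api.openai.com.
# Host, port, and model name here are placeholders, not the project's documented defaults.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```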