Research
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
As model serving scenarios become more diverse, it is increasingly challenging for generic LLM serving frameworks to support every combination of hardware, model, and application workload that users care about. VibeServe pursues an alternative: generate a bespoke LLM serving system per scenario, unlocking optimizations that are difficult to express in a generic serving stack. Across diverse scenarios, VibeServe's agent-generated serving systems achieve up to a 6.27x speedup through scenario-specific optimizations.
NEMO: Flexible, High-Fidelity Memory Telemetry
Modern servers use complex memory hierarchies that require active OS management, but the OS lacks low-overhead, accurate visibility into how memory is actually used, which leads to poor policy decisions. Existing approaches rely either on software that is flexible but has high overhead, or on hardware counters that are efficient but too coarse and inflexible. NEMO breaks this trade-off with HW/SW co-design. It introduces a small, programmable monitoring mechanism inside memory controllers that lets the OS directly track the memory activity of interest, achieving both accuracy and flexibility with low overhead.
NEMO has been accepted to OSDI '26; the camera-ready paper is forthcoming.
Self-Defining Operator
LLM agents can automate operational tasks for production applications, but they treat each session as a clean slate, repeatedly rediscovering the architecture and re-solving the same incidents. Self-Defining Operator (SDO) is a multi-agent system that deploys and operates production applications, carrying knowledge across sessions through a persistent knowledge base of architectural summaries, runbooks, and incident post-mortems. In early experiments, seeding SDO with accumulated context yields 2.5x fewer deployment iterations and 38% faster failure mitigation.
FlowGuard: Information Flow Control for Coding Agents
Coding agents powered by large language models are increasingly being used to automate software development tasks. However, these agents can inadvertently leak sensitive information or execute unauthorized operations when interacting with external systems. FlowGuard addresses these security concerns by applying information flow control to coding agents, ensuring that sensitive data and operations are properly isolated and controlled throughout the agent's execution.
Masa: Scheduling Microservice RPCs by SLO Slack
Microservice RPC runtimes leave goodput on the table by scheduling ready work with coarse signals such as local arrival order or static caller/API priority. These policies ignore how much downstream work remains, so they can delay RPCs that must run now to keep the end-to-end request on time. Masa improves goodput with distributed least-slack-first scheduling: services estimate remaining work online, propagate slack in RPC metadata, and prioritize ready RPCs by slack. Compared with state-of-the-art baselines, Masa improves goodput by up to 4.3x with fixed resources and saves up to 52% of replicas at matched goodput.
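The least-slack-first idea can be illustrated with a minimal single-node sketch. This is not Masa's implementation: the `Rpc` fields, the `LeastSlackFirstQueue` class, and the definition of slack as deadline minus current time minus estimated remaining work are assumptions made for illustration only.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Rpc:
    name: str
    deadline: float        # absolute end-to-end deadline (assumed to travel in RPC metadata)
    remaining_work: float  # estimated work still needed downstream, in seconds (assumed estimate)

    def slack(self, now: float) -> float:
        # Time this RPC can still wait before the end-to-end deadline is at risk.
        return self.deadline - now - self.remaining_work


class LeastSlackFirstQueue:
    """Hypothetical ready queue that always dequeues the RPC with the least slack."""

    def __init__(self) -> None:
        self._heap = []
        self._tiebreak = itertools.count()  # preserves FIFO order among equal keys

    def push(self, rpc: Rpc) -> None:
        # At any common instant, ordering by slack equals ordering by
        # (deadline - remaining_work), so the key does not depend on `now`.
        key = rpc.deadline - rpc.remaining_work
        heapq.heappush(self._heap, (key, next(self._tiebreak), rpc))

    def pop(self) -> Rpc:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = LeastSlackFirstQueue()
queue.push(Rpc("a", deadline=100.0, remaining_work=10.0))
queue.push(Rpc("b", deadline=80.0, remaining_work=5.0))
queue.push(Rpc("c", deadline=90.0, remaining_work=40.0))
order = [queue.pop().name for _ in range(3)]
```

In this example, `c` is dequeued before `b` even though `b` has the earlier deadline, because `c` has more downstream work left. A deadline-only or arrival-order policy would not capture that distinction.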
Under submission. Preprint available upon request.
Loom: Efficient Telemetry for Production Systems
Debugging production systems today requires a difficult trade-off between how much data is collected and how quickly it can be queried: indexing for fast queries often slows ingestion and forces data to be dropped. Furthermore, data collection must not slow down the application being monitored. Loom breaks this trade-off by showing that fast queries do not require indexing every record: coarse summaries over small chunks are sufficient. With a small CPU and memory footprint in production, Loom minimizes application interference.
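One way to picture querying without a per-record index is a min/max summary per chunk, sketched below. The source does not specify what Loom's summaries contain, so the `Chunk` layout, the choice of min/max bounds, and the `query_at_least` helper are all hypothetical and shown only to illustrate the general idea.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    records: List[float]  # raw values, appended without any per-record index
    lo: float             # smallest value in the chunk (assumed summary field)
    hi: float             # largest value in the chunk (assumed summary field)


def build_chunks(values: List[float], chunk_size: int) -> List[Chunk]:
    """Split values into fixed-size chunks and record a coarse summary for each."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        part = values[start:start + chunk_size]
        chunks.append(Chunk(records=part, lo=min(part), hi=max(part)))
    return chunks


def query_at_least(chunks: List[Chunk], threshold: float) -> Tuple[List[float], int]:
    """Return values >= threshold and the number of chunks actually scanned."""
    matches: List[float] = []
    scanned = 0
    for chunk in chunks:
        if chunk.hi < threshold:
            continue  # the summary proves no record in this chunk can match
        scanned += 1
        matches.extend(v for v in chunk.records if v >= threshold)
    return matches, scanned


latencies = [1, 2, 3, 4, 50, 2, 1, 3, 2, 2, 2, 2, 7, 99, 1, 0]
chunks = build_chunks(latencies, chunk_size=4)
matches, scanned = query_at_least(chunks, threshold=40)
```

Here the query scans only the chunks whose summary admits a match and skips the rest, while ingestion pays only for updating two bounds per chunk.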
Quicksand: Unstrand Resources with Granular Computing
Resource stranding causes substantial underutilization in datacenters and is commonly addressed through hardware-based disaggregation on emerging hardware. Quicksand argues that stranded resources can be reclaimed on today's hardware through a new programming framework layer. Its key insight is to decompose applications into fine-grained, resource-specific units that can be scheduled onto stranded resources. Quicksand performs this decomposition transparently and adjusts it dynamically at millisecond timescales to adapt to changing resource availability.