Research


Research Interests

My research focuses on the intersection of machine learning and systems, with particular emphasis on:

  • Machine Learning Systems: Optimizing the training and inference of large-scale ML models
  • Distributed Computing: Efficient parallelization strategies for deep learning workloads
  • Compiler Optimization: Automatic optimization of ML computations for various hardware backends
  • Memory Management: Novel techniques for training larger models with limited memory
  • Hardware-Software Co-design: Designing systems that leverage modern accelerators effectively

Featured Research: LinuxGuard

My primary focus at ADSL is LinuxGuard, a pipeline that learns from Linux kernel bug fixes to generate custom clang-tidy checkers. By mining commit history, the system builds AST matchers that flag unchecked error paths across kernels v3.0 through v6.0, turning each fixed vulnerability into a proactive safeguard.

  • End-to-end automation: bug mining, checker synthesis, LLVM build, and multi-version scans.
  • Flagship checker: linuxkernel-must-check-errs catches missing error handling at scale.
  • 200+ unchecked error flows surfaced per kernel release, revealing long-lived anti-patterns.
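To give a feel for what "missing error handling" means here, the following is a deliberately simplified Python sketch. It is not LinuxGuard's implementation: the real checkers are clang-tidy AST matchers, whereas this stand-in uses a line-based regular expression, and the `MUST_CHECK` list, the function names, and the sample C snippet are all hypothetical.

```python
import re

# Hypothetical set of functions whose return value should always be checked.
MUST_CHECK = {"copy_from_user", "clk_prepare_enable"}

# A call used as a bare statement: indentation, a name, "(...)", then ";".
# This is a crude textual stand-in for an AST matcher, not a C parser.
_BARE_CALL = re.compile(r"^\s*(?P<name>[A-Za-z_]\w*)\s*\(.*\)\s*;\s*$")


def find_unchecked_calls(source: str) -> list[tuple[int, str]]:
    """Return (line_number, function_name) for must-check calls whose result is discarded."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = _BARE_CALL.match(line)
        if match and match.group("name") in MUST_CHECK:
            hits.append((lineno, match.group("name")))
    return hits


# Hypothetical driver fragment: the first call drops its return value,
# the second one stores and checks it.
SAMPLE = """\
static int probe(struct device *dev)
{
    int ret;
    clk_prepare_enable(dev->clk);
    ret = copy_from_user(buf, ubuf, len);
    if (ret)
        return -EFAULT;
    return 0;
}
"""

print(find_unchecked_calls(SAMPLE))
```

A regex like this misses multi-line calls and casts to void, which is one reason an AST-level matcher is the more robust choice for scanning whole kernel trees.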

WukLab @ UCSD — Prof. Yiying Zhang

As a remote research intern at WukLab, I work with Professor Yiying Zhang on two complementary threads at the LLM-systems interface: using LLMs to produce better low-level code, and using systems measurement to understand what really limits LLM serving on commodity edge hardware.

[Figure: LLM kernel generation pipeline]
LLM-driven Low-level Code Optimization
Xuming Huang, advised by Yiying Zhang — WukLab, UC San Diego
2025–present · Ongoing

Building an agent loop that takes a reference kernel, proposes compiler-grade transformations (vectorization, tiling, register-pressure reduction), then verifies each candidate by running it against the original and benchmarking the survivors. The goal: close the gap between LLM-suggested code edits and the kind of optimizations a production compiler engineer would accept.

  • Pipeline: propose → sandbox-execute → differential-test → profile, fully closed-loop.
  • Targets numerical and memory-bound kernels where the search space is too large for hand tuning.
  • Co-designed with the edge-inference workload below so the optimized kernels can be shipped to the same M-series device profile.
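The propose → sandbox-execute → differential-test → profile loop can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy sum-of-squares "kernel", the two hand-written candidates standing in for LLM proposals, and the in-process `try/except` that stands in for a real sandbox (which would isolate candidates in a separate process).

```python
import math
import random
import timeit


def reference_kernel(xs):
    """Reference implementation: naive sum of squares."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


def candidate_ok(xs):
    """Hypothetical proposal that preserves behavior."""
    return float(sum(x * x for x in xs))


def candidate_buggy(xs):
    """Hypothetical proposal with an off-by-one bug (drops the last element)."""
    return float(sum(x * x for x in xs[:-1]))


def differential_test(ref, cand, trials=50, seed=0):
    """Run the candidate against the reference on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-10, 10) for _ in range(rng.randint(1, 64))]
        try:
            got = cand(xs)  # stand-in for sandboxed execution
        except Exception:
            return False
        if not math.isclose(got, ref(xs), rel_tol=1e-9, abs_tol=1e-9):
            return False
    return True


def profile(kernel, xs, repeats=5):
    """Best-of-N wall time for 100 calls, in seconds."""
    return min(timeit.repeat(lambda: kernel(xs), number=100, repeat=repeats))


def run_loop(ref, candidates):
    """Keep only candidates that match the reference, then benchmark the survivors."""
    survivors = {name: fn for name, fn in candidates.items() if differential_test(ref, fn)}
    bench_input = [float(i) for i in range(1024)]
    return {name: profile(fn, bench_input) for name, fn in survivors.items()}


results = run_loop(reference_kernel, {"ok": candidate_ok, "buggy": candidate_buggy})
print(sorted(results))
```

Testing before profiling keeps a fast but wrong candidate from ever reaching the benchmark stage.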

[Figure: Peak RSS and decode throughput vs. scenario]
System Optimization for Edge Device Inference
Xuming Huang, advised by Yiying Zhang — WukLab, UC San Diego
2025–present · Ongoing

A measurement-first study of what actually governs latency for agentic LLMs running entirely on-device. We drive ollama serving qwen2.5 (3B / 7B / 14B at q4_K_M) on a 24 GB M3 MacBook Air through an 11-scenario sweep that walks the model-vs-KV-cache budget, and through an 8-scenario suite that exercises OpenClaw's prefix cache under cold prime, plain append, retrieval-inlined, session growth, and auto-compaction workloads.

  • Memory-budget model: RSS ≈ model_GB + parallel × num_ctx × KV_q8_bytes + 0.3 GB, with a mean absolute error of 10.2 % across 11 scenarios.
  • Found a page-compressor cliff at parallel=3 on 14B / 32k ctx, where observed RSS diverges from the predicted value by 20–28 %.
  • Quantified a 35× prefill spike caused by auto-compaction breaking the ollama prefix cache between two consecutive turns.
  • Showed decode is bandwidth-bound: 3B ≈ 35 tok/s, 7B ≈ 5 tok/s, 14B ≈ 2.5 tok/s, near-constant across context length.
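The memory-budget model above can be written out as a short Python sketch. The function and parameter names are hypothetical, the per-token KV size and all example numbers are placeholders rather than measurements, and two conventions are assumed: 1 GB = 10^9 bytes, and the error is taken relative to the observed value.

```python
def predicted_rss_gb(model_gb, parallel, num_ctx, kv_bytes_per_token, overhead_gb=0.3):
    """RSS ≈ model_GB + parallel × num_ctx × KV_q8_bytes + 0.3 GB.

    kv_bytes_per_token is the q8 KV-cache footprint of one token in bytes;
    it depends on the model architecture and must be supplied by the caller.
    """
    kv_gb = parallel * num_ctx * kv_bytes_per_token / 1e9  # assumes 1 GB = 1e9 bytes
    return model_gb + kv_gb + overhead_gb


def mean_abs_pct_error(predicted, observed):
    """Mean absolute error as a percentage of the observed values (assumed definition)."""
    errors = [abs(p - o) / o for p, o in zip(predicted, observed)]
    return 100.0 * sum(errors) / len(errors)


# Placeholder inputs, not values from the study.
example = predicted_rss_gb(model_gb=9.0, parallel=2, num_ctx=32768, kv_bytes_per_token=100_000)
print(f"predicted RSS: {example:.2f} GB")
```

Because the KV term scales with parallel × num_ctx, doubling either one doubles the cache share of the budget while the model weights stay fixed.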

Current phase: running the full 361-task OSWorld-Verified benchmark with Holo3-35B-A3B, the #1 open-weight computer-use model, served entirely on an Intel Panther Lake NUC (96 GB unified memory, CPU-only llama.cpp, chosen after measuring GPU and NPU bring-up). The setup replicates the official leaderboard configuration (100-step budget, screenshot-only) and adds per-step latency, token, KV-cache, and 5-domain RAPL energy profiling.

  • The OSWorld prefill wall is quadratic in the number of steps, an artifact of in-place screenshot eviction breaking the KV prefix. An append-only, never-evict context turns it linear: a 4–10× prefill cut that is runtime-agnostic. This is affordable because the hybrid model caches only 10 of its 40 layers, so a 262K window costs ~5 GB of KV in the 96 GB pool.
  • The Xe3 iGPU is 2.5× faster at prefill (Vulkan), but full offload is blocked by two localized correctness bugs (248k-vocab output projection, vision encoder crash). --cache-reuse is ruled out as inert and unsound for this hybrid architecture, so CPU stays the fastest correct path.
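The quadratic-versus-linear contrast can be seen in a toy cost model, sketched below in Python. It assumes that once a screenshot is evicted in place, the cached prefix is unusable and the whole live context is prefilled again. All token counts are hypothetical, and the sketch is not meant to reproduce the measured 4–10× figure.

```python
def prefill_tokens_evicting(steps, text_per_step, image_tokens, keep_images):
    """Total prefill tokens when only the last `keep_images` screenshots are kept.

    Assumption: evicting an earlier screenshot invalidates the cached prefix,
    so from that step on the full live context is re-prefilled every step.
    """
    total = 0
    for n in range(1, steps + 1):
        if n <= keep_images:
            # Nothing evicted yet: only the newly appended tokens are prefilled.
            total += text_per_step + image_tokens
        else:
            # Prefix broken: text history grows with n, images are capped.
            total += n * text_per_step + keep_images * image_tokens
    return total


def prefill_tokens_append_only(steps, text_per_step, image_tokens):
    """Total prefill tokens when context is append-only and the prefix always hits."""
    return steps * (text_per_step + image_tokens)


# Hypothetical per-step sizes, chosen only to show the growth rates.
for steps in (25, 50, 100):
    evict = prefill_tokens_evicting(steps, text_per_step=300, image_tokens=1000, keep_images=3)
    append = prefill_tokens_append_only(steps, text_per_step=300, image_tokens=1000)
    print(steps, evict, append)
```

In this model the append-only total doubles when the step count doubles, while the evicting total grows faster than that because every later step pays for the entire text history again.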


KV cache scenarios

KV prefix-cache hit / miss / surprise-miss behavior across 8 OpenClaw workloads. Auto-compaction (scenario 8) drives the longest prefill bar.
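Why auto-compaction hurts can be shown with a minimal Python sketch of prefix matching. It assumes a prefix cache that reuses work only for the longest shared prefix of consecutive prompts. The token IDs are made up, and the helper names are hypothetical rather than part of ollama or OpenClaw.

```python
def common_prefix_len(prev_tokens, curr_tokens):
    """Number of leading tokens shared by two consecutive prompts."""
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(prev_tokens, curr_tokens):
    """Tokens that still need prefill after reusing the shared prefix."""
    return len(curr_tokens) - common_prefix_len(prev_tokens, curr_tokens)


turn_1 = [1, 2, 3, 4, 5]               # made-up token IDs for the first prompt
plain_append = [1, 2, 3, 4, 5, 6, 7]   # next turn appends two tokens
compacted = [9, 9, 5, 6, 7]            # early history rewritten into a summary

print(tokens_to_prefill(turn_1, plain_append))
print(tokens_to_prefill(turn_1, compacted))
```

A plain append pays only for the new tokens, whereas rewriting the start of the context leaves no shared prefix, so the whole compacted prompt is prefilled from scratch.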

Append vs prefill trend

Append-token count vs. next call's prefill latency across 39 partial-hit data points — the cache-miss event sits ~24× off the trend.

Research Projects

Representative projects are highlighted. See also my Google Scholar profile.

