From GPU kernels
to autonomous agents.
M.S. ECE. I optimize CUDA & Triton kernels and LLM inference at the bottom of the stack, and build agent infrastructure, retrieval, and evaluation on top.
I work the whole stack.
From hand-written GPU kernels at the bottom to multi-agent systems at the top. Working every layer is how I find the one that owns the latency.
Autonomous systems that build software.
An autonomous ML research lab and HIVE, a multi-agent org, ship from pre-registered plans. I built their orchestration and anti-forgery sign-off governance.
Memory that beats the context wall.
bert gives agents a searchable memory of a whole codebase. Hybrid retrieval, fully local, built for projects that outgrow a 1M-token window.
LLM inference, profiled and tuned.
Attention and KV-cache decode profiled on edge silicon: fused SDPA (FlashAttention backend) vs naive attention. 15.4K tok/s on a single-stream KV-cache decode microbenchmark, Jetson Orin Nano.
Down to the metal.
Hand-written Triton kernels, GEMM fusion, removing HBM round-trips. Up to 1.73× lower latency than the unfused path on memory-bound FFN. The bare autotuned GEMM peaks at 213 TFLOP/s on an A100.
bert
A local MCP server that gives AI coding assistants a searchable memory of an entire codebase. Past the context window, full-context stuffing fails and truncation drops the answer. Retrieval reads only the slice that matters.
A 3M-token codebase does not fit in a 1M-token window.
At 3M tokens the project is 3× the window, so full-context is not an option. Truncate to fit and you drop the file the answer lives in. Accuracy collapses toward 0.00.
Hybrid retrieval, fully local.
Dense vectors (sqlite-vec + bge-base-en-v1.5) and BM25 keyword search, fused by reciprocal-rank fusion, reranked by a cross-encoder. No API keys, no LLM calls. The host model only sees the slice that matters.
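A minimal sketch of the fusion step, assuming the commonly used RRF constant k=60 and made-up file names as document IDs. It illustrates reciprocal-rank fusion in general and is not bert's actual code.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ranked lists of doc IDs into one list.

    k is the RRF smoothing constant; 60 is a common default (an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["auth.py", "session.py", "db.py"]     # hypothetical dense ranking
bm25_hits = ["session.py", "routes.py", "auth.py"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

A document that both retrievers rank highly (here session.py) rises to the top, which is the point of fusing before the cross-encoder rerank.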
Accuracy holds as the codebase grows.
When the project fits in the window, full-context stuffing wins outright. Past it, bert retrieves only the relevant slice, and accuracy eases from 0.85 to 0.75 at 3M tokens on a span-validated gold set while reading 4.6× fewer input tokens than truncation.
A silent bug had pinned it at near-random.
A dict-key mismatch zeroed the dense signal and a 240-char truncation dropped answer spans. Once both were root-caused and fixed, accuracy went from 0.10 to 0.85.
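A small illustration of how this class of bug stays silent. The key names and scores below are hypothetical, not the ones in bert. A lookup with a default value never raises, so a wrong key quietly turns every dense score into zero.

```python
def dense_scores(hits, key):
    # .get() with a default hides a wrong key: no KeyError, just 0.0 everywhere.
    return [hit.get(key, 0.0) for hit in hits]

# Hypothetical retriever output.
hits = [{"id": "a.py", "score": 0.91}, {"id": "b.py", "score": 0.42}]

buggy = dense_scores(hits, "dense_score")  # key the consumer assumed
fixed = dense_scores(hits, "score")        # key the producer actually wrote
```

With the wrong key the dense half of the fusion contributes nothing, and the pipeline still returns results, which is why a bug like this can go unnoticed.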
0.745 nDCG@10 on a public IR benchmark.
Across three BEIR datasets (scifact, nfcorpus, fiqa) bge-base matches the published reference on all three. On scifact the full stack (bge-base-en-v1.5 dense vectors plus BM25, reranked by a cross-encoder) scores 0.745 against a published BM25 baseline of 0.665, on par with bge-base-en-v1.5's own published scifact result (0.741). A public benchmark on public datasets, not a self-defined one.
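For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10, assuming linear gain and a log2 rank discount. The relevance labels are invented for illustration and are not taken from scifact.

```python
import math

def dcg_at_k(relevances, k=10):
    # Linear gain with a log2 discount: rank 1 divides by log2(2) = 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """ranked_relevances: labels in the order the system returned them.
    all_relevances: labels of every judged-relevant doc for the query."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical query: two relevant docs, returned at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], [1, 1])
```

A system that places both relevant documents at ranks 1 and 2 would score 1.0. Pushing them down the list lowers the score.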
Triton vs cuBLAS
Where does fusing the epilogue beat the standard unfused path (a cuBLAS GEMM plus separate bias and GeLU kernels)? I built the benchmark to find out, across 76 LLM-shaped GEMMs on an A100.
on small-batch FFN
Small-batch FFN projections are memory-bound.
Linear, then bias, then GeLU: three passes, each writing to HBM and relaunching. On small batches the GPU spends more time moving data than computing.
Fuse the three into one Triton kernel.
linear + bias + GeLU in a single launch, keeping the intermediate in registers. Two HBM round-trips and two kernel launches, gone.
Up to 1.73× lower latency.
On the best small-batch FFN shape, M=128, N=11008, K=4096. The bare autotuned Triton GEMM peaks at 213 TFLOP/s, 68% of the A100's fp16 tensor-core peak.
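A back-of-envelope check of these numbers, as a sketch under stated assumptions: fp16 operands, no cache reuse across kernels, the A100's 312 TFLOP/s dense fp16 tensor-core peak, and 1.555 TB/s HBM bandwidth (the 40 GB variant, which the text does not specify).

```python
M, N, K = 128, 11008, 4096  # best small-batch FFN shape from the text
BYTES = 2                   # fp16

flops = 2 * M * N * K       # one multiply and one add per MAC

# Fused: read activations, weights, and bias once; write the output once.
fused_bytes = BYTES * (M * K + K * N + N + M * N)
# Unfused: the bias and GeLU kernels each re-read and re-write the M x N output.
unfused_bytes = fused_bytes + BYTES * (4 * M * N)

PEAK_FP16 = 312e12          # A100 dense fp16 tensor-core peak, FLOP/s
HBM_BW = 1.555e12           # assumption: 40 GB A100, bytes/s

intensity = flops / fused_bytes     # FLOP per byte moved
ridge = PEAK_FP16 / HBM_BW          # roofline ridge point
peak_fraction = 213e12 / PEAK_FP16  # share of peak reached by the bare GEMM
```

Under these assumptions the shape sits below the roofline ridge point, consistent with calling it memory-bound, and 213 TFLOP/s works out to about 68% of peak. The model ignores launch overhead, so it does not by itself predict the measured 1.73×.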
Where custom kernels pay off.
I profiled all 76 shapes on p50/p90 latency, throughput, bandwidth, and jitter to find the line where a fused kernel beats cuBLAS and where the vendor library already wins. Small, skinny GEMMs favor fusion. Mapping that boundary is the result.
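A minimal sketch of how per-shape latency samples can be reduced to p50, p90, and jitter. The jitter definition (p90 minus p50) and the sample values are assumptions for illustration, since the text does not define them.

```python
import statistics

def summarize_latency(samples_ms):
    """Reduce raw latency samples (milliseconds) to p50, p90, and jitter."""
    # quantiles(n=10) returns nine decile cut points; index 8 is the 90th percentile.
    deciles = statistics.quantiles(samples_ms, n=10, method="inclusive")
    p50 = statistics.median(samples_ms)
    p90 = deciles[8]
    # Assumption: jitter is reported as the p90 - p50 spread.
    return {"p50": p50, "p90": p90, "jitter": p90 - p50}

# Hypothetical samples for one GEMM shape.
summary = summarize_latency([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0])
```

Comparing the fused kernel and cuBLAS on the same summary for every shape is one way to locate where one starts to beat the other.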
Selected work
About
I keep the results that disagree with me. Writing CUDA and Triton kernels taught me which operations are memory-bound and which are compute-bound. Building the layers on top, LLM inference, retrieval, and multi-agent systems, taught me which of those bottlenecks the real workload actually cares about. I measure all of it end to end.