Open to full-time roles

AI systems · silicon to agents

From GPU kernels
to autonomous agents.

M.S. ECE. I optimize CUDA & Triton kernels and LLM inference at the bottom of the stack, and build agent infrastructure, retrieval, and evaluation on top.

Explore the stack → · Résumé

M.S. ECE · UW-Madison '25 · Milwaukee, WI · Targeting GPU · Inference · AI-Infra

Fig.01 · The Stack · Silicon → Agents

Agents & Orchestration

multi-agent · MCP · governance

31 autonomous agents

Retrieval & Memory

sqlite-vec · BM25 · RRF · rerank

0.745 nDCG@10 · vs 0.665 BM25

LLM Inference

FlashAttention · KV-cache · Nsight

15.4K tok/s · Jetson · single-stream decode

GPU Kernels · Silicon

CUDA · Triton · GEMM fusion

213 TFLOP/s · Triton

The whole column

I work the whole stack.

From hand-written GPU kernels at the bottom to multi-agent systems at the top. Working every layer is how I find the one that owns the latency.

04 · Agents

Autonomous systems that build software.

An autonomous ML research lab and HIVE, a multi-agent org, ship from pre-registered plans. I built their orchestration and anti-forgery sign-off governance.

03 · Retrieval

Memory that beats the context wall.

bert gives agents a searchable memory of a whole project. Hybrid retrieval, fully local, built for projects that outgrow the context window.

02 · Inference

LLM inference, profiled and tuned.

Attention and KV-cache decode profiled on edge silicon: fused SDPA (FlashAttention backend) vs naive attention. 15.4K tok/s on a single-stream KV-cache decode microbenchmark, Jetson Orin Nano.

01 · Kernels

Down to the metal.

Hand-written Triton kernels, GEMM fusion, removing HBM round-trips. Up to 1.73× lower latency than the unfused path on memory-bound FFN. The bare autotuned GEMM peaks at 213 TFLOP/s on an A100.

Layer 03 · Retrieval & Memory · Case 01

bert

A local MCP server that gives a coding agent searchable memory of a whole project. Once the project outgrows the context window, stuffing everything in fails and truncation drops the answer. bert pulls back only the slice that holds it.

View the repo →

Accuracy vs. project-memory size · one Claude reader, only the method varies

0.90 answer accuracy past a 200K window, one Claude reader

The problem

A project's memory outgrows the model's window.

A mature project's decisions, designs, and post-mortems run to millions of tokens, beyond even a 1M window. Stuff it all in and the request fails. Keep only the recent slice and you lose the fact you need. Accuracy falls apart either way.

The build

Hybrid retrieval, fully local.

sqlite-vec stores bge-base-en-v1.5 dense vectors, BM25 covers exact keywords, reciprocal-rank fusion merges the two, and a cross-encoder reranks the top hits. No API keys, no model calls. The host sees only the slice that matters.
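A minimal sketch of the reciprocal-rank fusion step, assuming each retriever returns doc ids ordered best-first. The function name, the toy doc ids, and the constant k=60 (the value commonly used for RRF) are illustrative assumptions, not bert's actual code or settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first doc-id lists into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed default, not a value taken from bert.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical hits from the dense and keyword retrievers.
dense_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, and a cross-encoder would then rescore only the head of this fused list.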

The result

Memory holds where stuffing the context collapses.

One reader (Claude Sonnet, 200K window), one variable: how it gets the context. On a 1.26M-token project memory, about 6× that window, full-context drops to 0.08 while bert holds 0.90. bert matches an agent grepping the raw files and spends half the tokens to do it, and it beats a plain vector lookup by 0.50. The honest caveat: on source code, where any file re-reads cheaply, grep wins. bert is built for accumulated memory, not code search.

The debugging win

A silent bug had pinned it at near-random.

0.10 → 0.85

A dict-key mismatch zeroed the dense signal, and a 240-character cap chopped the answer spans, so the hybrid path was really keyword-only. I traced it, fixed it against the shipped retriever, and the held-out eval jumped from 0.10 to 0.85.

Validated

0.745 nDCG@10 on a public IR benchmark.

Three public BEIR datasets: scifact, nfcorpus, fiqa. bge-base matches its published reference on all three. On scifact the full stack scores 0.745, past the published BM25 baseline of 0.665 and level with bge-base's published mark of 0.741. Public data, public metric, nothing I defined myself.
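For readers unfamiliar with the metric, here is a small sketch of nDCG@10 in its linear-gain form. The function names and the toy relevance labels are assumptions for illustration; the numbers on this page come from the benchmark harness, not from this snippet.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels, all_rels, k=10):
    """DCG of the returned ranking divided by the DCG of the ideal ranking.

    ranked_rels: relevance labels of the results, in returned order.
    all_rels: labels of every judged-relevant doc for the query.
    """
    ideal = dcg_at_k(sorted(all_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# Toy query: two relevant docs, retrieved at positions 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], [1, 1])
print(round(score, 4))
```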

Interface

MCP

Local stdio server, JSON-RPC 2.0. Drops into Claude Code, Cursor, Codex.
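As a rough illustration of what travels over that stdio pipe, the sketch below builds one JSON-RPC 2.0 request using MCP's `tools/call` method. The tool name `search_memory` and its arguments are hypothetical placeholders, not bert's actual tool schema.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Serialize one MCP tool call as a newline-terminated JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # The stdio transport sends one JSON message per line.
    return json.dumps(message) + "\n"


# `search_memory` and its parameters are assumed names for illustration only.
line = make_tool_call(1, "search_memory", {"query": "why did we drop the cache layer?", "top_k": 5})
print(line, end="")
```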

BEIR scifact

0.745

nDCG@10, bge-base + BM25 + cross-encoder rerank. +0.080 over BM25 (0.665).

Over-window

0.90

Accuracy on a 1.26M-token project memory (~6× a 200K reader window), where full-context falls to 0.08.

vs. baselines

+0.50

Beats naive vector-RAG by 0.50, and matches an agent grepping the files at half the token cost.

Layer 01 · GPU Kernels · Case 02

Triton vs cuBLAS

Where does fusing the epilogue beat the standard unfused path (a cuBLAS GEMM plus separate bias and GeLU kernels)? I built the benchmark to find out, across 76 LLM-shaped GEMMs on an A100.

View the repo →

three passes, three HBM round-trips · A100 · fp16

linear → HBM

bias → HBM

GeLU → HBM

3 kernel launches, 3 round-trips to global memory. On small batches the GPU moves more data than it computes.

linear + bias + GeLU · one fused kernel

One launch. The intermediate stays in registers. Two HBM round-trips and two kernel launches, gone.

1.73×

vs the unfused path on small-batch FFN

unfused

1.00×

Triton fused

1.73×

76 shapes profiled. Median speedup is 0.96× (near parity), and the fused kernel wins on 27 shapes. Fusion pays off on small-batch, memory-bound FFN, the latency-bound decode regime. At large batch the GEMM goes compute-bound and the vendor path wins.

Bare autotuned GEMM peaks at 213 TFLOP/s, 68% of the A100's fp16 tensor-core peak (cuBLAS reaches ~82% on the same shapes, so the win is fusion, not raw GEMM throughput). p50 of CUDA-event-timed runs, fp16. Baseline is the unfused path: a cuBLAS GEMM plus separate bias and GeLU, PyTorch eager (2.1+).
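The throughput and percent-of-peak figures follow from simple arithmetic, sketched below. The 312 TFLOP/s constant is the A100's dense fp16 tensor-core peak; the GEMM shape and the 1 ms timing in the example are made-up inputs, not measurements from the benchmark.

```python
A100_FP16_TENSOR_PEAK_TFLOPS = 312.0  # dense fp16 tensor-core peak, no sparsity


def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOP/s for an (m x k) @ (k x n) GEMM: 2*m*n*k FLOPs over the elapsed time."""
    return 2.0 * m * n * k / seconds / 1e12


def fraction_of_peak(tflops, peak=A100_FP16_TENSOR_PEAK_TFLOPS):
    """Share of the hardware peak that a measured throughput represents."""
    return tflops / peak


# Hypothetical timing: a 4096^3 GEMM finishing in 1 ms.
example = gemm_tflops(4096, 4096, 4096, 1e-3)
print(f"{example:.1f} TFLOP/s")

# The 213 TFLOP/s peak quoted above, as a share of the A100 fp16 peak.
print(f"{fraction_of_peak(213.0):.0%}")
```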

The problem

Small-batch FFN projections are memory-bound.

Linear, then bias, then GeLU: three passes, each writing to HBM and relaunching. On small batches the GPU spends more time moving data than computing.

The build

Fuse the three into one Triton kernel.

linear + bias + GeLU in a single launch, keeping the intermediate in registers. Two HBM round-trips and two kernel launches, gone.

The result

Up to 1.73× lower latency.

1.73×

On the best small-batch FFN shape, M=128, N=11008, K=4096. The bare autotuned Triton GEMM peaks at 213 TFLOP/s, 68% of the A100's fp16 tensor-core peak.
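A back-of-envelope roofline check for that shape, as a sketch. It assumes ideal reuse (each operand read from HBM once, output written once) and uses public A100 spec numbers for both memory variants, since the page does not say which card was used.

```python
BYTES_FP16 = 2
M, N, K = 128, 11008, 4096  # the best small-batch FFN shape quoted above

flops = 2 * M * N * K

# Fused-kernel traffic under ideal reuse (an assumption):
# read activations (M*K), weights (K*N), bias (N); write the output (M*N) once.
bytes_moved = BYTES_FP16 * (M * K + K * N + N + M * N)
intensity = flops / bytes_moved  # FLOPs per byte of HBM traffic

PEAK_FLOPS = 312e12  # A100 dense fp16 tensor-core peak
HBM_BANDWIDTH = {"A100-40GB": 1.555e12, "A100-80GB": 2.039e12}  # bytes/s, spec values

for name, bandwidth in HBM_BANDWIDTH.items():
    ridge = PEAK_FLOPS / bandwidth  # intensity needed to become compute-bound
    regime = "memory-bound" if intensity < ridge else "compute-bound"
    print(f"{name}: intensity {intensity:.1f} vs ridge {ridge:.1f} FLOP/B -> {regime}")
```

Under these assumptions the shape sits below the ridge point on both variants, which is consistent with the memory-bound framing above. Real kernels re-read tiles, so the achieved intensity is lower still.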

The finding

Where custom kernels pay off.

I profiled all 76 shapes on p50/p90 latency, throughput, bandwidth, and jitter to find the line where a fused kernel beats cuBLAS and where the vendor library already wins. Small, skinny GEMMs favor fusion. Mapping that boundary is the result.

Shapes profiled

76

LLM-shaped GEMMs across Triton and cuBLAS on an A100.

Peak throughput

213

TFLOP/s, 68% of fp16 tensor-core peak, bare autotuned GEMM.

Best speedup

1.73×

on small-batch FFN. Median across 76 shapes is 0.96×, near parity, with 27 wins.

Tooling

Nsight

CUDA event timing, roofline, tail-jitter analysis.

Selected work

LLM Inference · Edge · Featured

Jetson Orin Nano profiling

Attention and KV-cache decode kernels profiled with Nsight on a 15W edge SoC. Fused SDPA measured 5.6× over naive attention. Single-stream KV-cache decode held ~15.4K tok/s on synthetic shapes before the cache-growth knee.

5.6× fused SDPA vs naive

Autonomous Pipeline · Featured

The Obsidian Archive

A 13-stage pipeline of 15 specialized agents that researches, scripts, renders, and uploads documentaries to a live YouTube channel, with every video gated by per-stage quality checks, fact verification, and automatic rewrites of weak sections. Deployed on Railway with a Supabase backend.

13 stages · 15 agents

Autonomous Lab

31-agent ML research lab

Agents write a from-scratch MoE transformer in C17: hand-written tensors, backprop, 4-bit QAT, plus Apple Metal kernels wired in. No PyTorch.

C17 · MoE + 4-bit QAT

Autonomous Org

HIVE

A privacy-first peer-to-peer reasoning engine in Rust: Rete-II + JTMS, local differential privacy, libp2p mesh with Dandelion++ anonymity. Built by a 45-role agent org I designed and run.

Rust · 1,300+ tests

CUDA Autotuning

ML-guided kernel config

A PyTorch MLP surrogate predicts kernel runtime and replaces exhaustive autotuning with argmin-over-grid.
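The argmin-over-grid step can be sketched without the model itself. Below, `toy_surrogate` is a stand-in for the trained MLP, and the parameter names (`block_m`, `block_n`, `num_warps`) and grid values are assumptions chosen for illustration.

```python
from itertools import product


def best_config(predict_ms, grid):
    """Return the config with the lowest predicted runtime.

    predict_ms: callable mapping a config dict to predicted milliseconds.
    grid: dict of parameter name -> list of candidate values.
    """
    names = list(grid)
    candidates = [dict(zip(names, values)) for values in product(*grid.values())]
    # One cheap surrogate call per candidate replaces one timed kernel launch each.
    return min(candidates, key=predict_ms)


def toy_surrogate(cfg):
    """Hypothetical stand-in for the trained MLP; not a real runtime model."""
    return (
        abs(cfg["block_m"] - 64) / 64
        + abs(cfg["block_n"] - 128) / 128
        + 0.01 * cfg["num_warps"]
    )


grid = {"block_m": [32, 64, 128], "block_n": [64, 128, 256], "num_warps": [4, 8]}
best = best_config(toy_surrogate, grid)
print(best)
```

The trade is accuracy for search cost: the grid is still enumerated, but each point costs a forward pass instead of a benchmark run.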

R² = 0.96 · MAE 0.018 ms

Memory Systems

MI300X + gem5 modeling

Characterized the AMD MI300X memory hierarchy (UW-Madison course project). Parameter tuning cut gem5's vL1D latency MAPE from 89% to 5%, and pinned down the structural gaps tuning alone cannot fix.
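MAPE here is mean absolute percentage error between hardware measurements and simulator output. A minimal sketch, with made-up latency values that are not from the project:

```python
def mape(measured, simulated):
    """Mean absolute percentage error of simulated values against measured ones, in percent."""
    pairs = list(zip(measured, simulated))
    return 100.0 * sum(abs(s - m) / abs(m) for m, s in pairs) / len(pairs)


# Hypothetical latencies in cycles: hardware vs simulator.
hardware = [100.0, 200.0]
simulator = [110.0, 180.0]
error = mape(hardware, simulator)
print(f"{error:.1f}%")
```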

89% → 5% MAPE · 4.6 to 4.8 TB/s write

About

I keep the results that disagree with me. Writing CUDA and Triton kernels taught me which operations are memory-bound and which are compute-bound. Building the layers on top, LLM inference, retrieval, and multi-agent systems, taught me which of those bottlenecks the real workload actually cares about. I measure all of it end to end.

Kernels to agents

CUDA and Triton GEMMs at the bottom, retrieval and multi-agent systems at the top. Every layer on this page has a project behind it.

Everything is benchmarked

Each number here traces to a benchmark in the repo. bert grades its own retrieval and logs the runs where it loses.

Runs on one laptop

The from-scratch C17 training engine and the autonomous labs all run on a single 18 GB MacBook. No cloud.

Writing

May 2026 · Report

Below the Measurement Floor: a pre-registered dense-vs-MoE null

Pre-registered compute-matched comparison. Every cell lands below the measurement floor, and the null is the result.

May 2026 · Report

Recursive Verification-Surface Collapse

The null-set principal problem in self-grading labs, and three structural fixes, from 100+ autonomous cycles.

May 2026 · Note

When the LLM judge is biased

Same-family preference leakage when four Claude instances grade each other, tested against an ICLR 2026 result.

May 2026 · Note

Tier-per-task: routing models by the job

Why model tiering belongs at the task level, not the role level.