Open to full-time roles
AI systems · silicon to agents

From GPU kernels
to autonomous agents.

M.S. ECE. I optimize CUDA & Triton kernels and LLM inference at the bottom of the stack, and build agent infrastructure, retrieval, and evaluation on top.

M.S. ECE · UW-Madison '25 · Milwaukee, WI · Targeting GPU · Inference · AI-Infra
Fig.01 · The Stack · Silicon → Agents
04
Agents & Orchestration
multi-agent · MCP · governance
31 autonomous agents
03
Retrieval & Memory
sqlite-vec · BM25 · RRF · rerank
0.745 nDCG@10 · vs 0.665 BM25
02
LLM Inference
FlashAttention · KV-cache · Nsight
15.4K tok/s · Jetson · single-stream decode
01
GPU Kernels · Silicon
CUDA · Triton · GEMM fusion
213 TFLOP/s · Triton
The whole column

I work the whole stack.

From hand-written GPU kernels at the bottom to multi-agent systems at the top. Working every layer is how I find the one that owns the latency.

04 · Agents

Autonomous systems that build software.

An autonomous ML research lab and HIVE, a multi-agent org, ship from pre-registered plans. I built their orchestration and anti-forgery sign-off governance.

03 · Retrieval

Memory that beats the context wall.

bert gives agents a searchable memory of a whole codebase. Hybrid retrieval, fully local, built for projects that outgrow a 1M-token window.

02 · Inference

LLM inference, profiled and tuned.

Attention and KV-cache decode profiled on edge silicon: fused SDPA (FlashAttention backend) vs naive attention. 15.4K tok/s on a single-stream KV-cache decode microbenchmark, Jetson Orin Nano.

01 · Kernels

Down to the metal.

Hand-written Triton kernels, GEMM fusion, removing HBM round-trips. Up to 1.73× lower latency than the unfused path on memory-bound FFN. The bare autotuned GEMM peaks at 213 TFLOP/s on an A100.

Layer 03 · Retrieval & Memory · Case 01

bert

A local MCP server that gives AI coding assistants a searchable memory of an entire codebase. Past the context window, full-context stuffing fails and truncation drops the answer. Retrieval reads only the slice that matters.

View the repo →
Figure: Accuracy vs. codebase size · httpx + starlette corpus (padded to 3M). x-axis: codebase size in tokens, log scale (10K to 3M), with the 1M window marked. y-axis: accuracy (0 to 1.0). bert holds 0.85 and reaches 0.75 at 3M; truncation falls toward 0.
0.85 answer accuracy · span-validated gold set
The problem

A 3M-token codebase does not fit in a 1M-token window.

At 3M tokens the project is 3× the window, so full-context is not an option. Truncate to fit and you drop the file the answer lives in. Accuracy collapses toward 0.00.

The build

Hybrid retrieval, fully local.

Dense vectors (sqlite-vec + bge-base-en-v1.5) and BM25 keyword search, fused by reciprocal-rank fusion, reranked by a cross-encoder. No API keys, no LLM calls. The host model only sees the slice that matters.
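
The fusion step can be sketched in a few lines. This is a minimal illustration of reciprocal-rank fusion, not bert's actual code: the function name, the file names, and the constant k=60 (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids.

    Each doc scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not bert's setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical hits from the dense index and from BM25.
dense_hits = ["auth.py", "client.py", "models.py"]
bm25_hits = ["client.py", "utils.py", "auth.py"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the dense and keyword retrievers. A cross-encoder would then rerank the top of the fused list.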

The result

Accuracy holds as the codebase grows.

When the project fits in the window, full-context stuffing wins outright. Past it, bert retrieves only the relevant slice: accuracy holds at 0.85 and eases to 0.75 at 3M tokens on a span-validated gold set, while reading 4.6× fewer input tokens than truncation.

The debugging win

A silent bug had pinned it at near-random.

0.10 → 0.85

A dict-key mismatch zeroed the dense signal and a 240-char truncation dropped answer spans. Root-caused and fixed, accuracy went from 0.10 to 0.85.

Validated

0.745 nDCG@10 on a public IR benchmark.

On three BEIR datasets (scifact, nfcorpus, fiqa), bge-base matches the published reference. On scifact the full stack (bge-base-en-v1.5 dense vectors plus BM25, reranked by a cross-encoder) scores 0.745 against a published BM25 baseline of 0.665, on par with bge-base-en-v1.5's own published scifact result (0.741). This is a public benchmark on public datasets, not a self-defined one.
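
For readers unfamiliar with the metric, here is a minimal sketch of nDCG@10. It assumes linear gain with a log2 rank discount; some implementations use an exponential gain instead, so this is not necessarily the exact scorer behind the numbers above. The function names and the toy relevance labels are illustrative.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance of each returned doc, in rank order.
    all_relevances: relevance of every judged doc for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy query: two relevant docs exist, returned at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1], all_relevances=[1, 1])
print(round(score, 3))
```

The benchmark figure is the mean of this per-query score over all test queries.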

Interface
MCP
Local stdio server, JSON-RPC 2.0. Drops into Claude Code, Cursor, Codex.
BEIR scifact
0.745
nDCG@10, bge-base + BM25 + cross-encoder rerank. +0.080 over BM25 (0.665).
Over-window
0.75
Accuracy on a blind-authored, span-validated set (small n) at 3M tokens, where full-context is infeasible and truncation is 0.00.
Cost
~3.3K
Input tokens per query. ~4.6× fewer than truncation, higher accuracy.
Layer 01 · GPU Kernels · Case 02

Triton vs cuBLAS

Where does fusing the epilogue beat the standard unfused path (a cuBLAS GEMM plus separate bias and GeLU kernels)? I built the benchmark to find out, across 76 LLM-shaped GEMMs on an A100.

View the repo →
Three passes, three HBM round-trips · A100 · fp16
linear → HBM
bias → HBM
GeLU → HBM
3 kernel launches, 3 round-trips to global memory. On small batches the GPU moves more data than it computes.
linear + bias + GeLU · one fused kernel
One launch. The intermediate stays in registers. Two HBM round-trips and two kernel launches, gone.
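1.73× vs the unfused path on small-batch FFN (unfused 1.00×, Triton fused 1.73×)
Figure: speedup over the unfused path vs. batch size M (128, 256, 512, 1024, 2048). y-axis: speedup from 0.6× to 1.73×, with parity at 1.0×; above parity is faster.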
76 shapes profiled. Median 0.96× (near parity), 27 shapes win. Fusion pays off on small-batch, memory-bound FFN, the latency-bound decode regime. At large batch the GEMM goes compute-bound and the vendor path wins.
Bare autotuned GEMM peaks at 213 TFLOP/s, 68% of the A100's fp16 tensor-core peak (cuBLAS reaches ~82% on the same shapes, so the win is fusion, not raw GEMM throughput). p50 of CUDA-event-timed runs, fp16. Baseline is the unfused path: a cuBLAS GEMM plus separate bias and GeLU, PyTorch eager (2.1+).
The problem

Small-batch FFN projections are memory-bound.

Linear, then bias, then GeLU: three passes, each writing to HBM and relaunching. On small batches the GPU spends more time moving data than computing.
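
A back-of-envelope roofline check shows why the small-batch case is memory-bound. This is a sketch under stated assumptions: it counts each operand and the output once (an idealized traffic model), takes 312 TFLOP/s as the A100's fp16 tensor-core peak, and assumes the 40 GB part's HBM bandwidth of about 1.555 TB/s. The function names are illustrative, and real crossover points will differ from this idealized model.

```python
# Idealized roofline check for an fp16 GEMM of shape (M, K) x (K, N).
A100_FP16_PEAK_FLOPS = 312e12        # dense fp16 tensor-core peak
ASSUMED_HBM_BYTES_PER_S = 1.555e12   # assumption: A100 40 GB; the 80 GB part is ~2.0e12

def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte, counting each input and the output exactly once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def is_memory_bound(m, n, k):
    """True when intensity falls below the roofline ridge point."""
    ridge = A100_FP16_PEAK_FLOPS / ASSUMED_HBM_BYTES_PER_S
    return arithmetic_intensity(m, n, k) < ridge

# FFN projection with N=11008, K=4096 at a small and a large batch size.
for m in (128, 2048):
    ai = arithmetic_intensity(m, 11008, 4096)
    label = "memory-bound" if is_memory_bound(m, 11008, 4096) else "compute-bound"
    print(f"M={m}: {ai:.0f} FLOP/byte, {label}")

# 213 TFLOP/s as a fraction of the fp16 peak.
fraction_of_peak = 213e12 / A100_FP16_PEAK_FLOPS
```

Under these assumptions M=128 sits below the ridge point and M=2048 well above it. The separate bias and GeLU passes add further reads and writes of the M×N output on top of this, which is the traffic that fusion removes.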

The build

Fuse the three into one Triton kernel.

linear + bias + GeLU in a single launch, keeping the intermediate in registers. Two HBM round-trips and two kernel launches, gone.

The result

Up to 1.73× lower latency.

1.73×

On the best small-batch FFN shape, M=128, N=11008, K=4096. The bare autotuned Triton GEMM peaks at 213 TFLOP/s, 68% of the A100's fp16 tensor-core peak.

The finding

Where custom kernels pay off.

I profiled all 76 shapes on p50/p90 latency, throughput, bandwidth, and jitter to find the line where a fused kernel beats cuBLAS and where the vendor library already wins. Small, skinny GEMMs favor fusion. Mapping that boundary is the result.

Shapes profiled
76
LLM-shaped GEMMs across Triton and cuBLAS on an A100.
Peak throughput
213
TFLOP/s, 68% of fp16 tensor-core peak, bare autotuned GEMM.
Best speedup
1.73×
on small-batch FFN. Median across 76 shapes is 0.96×, parity, with 27 wins.
Tooling
Nsight
CUDA event timing, roofline, tail-jitter analysis.
02

Selected work

LLM Inference · Edge · Featured
Jetson Orin Nano profiling
Attention and KV-cache decode kernels profiled with Nsight on a 15W edge SoC. Fused SDPA measured 5.6× over naive attention. Single-stream KV-cache decode held ~15.4K tok/s on synthetic shapes before the cache-growth knee.
5.6× · fused SDPA vs naive
Autonomous Pipeline · Featured
The Obsidian Archive
A 13-stage pipeline of 15 specialized agents that researches, scripts, renders, and uploads documentaries to a live YouTube channel, with every video gated by per-stage quality checks, fact verification, and automatic rewrites of weak sections. Deployed on Railway with a Supabase backend.
13 stages · 15 agents
Autonomous Lab
31-agent ML research lab
Agents write a from-scratch MoE transformer in C17: hand-written tensors, backprop, 4-bit QAT, plus Apple Metal kernels wired in. No PyTorch.
C17 · MoE + 4-bit QAT
Autonomous Org
HIVE
A privacy-first peer-to-peer reasoning engine in Rust: Rete-II + JTMS, local differential privacy, libp2p mesh with Dandelion++ anonymity. Built by a 45-role agent org I designed and run.
Rust · 1,300+ tests
CUDA Autotuning
ML-guided kernel config
A PyTorch MLP surrogate predicts kernel runtime and replaces exhaustive autotuning with argmin-over-grid.
R² = 0.96 · MAE 0.018 ms
Memory Systems
MI300X + gem5 modeling
Characterized the AMD MI300X memory hierarchy (UW-Madison course project). Parameter tuning cut gem5's vL1D latency MAPE from 89% to 5%, and pinned the structural gaps tuning alone cannot fix.
89% → 5% MAPE · 4.6 to 4.8 TB/s write
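
The argmin-over-grid idea from the ML-guided kernel config project above can be sketched as follows. Everything here is hypothetical: the config fields (block_m, block_n, num_warps), the grid values, and the toy cost function that stands in for the trained MLP surrogate. Only the selection pattern is the point.

```python
from itertools import product

def predicted_runtime_ms(block_m, block_n, num_warps):
    """Hypothetical stand-in for the trained surrogate model.

    A real surrogate would be a learned regressor; this toy function
    simply has its minimum at (128, 64, 4) so the example is checkable.
    """
    return (0.1
            + abs(block_m - 128) / 128
            + abs(block_n - 64) / 64
            + abs(num_warps - 4) / 4)

def best_config(grid, predict):
    """Pick the config with the lowest predicted runtime, with no kernel launches."""
    return min(grid, key=lambda cfg: predict(*cfg))

# Hypothetical search grid of (block_m, block_n, num_warps).
grid = list(product((32, 64, 128, 256), (32, 64, 128), (2, 4, 8)))

best = best_config(grid, predicted_runtime_ms)
print(best, f"{predicted_runtime_ms(*best):.3f} ms predicted")
```

Exhaustive autotuning would time every grid point on the GPU; with a surrogate, each candidate costs only a model evaluation.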
03

About

I keep the results that disagree with me. Writing CUDA and Triton kernels taught me which operations are memory-bound and which are compute-bound. Building the layers on top (LLM inference, retrieval, and multi-agent systems) taught me which of those bottlenecks the real workload actually cares about. I measure all of it end to end.

01
Kernels to agents
CUDA and Triton GEMMs at the bottom, retrieval and multi-agent systems at the top. Every layer on this page has a project behind it.
02
Everything is benchmarked
Each number here traces to a benchmark in the repo. bert grades its own retrieval and logs the runs where it loses.
03
Runs on one laptop
The from-scratch C17 training engine and the autonomous labs all run on a single 18 GB MacBook. No cloud.
04

Writing