Caleb Yenusah - HPC Engineer | Technical insights on high-performance computing, computational science, and performance optimization across scientific and AI/ML applications.

Posts

May 15, 2026
Flash Attention From Scratch: 7 Kernels to 187 TFLOPS on A100
Mar 20, 2026
Tensor Core HGEMM: Dropping to PTX mma.sync
Mar 2, 2026
Tensor Core HGEMM: A Progressive Optimization Guide Using WMMA
Feb 23, 2026
CUDA Matrix Multiply: From Naive Baseline to Near-cuBLAS Performance
Jan 22, 2026
GPU Histogram: From Global Atomics to Shared Memory Privatization
Jan 13, 2026
GPU Prefix Sum: From Multi-Kernel to Single-Pass Decoupled Lookback
Dec 24, 2025
Optimizing GPU Matrix Transpose: From 14% to 88% of Peak Bandwidth
Dec 13, 2025
GPU Parallel Reduction: Algorithm and Optimization Strategies
Nov 29, 2025
How Data Type Width Affects GPU Memory Throughput