---
title: "When is amatrix fast?"
output:
  rmarkdown::html_vignette:
    toc: yes
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{When is amatrix fast?}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

amatrix exists because some linear-algebra workloads are dramatically faster on a GPU than on a CPU. It also exists because **most workloads are not**. This vignette is about telling those two apart on your machine.

The short version: **a GPU helps when you're amortizing upload cost over a large amount of compute, and when the compute is GPU-shaped (dense matmul, big factorizations, iterative solvers).** For everything else — small matrices, per-row operations, low arithmetic intensity, unpredictable shapes — the CPU is either faster or not worth the routing complexity.

amatrix tries very hard to route to the CPU when the GPU wouldn't win. If you're ever unsure whether a given call is routing where you want, the dispatcher is introspectable: ask it.

## The speed contract

amatrix makes two promises about performance:

1. **Never worse than CPU by more than 10% on a calibrated op.** When calibration data says a backend is slower than the CPU for a given (op, size) cell, the dispatcher routes to the CPU instead. You don't need to guess — you can check.
2. **Regressions are stop-ship.** The nightly benchmark gate fails when any op slows down by more than 20% against the recorded baseline on the reference machine, so a performance regression blocks a release the same way a correctness failure does.

Neither promise is magic — both require a baseline and a calibration on your hardware. Running them is a one-time step per machine.

## Calibrate once per machine

Performance is machine-specific.
A baseline from a MacBook M3 is not comparable to one from a Linux GPU box, and the crossover point where a GPU starts beating a CPU depends on the CPU/GPU you happen to own.

The first time you use amatrix on a machine — or any time the hardware changes — run:

```{r eval = FALSE}
library(amatrix)

# Calibrate the currently-registered non-CPU backends.
amatrix_calibrate()
```

This measures every calibration op on every registered backend at a set of canonical sizes, derives the minimum workload size at which each backend beats the CPU, and persists the result to the user cache directory. The cache is tagged with a `sys_hash` covering your OS, machine, and R version; if you move the cache to a different machine, amatrix detects the mismatch and refuses to use the stale data.

Inspect the current calibration at any time:

```{r eval = FALSE}
info <- amatrix_calibration_info()
str(info$thresholds, max.level = 2)
```

## Read the benchmark report

`amatrix_benchmark_report()` surfaces the cold-vs-warm timings from the machine-local baseline plus the calibrated thresholds in one call:

```{r eval = FALSE}
rep <- amatrix_benchmark_report()

# Cold vs warm per op × backend, with the warm_vs_cold_ratio column.
head(rep$baseline)
#>            op     size backend cold_ms warm_ms warm_vs_cold_ratio speedup_vs_cpu
#> 1  covariance 1024x128     cpu   11.19   11.24               1.00              1
#> 2      matmul 1024x128     cpu    5.17    5.17               1.00              1
#> ...
#> 10    many_lm 1024x128     cpu   10.90    1.13               9.67              1

# Calibrated thresholds: for each (backend, op) pair, the minimum workload
# where the backend beats CPU.
head(rep$calibration)
```

`warm_vs_cold_ratio` is the signal you care about most. Judging by the sample rows above, it is the cold time divided by the warm time:

- **≈ 1.0** means the op has no warm-up overhead. First call and tenth call cost the same.
- **> 2.0** means there's a significant cold-start cost — typically the first upload of the matrix to the backend, or the first call pulling in a JIT-compiled kernel. The warm runs are fast; the cold run is not.
- For ops like `many_lm`, the ratio is close to 10× because the cache stores the QR factorization on the first call and reuses it on subsequent calls.

## When the GPU wins

From smallest to largest, here's what actually happens when you call an op on an `adgeMatrix`:

1. **Small matrices (roughly < 1000 elements).** Upload latency alone dwarfs the compute. The dispatcher routes to CPU. A 32×32 matmul on MLX is slower than on base BLAS, every time.
2. **Medium matrices (10³–10⁵ elements).** Depends on the op. Dense matmul and factorizations start winning on the GPU around 512×512. Row- and column-wise reductions (`rowSums`, `colSums`) may still be faster on the CPU because they're memory-bound, not compute-bound.
3. **Large matrices (> 10⁵ elements, dense).** The GPU wins decisively for `%*%`, `crossprod`, `chol`, `svd`, `qr`. Factor caching (`chol_factor`, `svd_factor`) amplifies this: the first call pays the upload + factorize cost, subsequent calls reuse the cached factor at memory speed.
4. **Very large matrices (memory-pressure regime, 4K×4K+).** The baseline tracks these explicitly because cache effects dominate. The CPU's performance relative to the GPU gets worse as n grows.
5. **Iterative algorithms** (`rsvd`, `block_lanczos`, `sinkhorn`, `irlba`). These do many sequential kernel invocations on resident data. If the input is already GPU-resident, they amortize the upload cost across hundreds of kernel launches. If the input is cold every time, you're paying for upload over and over.

## When the CPU wins (and you should let it)

- **1×1, 1×n, n×1 matrices.** Always CPU. Degenerate shapes never route to a GPU backend; the package's adversarial-input test suite pins this behavior.
- **Element-wise ops on tiny matrices.** The dispatch overhead exceeds the compute cost.
- **Ops that allocate more than they compute.** If the GPU spends more time copying bytes than running flops, the CPU wins.
- **Backends you don't have.** Optional backends (`amatrix.mlx`, `amatrix.arrayfire`, `amatrix.opencl`, `amatrix.metal`) are separate packages; if they're not installed, the dispatcher routes to CPU. The health probe in `amatrix_backend_status()` tells you what's registered and which backends passed the canary check.

## Ask the dispatcher

If you're debugging a performance surprise, don't guess — ask the planner what it did:

```{r eval = FALSE}
x <- adgeMatrix(matrix(rnorm(1e6), 1000, 1000))
amatrix_backend_plan(x, "matmul", y = diag(1000))
```

The plan shows which backends were considered, which were rejected and why, and which one was chosen. The human-readable equivalent:

```{r eval = FALSE}
amatrix_explain(x %*% diag(1000))
```

## Residency and fallback telemetry

Two other surfaces help you understand runtime behavior:

- `amatrix_backend_status()` reports per-backend `health` (`unprobed` / `healthy` / `unhealthy:`), precision modes, feature flags, and capabilities. A backend that crashes during a health probe is marked `unhealthy` and never routed to again this session.
- `amatrix_fallback_log()` records every time dispatch was forced to fall back from a GPU backend to CPU. After a clean conformance run it should be empty. If it isn't, something claimed support for an op it couldn't actually execute.

```{r eval = FALSE}
amatrix_backend_status()
amatrix_fallback_log()
```

## Regenerating the baseline

The baseline you compare against is machine-local. For day-to-day inspection you rarely touch the raw data — `amatrix_benchmark_report()` reads and summarizes it for you.
To *measure* a fresh baseline (a first run, or after a hardware change), use the benchmark harness that ships in the package repository, [`tools/benchmark-regression.R`](https://github.com/bbuchsbaum/amatrix/blob/main/tools/benchmark-regression.R):

```{r eval = FALSE}
# First run or after a hardware change (writes a new baseline):
# Rscript tools/benchmark-regression.R --update

# Subsequent runs — compare to the saved baseline:
# Rscript tools/benchmark-regression.R
```

Regression against the baseline is informational on your laptop; the authoritative gate runs in the nightly workflow on a reference machine.

## Honest defaults

amatrix ships with calibrated-conservative defaults: when no calibration exists, the dispatcher routes small workloads to CPU even if a GPU is available. This matches the Track 4 Speed Contract's "never silently worse than CPU" rule. Once you've calibrated, the thresholds sharpen and the GPU gets its fair share of the work.

The bottom line: **you do not need to guess where amatrix is routing.** Between `amatrix_backend_plan()`, `amatrix_benchmark_report()`, and `amatrix_fallback_log()`, the runtime state is always introspectable.
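
As a rough illustration of the routing rule described under "Honest defaults", the base-R sketch below derives a crossover threshold from a small timing table and falls back to a conservative cutoff when no threshold has been calibrated. It does not call amatrix and is not how the dispatcher is implemented: the timings are made up, and `derive_threshold()`, `route()`, and `default_threshold` are hypothetical names chosen only to show the shape of the decision.

```{r eval = FALSE}
# Made-up timings (ms) at a few canonical sizes; NOT amatrix measurements.
timings <- data.frame(
  n_elements = c(1e3, 1e4, 1e5, 1e6),
  cpu_ms     = c(0.02, 0.30, 4.00, 60.0),
  gpu_ms     = c(0.50, 0.60, 1.50, 6.00)
)

# Hypothetical helper: smallest measured size from which the backend beats
# the CPU at that size and at every larger measured size. Returns Inf when
# the backend never wins.
derive_threshold <- function(timings) {
  timings <- timings[order(timings$n_elements), ]
  wins <- timings$gpu_ms < timings$cpu_ms
  # TRUE at position i only if the backend wins at i and at all larger sizes.
  wins_from_here <- rev(cumprod(rev(wins))) == 1
  if (!any(wins_from_here)) return(Inf)
  min(timings$n_elements[wins_from_here])
}

# Assumed conservative cutoff used when no calibration exists (hypothetical).
default_threshold <- 1e6

# Hypothetical routing rule: GPU only at or above the threshold.
route <- function(n_elements, threshold = NULL) {
  if (is.null(threshold)) threshold <- default_threshold
  if (n_elements >= threshold) "gpu" else "cpu"
}

threshold <- derive_threshold(timings)  # 1e5 for the table above

route(5e5)                         # "cpu": uncalibrated, conservative default
route(5e5, threshold = threshold)  # "gpu": calibrated threshold is sharper
route(5e4, threshold = threshold)  # "cpu": below the crossover
```

With these made-up numbers, a workload of 5e5 elements stays on the CPU until a calibration supplies the sharper 1e5 threshold, which is the behavior the section describes.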