amatrix exists because some linear-algebra workloads are dramatically faster on a GPU than on a CPU. It also exists because most workloads are not. This vignette is about telling those two apart on your machine.
The short version: a GPU helps when you’re amortizing upload cost over a large amount of compute, and when the compute is GPU-shaped (dense matmul, big factorizations, iterative solvers). For everything else — small matrices, per-row operations, low arithmetic intensity, unpredictable shapes — the CPU is either faster or not worth the routing complexity.
amatrix tries very hard to route to the CPU when the GPU wouldn’t win. If you’re ever unsure whether a given call is routing where you want, the dispatcher is introspectable: ask it.
amatrix makes two promises about performance:
Never worse than CPU by more than 10% on a calibrated op. When calibration data says a backend is slower than the CPU for a given (op, size) cell, the dispatcher routes to the CPU instead. You don’t need to guess — you can check.
Regressions are stop-ship. The nightly benchmark gate fails when any op slows down by more than 20% against the recorded baseline on the reference machine, so a performance regression blocks a release the same way a correctness failure does.
Neither promise is magic — both require a baseline and a calibration on your hardware. Running them is a one-time step per machine.
Performance is machine-specific. A baseline from a MacBook M3 is not comparable to one from a Linux GPU box, and the crossover point where a GPU starts beating a CPU depends on the CPU/GPU you happen to own. The first time you use amatrix on a machine — or any time the hardware changes — run:
This measures every calibration op on every registered backend at a
set of canonical sizes, derives the minimum workload size at which each
backend beats the CPU, and persists the result to the user cache
directory. The cache is tagged with a sys_hash covering
your OS, machine, and R version; if you move the cache to a different
machine, amatrix detects the mismatch and refuses to use the stale
data.
Inspect the current calibration at any time:
amatrix_benchmark_report() surfaces the cold-vs-warm
timings from the machine-local baseline plus the calibrated thresholds
in one call:
rep <- amatrix_benchmark_report()
# Cold vs warm per op × backend, with warm/cold ratio.
head(rep$baseline)
#> op size backend cold_ms warm_ms warm_vs_cold_ratio speedup_vs_cpu
#> 1 covariance 1024x128 cpu 11.19 11.24 1.00 1
#> 2 matmul 1024x128 cpu 5.17 5.17 1.00 1
#> ...
#> 10 many_lm 1024x128 cpu 10.90 1.13 9.67 1
# Calibrated thresholds: for each (backend, op) pair, the minimum workload
# where the backend beats CPU.
head(rep$calibration)warm_vs_cold_ratio is the signal you care about
most:
many_lm, the ratio is close to 10× because
the cache stores the QR factorization on the first call and reuses it on
subsequent calls.From smallest to largest, here’s what actually happens when you call
an op on an adgeMatrix:
Small matrices (roughly < 1000 elements). Upload latency alone dwarfs the compute. The dispatcher routes to CPU. A 32×32 matmul on MLX is slower than on base BLAS, every time.
Medium matrices (10³–10⁵ elements). Depends on
the op. Dense matmul and factorizations start winning on the GPU around
512×512. Per-row reductions (rowSums, colSums)
may still be CPU-bound because they’re memory-bound, not
compute-bound.
Large matrices (> 10⁵ elements, dense). The
GPU wins decisively for %*%, crossprod,
chol, svd, qr. Factor caching
(chol_factor, svd_factor) amplifies this: the
first call pays the upload + factorize cost, subsequent calls reuse the
cached factor at memory speed.
Very large matrices (memory-pressure regime, 4K×4K+). The baseline tracks these explicitly because cache effects dominate. The CPU’s speedup-vs-GPU ratio gets worse as n grows.
Iterative algorithms (rsvd,
block_lanczos, sinkhorn, irlba).
These do many sequential kernel invocations on resident data. If the
input is already GPU-resident, they amortize the upload cost across
hundreds of kernel launches. If the input is cold every time, you’re
paying for upload over and over.
amatrix.mlx, amatrix.arrayfire,
amatrix.opencl, amatrix.metal) are separate
packages; if they’re not installed, the dispatcher routes to CPU. The
health probe in amatrix_backend_status() tells you what’s
registered and which backends passed the canary check.If you’re debugging a performance surprise, don’t guess — ask the planner what it did:
The plan shows which backends were considered, which were rejected and why, and which one was chosen. The human-readable equivalent:
Two other surfaces help you understand runtime behavior:
amatrix_backend_status() reports per-backend
health (unprobed / healthy /
unhealthy:<reason>), precision modes, feature flags,
and capabilities. A backend that crashes during a health probe is marked
unhealthy and never routed to again this session.
amatrix_fallback_log() records every time dispatch
was forced to fall back from a GPU backend to CPU. After a clean
conformance run it should be empty. If it isn’t, something claimed
support for an op it couldn’t actually execute.
The baseline you compare against is machine-local. For day-to-day
inspection you rarely touch the raw data —
amatrix_benchmark_report() reads and summarizes it for you.
To measure a fresh baseline (a first run, or after a hardware
change), use the benchmark harness that ships in the package repository,
tools/benchmark-regression.R:
# First run or after a hardware change (writes a new baseline):
# Rscript tools/benchmark-regression.R --update
# Subsequent runs — compare to the saved baseline:
# Rscript tools/benchmark-regression.RRegression against the baseline is informational on your laptop; the authoritative gate runs in the nightly workflow on a reference machine.
amatrix ships with calibrated-conservative defaults: when no calibration exists, the dispatcher routes small workloads to CPU even if a GPU is available. This matches the Track 4 Speed Contract’s “never silently worse than CPU” rule. Once you’ve calibrated, the thresholds sharpen and the GPU gets its fair share of the work.
The bottom line: you do not need to guess where amatrix is
routing. Between amatrix_backend_plan(),
amatrix_benchmark_report(), and
amatrix_fallback_log(), the runtime state is always
introspectable.