Architecture Overview

fmridataset separates how data is stored from how analyses consume it. This document gives you a mental model of the three-layer stack so you can predict how components interact, choose the right class for a given task, and extend the package without modifying core code.

Layer Diagram

  ┌──────────────────────────────────────────────────────┐
  │  User layer                                          │
  │  fmri_group        fmri_study_dataset                │
  │  (multi-study)     (multi-subject)                   │
  ├──────────────────────────────────────────────────────┤
  │  Dataset layer                                       │
  │  matrix_dataset    fmri_file_dataset   latent_dataset│
  │  (in-memory)       (NIfTI / HDF5)      (embedding)   │
  ├──────────────────────────────────────────────────────┤
  │  Backend layer                                       │
  │  matrix_backend  nifti_backend  h5_backend           │
  │  zarr_backend    study_backend                       │
  └──────────────────────────────────────────────────────┘
        ↑ all backends implement the same five-method contract

Each layer only knows about the layer directly beneath it. Analysis code that calls get_data_matrix() or data_chunks() works identically whether the data lives in RAM, on disk, or in a cloud-hosted Zarr store.

The Backend Contract

Every backend must implement five generic functions:

backend_open(backend)            # open file handles / allocate resources
backend_close(backend)           # release resources
backend_get_dims(backend)        # return list(spatial = c(x,y,z), time = n)
backend_get_data(backend,        # return matrix [time x voxels]
                 rows   = NULL,
                 cols   = NULL)
validate_backend(backend)        # stop() on contract violations

rows selects timepoints and cols selects voxels; both default to NULL, which selects everything along that axis. Backends may add caching, memory-mapping, or chunked I/O behind those five calls without any dataset-layer changes. For a full walkthrough of writing and registering a backend, see the backend-development-basics vignette.
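The rows / cols convention can be tried out without the package. The sketch below uses a hypothetical helper, subset_time_by_voxel(), which is not part of fmridataset; it only assumes what the contract above states, namely that a NULL index means "all" and that the result is always a [time x voxels] matrix.

```r
# Hypothetical helper (not part of fmridataset): mimics the rows/cols
# semantics of backend_get_data() on a plain [time x voxels] matrix.
subset_time_by_voxel <- function(mat, rows = NULL, cols = NULL) {
  if (is.null(rows)) rows <- seq_len(nrow(mat))  # NULL means all timepoints
  if (is.null(cols)) cols <- seq_len(ncol(mat))  # NULL means all voxels
  mat[rows, cols, drop = FALSE]                  # drop = FALSE keeps a matrix
}

mat <- matrix(seq_len(20), nrow = 4, ncol = 5)   # 4 timepoints, 5 voxels

full   <- subset_time_by_voxel(mat)                           # 4 x 5
block  <- subset_time_by_voxel(mat, rows = 1:2, cols = c(1, 5))  # 2 x 2
single <- subset_time_by_voxel(mat, rows = 3, cols = 2)       # 1 x 1 matrix
```

Using drop = FALSE matters here: without it, a single-row or single-column request would collapse to a plain vector and break code that expects a matrix.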

Dataset Classes

matrix_dataset

Wraps an in-memory [time x voxels] matrix. Use it for simulated data, preprocessed ROI time series, or any situation where the full dataset fits comfortably in RAM.

ds <- matrix_dataset(
  datamat    = matrix(rnorm(100 * 500), nrow = 100, ncol = 500),
  TR         = 2.0,
  run_length = c(50, 50)
)

fmri_file_dataset

Points at NIfTI or HDF5 files without loading them. The underlying nifti_backend or h5_backend loads voxel blocks on demand, so construction is near-instantaneous even for large scan collections.

ds <- fmri_file_dataset(
  scans      = c("run1.nii.gz", "run2.nii.gz"),
  mask       = "brain_mask.nii.gz",
  TR         = 2.0,
  run_length = c(200, 200)
)

latent_dataset

Stores data in a lower-dimensional embedding space (e.g., ICA components, PCA scores) rather than voxel space. The interface is identical to the other dataset classes, but columns index components rather than voxels.

ds <- latent_dataset(
  loadings   = component_matrix,   # voxels x components
  scores     = score_matrix,       # time   x components
  TR         = 2.0,
  run_length = c(100, 100)
)
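The shapes in the comments above imply a simple relationship between the embedding and voxel space. The following base-R sketch shows only that linear algebra; whether and when latent_dataset performs such a reconstruction is not stated here, and the matrices are random stand-ins for component_matrix and score_matrix.

```r
set.seed(1)
n_time <- 6
n_vox  <- 10
n_comp <- 3

# Illustrative stand-ins for component_matrix and score_matrix
loadings <- matrix(rnorm(n_vox * n_comp),  nrow = n_vox,  ncol = n_comp)  # voxels x components
scores   <- matrix(rnorm(n_time * n_comp), nrow = n_time, ncol = n_comp)  # time   x components

# Project back to voxel space:
# [time x components] %*% [components x voxels] = [time x voxels]
voxel_data <- scores %*% t(loadings)

dim(voxel_data)  # 6 10
```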

fmri_study_dataset

Aggregates multiple single-subject datasets under one object. Subject-level data stays lazy; you stream it through data_chunks() (with runwise = TRUE for one block per run) or pull one subject at a time.

study <- fmri_study_dataset(
  datasets    = list(sub01_ds, sub02_ds, sub03_ds),
  subject_ids = c("sub-01", "sub-02", "sub-03")
)

Temporal Structure

Every dataset carries a sampling_frame that models the acquisition timeline.

# Constructed automatically, or explicitly:
sf <- sampling_frame(
  blocklens = c(150, 150),   # run lengths in TRs
  TR        = 2.0
)

get_TR(sf)              # 2.0
get_run_lengths(sf)     # 150 and 150
get_total_duration(sf)  # 600 seconds

The sampling frame handles all timepoint-to-second and TR-index conversions, so the rest of the codebase never does raw arithmetic on timing. An event_table can be attached to any dataset; onset times are validated against the sampling frame at assignment.
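To make the conversions concrete, here is the underlying arithmetic in base R for the example above. This is a sketch of the bookkeeping only, not the sampling_frame implementation; run_of() and time_in_run() are hypothetical helpers, and the convention that volume i starts at (i - 1) * TR seconds is an assumption.

```r
blocklens <- c(150, 150)  # run lengths in TRs, as in the example above
TR        <- 2.0          # seconds per volume

# Total acquisition time across runs
total_duration <- sum(blocklens) * TR                 # 600 seconds

# First global TR index of each run
run_start_idx <- cumsum(c(1, head(blocklens, -1)))    # 1, 151

# Hypothetical helper: map a global TR index to its run number
run_of <- function(idx) findInterval(idx, run_start_idx)

# Hypothetical helper: seconds since the start of the containing run,
# assuming volume i of a run begins at (i - 1) * TR
time_in_run <- function(idx) (idx - run_start_idx[run_of(idx)]) * TR

run_of(c(150, 151))       # 1 2
time_in_run(c(150, 151))  # 298 0
```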

Run lengths also drive the runwise chunking mode: when you request data_chunks(ds, runwise = TRUE) the iterator yields one [time x voxels] block per run, boundaries already aligned to the sampling frame.
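The effect of runwise chunking can be reproduced on a plain matrix. The sketch below is not the data_chunks() iterator; it only shows, in base R, how run lengths partition the rows of a [time x voxels] matrix into one block per run.

```r
run_length <- c(50, 50)
n_vox      <- 5
datamat    <- matrix(rnorm(sum(run_length) * n_vox),
                     nrow = sum(run_length), ncol = n_vox)

# Label every timepoint with its run, then group row indices by that label
run_id  <- rep(seq_along(run_length), times = run_length)
row_idx <- split(seq_len(nrow(datamat)), run_id)

# One [time x voxels] block per run
chunks <- lapply(row_idx, function(i) datamat[i, , drop = FALSE])

vapply(chunks, nrow, integer(1))  # 50 rows in each of the two blocks
```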

Extension Points

Adding a New Backend

Define a constructor for your new class and implement the five contract functions for it (registration is covered in the backend-development-basics vignette):

my_backend <- function(source, ...) {
  structure(list(source = source, ...), class = c("my_backend", "storage_backend"))
}

# Method arguments are named backend to match the generics.
# Each stop() marks a body you still have to write.
backend_open.my_backend <- function(backend) {
  stop("TODO: open the underlying source")
}
backend_close.my_backend <- function(backend) {
  stop("TODO: release resources")
}
backend_get_dims.my_backend <- function(backend) {
  stop("TODO: return a named list with spatial and time entries")
}
backend_get_data.my_backend <- function(backend, rows = NULL, cols = NULL) {
  stop("TODO: return a [time x voxels] matrix")
}
validate_backend.my_backend <- function(backend) {
  stop("TODO: check invariants")
}

Once those five methods exist, your backend can be used inside any fmri_file_dataset by passing an instance via the backend argument, or inside a custom dataset subclass.
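For a filled-in illustration, the following sketch implements the five methods for a toy in-memory backend. Everything named toy_backend is hypothetical. The generics are redefined locally only so the example runs without fmridataset installed; with the package loaded you would rely on its own generics. The return values chosen here (backend_open() returning the backend, validate_backend() returning TRUE invisibly) and the invariants checked are assumptions, not documented requirements.

```r
# Stand-in generics so the sketch is self-contained
backend_open     <- function(backend) UseMethod("backend_open")
backend_close    <- function(backend) UseMethod("backend_close")
backend_get_dims <- function(backend) UseMethod("backend_get_dims")
backend_get_data <- function(backend, rows = NULL, cols = NULL) {
  UseMethod("backend_get_data")
}
validate_backend <- function(backend) UseMethod("validate_backend")

# Hypothetical toy backend: a [time x voxels] matrix plus spatial dims
toy_backend <- function(mat, spatial) {
  structure(list(mat = mat, spatial = spatial),
            class = c("toy_backend", "storage_backend"))
}

backend_open.toy_backend  <- function(backend) backend          # nothing to open
backend_close.toy_backend <- function(backend) invisible(NULL)  # nothing to release

backend_get_dims.toy_backend <- function(backend) {
  list(spatial = backend$spatial, time = nrow(backend$mat))
}

backend_get_data.toy_backend <- function(backend, rows = NULL, cols = NULL) {
  if (is.null(rows)) rows <- seq_len(nrow(backend$mat))  # NULL means all
  if (is.null(cols)) cols <- seq_len(ncol(backend$mat))
  backend$mat[rows, cols, drop = FALSE]
}

validate_backend.toy_backend <- function(backend) {
  if (!is.matrix(backend$mat)) stop("mat must be a matrix")
  if (length(backend$spatial) != 3) stop("spatial must have three entries")
  invisible(TRUE)
}

b <- toy_backend(matrix(rnorm(10 * 8), nrow = 10), spatial = c(2, 2, 2))
validate_backend(b)
b          <- backend_open(b)
dims       <- backend_get_dims(b)
first_rows <- backend_get_data(b, rows = 1:3)
backend_close(b)
```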

BIDS + HDF5 as a Concrete Example

fmri_file_dataset with h5_backend is the recommended pattern for preprocessed BIDS derivatives stored in HDF5. The h5_backend uses chunk-aware reads aligned to the HDF5 chunk lattice, so iterating through a BIDS cohort with data_chunks() achieves near-optimal I/O without any changes to analysis code.
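As a rough picture of what reads "aligned to the chunk lattice" mean along the time axis, the sketch below splits a run into read ranges that begin and end on chunk boundaries. chunk_aligned_ranges() is a hypothetical helper, the chunk length of 64 is an arbitrary assumption, and the actual read strategy of h5_backend is not shown in this document.

```r
# Hypothetical helper (not part of fmridataset or any HDF5 library):
# split n_time rows into ranges that start and end on chunk boundaries.
chunk_aligned_ranges <- function(n_time, chunk_len) {
  starts <- seq(1, n_time, by = chunk_len)
  ends   <- pmin(starts + chunk_len - 1, n_time)  # last range may be shorter
  data.frame(start = starts, end = ends)
}

ranges <- chunk_aligned_ranges(n_time = 200, chunk_len = 64)
ranges
# start values: 1, 65, 129, 193
# end values:   64, 128, 192, 200
```

Each range covers whole chunks, so no stored chunk has to be read twice to serve two adjacent requests.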

Object Zoo

Class                 Constructor             Purpose
matrix_dataset        matrix_dataset()        In-memory matrix, full random access
fmri_file_dataset     fmri_file_dataset()     Lazy NIfTI / HDF5 file access
latent_dataset        latent_dataset()        Embedding / component space data
fmri_study_dataset    fmri_study_dataset()    Multi-subject container
sampling_frame        sampling_frame()        Temporal structure for one session
matrix_backend        matrix_backend()        In-memory backend (internal)
nifti_backend         nifti_backend()         NIfTI file backend
h5_backend            h5_backend()            HDF5 backend with chunk-aware I/O
zarr_backend          zarr_backend()          Zarr backend for cloud-native arrays
study_backend         study_backend()         Multi-subject lazy backend