--- title: Efficient Clustered fMRI Storage with H5ParcellatedMultiScan author: fmristore Package date: '`r Sys.Date()`' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Efficient Clustered fMRI Storage with H5ParcellatedMultiScan} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} params: family: red css: albers.css resource_files: - albers.css - albers.js includes: in_header: |- --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, warning = FALSE, fig.width = 7, fig.height = 4 ) if (requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("albersdown", quietly = TRUE)) ggplot2::theme_set(albersdown::theme_albers(params$family)) ``` ```{r load-packages} library(fmristore) library(neuroim2) ``` ## What is H5ParcellatedMultiScan? The `H5ParcellatedMultiScan` class stores multiple fMRI runs where brain voxels have been organized into spatial clusters (e.g., brain parcels, ROIs, or atlas regions). This approach is perfect for: - **Multi-run studies** with consistent spatial organization across runs - **ROI-based analyses** where you care about specific brain regions - **Memory efficiency** when working with large datasets - **Fast access** to time series from brain regions of interest Think of it as organizing your fMRI data like a filing cabinet: instead of having one giant drawer with all voxels mixed together, you have separate folders for each brain region, making it much faster to find what you need. ## Simple Example: From Data to Storage Let's start with the simplest possible workflow - creating some dummy data and saving it. ### Step 1: Create Brain Data ```{r create-simple-data} # Create a tiny "brain" for this example (10×10×5 voxels) brain_dims <- c(10, 10, 5) brain_space <- NeuroSpace(brain_dims, spacing = c(2, 2, 2)) # Define which voxels contain "brain tissue" (a simple box in the center) mask_data <- array(FALSE, brain_dims) mask_data[4:7, 4:7, 2:4] <- TRUE # Small 4×4×3 "brain" mask <- LogicalNeuroVol(mask_data, brain_space) cat("Our tiny brain has", sum(mask), "voxels\n") ``` ### Step 2: Create Brain Regions (Clusters) ```{r create-clusters} # Divide our brain into 3 regions n_voxels <- sum(mask) cluster_assignments <- rep(1:3, length.out = n_voxels) # Create the clustered volume clusters <- ClusteredNeuroVol(mask, cluster_assignments) # See how many voxels are in each region print(table(clusters@clusters)) ``` ### Step 3: Create Some fMRI Data ```{r create-fmri-data} # Create a 4D fMRI dataset (our 3D brain + time dimension) n_timepoints <- 50 fmri_dims <- c(brain_dims, n_timepoints) # Generate realistic-looking fMRI data (random but with some structure) fmri_data <- array(rnorm(prod(fmri_dims)), dim = fmri_dims) # Make the signal different in each cluster (so we can tell them apart) for (i in 1:3) { cluster_voxels <- which(clusters@clusters == i) mask_indices <- which(mask, arr.ind = TRUE) for (v in cluster_voxels) { x <- mask_indices[v, 1] y <- mask_indices[v, 2] z <- mask_indices[v, 3] # Add a small bias to each cluster fmri_data[x, y, z, ] <- fmri_data[x, y, z, ] + i * 0.5 } } # Convert to NeuroVec (the fmristore format for 4D data) nvec <- NeuroVec(fmri_data, NeuroSpace(fmri_dims)) cat("Created fMRI data with dimensions:", paste(dim(nvec), collapse = " × "), "\n") ``` ### Step 4: Save Using the New Simple Interface ```{r save-simple} # Save to HDF5 using the new write_dataset() function output_file <- tempfile(fileext = ".h5") # This is the new simplified way to save clustered data # Note: mask is extracted from the clusters object automatically write_dataset(nvec, file = output_file, clusters = clusters) cat("Data saved to:", basename(output_file), "\n") cat("File size:", round(file.size(output_file) / 1024, 1), "KB\n") ``` ### Step 5: Load Using the New Simple Interface ```{r load-simple} # Use the new read_dataset() function with automatic type detection experiment <- read_dataset(output_file) # Show what we got back experiment # The function automatically detected this is a parcellated experiment cat("Detected type:", class(experiment), "\n") cat("Number of runs:", n_scans(experiment), "\n") cat("Available scan names:", paste(scan_names(experiment), collapse = ", "), "\n") ``` ### Step 6: Extract and Visualize Data ```{r visualize-simple} # Get the first (and only) run my_run <- experiment@runs[[1]] # Extract time series from the first few voxels voxel_data <- series(my_run, i = 1:6) cat("Time series dimensions:", paste(dim(voxel_data), collapse = " × "), "\n") cat("(rows = time points, columns = voxels)\n\n") # Plot time series from different clusters to see they're different matplot(voxel_data[, 1:3], type = "l", main = "Time Series from Different Clusters", xlab = "Time Point", ylab = "Signal", col = c("red", "green", "blue"), lty = 1 ) legend("topright", legend = paste("Cluster", 1:3), col = c("red", "green", "blue"), lty = 1 ) ``` ## Understanding Single Scan vs Multi-Scan Containers When working with clustered neuroimaging data, it's important to understand the distinction between single scan and multi-scan containers: ### Single Scan: H5ParcellatedScanSummary - Contains data from **one** fMRI run/scan - Created by default when converting a single `ClusteredNeuroVec` object - More efficient for single-run analyses - Direct access to the data without container overhead ### Multi-Scan: H5ParcellatedMultiScan - Container that can hold **multiple** scans - Each scan is stored as an H5ParcellatedScan or H5ParcellatedScanSummary - Ideal for multi-run experiments or sessions - Provides unified access to all runs in an experiment ### Converting ClusteredNeuroVec Objects The neuroim2 package's `ClusteredNeuroVec` class naturally represents a single scan with cluster-averaged time series. You can convert it directly with `as_h5()`: - **Single scan (default):** `as_h5(cnvec, file = "out.h5")` creates an `H5ParcellatedScanSummary` - **Multi-scan container:** `as_h5(cnvec, file = "out.h5", as_multiscan = TRUE)` wraps it in an `H5ParcellatedMultiScan` For a detailed walkthrough of ClusteredNeuroVec conversion, including creating the clustering and accessing the resulting scans, see `vignette("H5ParcellatedScan")`. ### When to Use Each Format **Use Single Scan (default for ClusteredNeuroVec):** - Working with individual fMRI runs - Memory-constrained environments - Simple analyses on single datasets - When you don't need the container overhead **Use Multi-Scan Container:** - Multi-run experiments - Session-level analyses - When you need to add more scans later - Group studies with consistent parcellation ## Advanced Example: Multiple Runs with Summary Data Now let's see a more realistic example with multiple runs and both detailed and summarized data. ### Create Multiple Runs ```{r create-multi-runs} # Create 3 different fMRI runs with different characteristics run_names <- c("rest", "task1", "task2") runs_data <- list() for (i in 1:3) { # Each run has different length and characteristics n_time <- 40 + i * 10 # 50, 60, 70 timepoints run_dims <- c(brain_dims, n_time) # Generate data with run-specific patterns run_data <- array(rnorm(prod(run_dims)), dim = run_dims) # Add different activation patterns for each run if (i == 2) { # Task1: stronger activation in cluster 2 cluster_2_voxels <- which(clusters@clusters == 2) mask_indices <- which(mask, arr.ind = TRUE) for (v in cluster_2_voxels) { x <- mask_indices[v, 1] y <- mask_indices[v, 2] z <- mask_indices[v, 3] run_data[x, y, z, ] <- run_data[x, y, z, ] + sin(seq(0, 2 * pi, length.out = n_time)) } } else if (i == 3) { # Task2: stronger activation in cluster 3 cluster_3_voxels <- which(clusters@clusters == 3) mask_indices <- which(mask, arr.ind = TRUE) for (v in cluster_3_voxels) { x <- mask_indices[v, 1] y <- mask_indices[v, 2] z <- mask_indices[v, 3] run_data[x, y, z, ] <- run_data[x, y, z, ] + cos(seq(0, 2 * pi, length.out = n_time)) } } runs_data[[i]] <- NeuroVec(run_data, NeuroSpace(run_dims)) } cat( "Created", length(runs_data), "runs with", paste(sapply(runs_data, function(x) dim(x)[4]), collapse = ", "), "timepoints\n" ) ``` ### Save Multiple Runs ```{r save-multiple} # Save all runs at once using write_dataset multi_file <- tempfile(fileext = ".h5") # The new interface can handle a list of NeuroVec objects # The mask is extracted from the clusters automatically write_dataset(runs_data, file = multi_file, clusters = clusters, scan_names = run_names ) cat("Multi-run file size:", round(file.size(multi_file) / 1024, 1), "KB\n") ``` ### Automatic Type Detection ```{r type-detection} # Show off the automatic type detection feature detected_type <- detect_h5_type(multi_file) cat("Automatically detected file type:", detected_type, "\n") # This is equivalent to read_dataset() without specifying type multi_experiment <- read_dataset(multi_file) multi_experiment ``` ### Working with Multiple Runs ```{r work-with-multi} # Access each run cat("Available runs:\n") for (i in seq_along(multi_experiment@runs)) { run <- multi_experiment@runs[[i]] cat(" ", names(multi_experiment@runs)[i], ": ", paste(dim(run), collapse = " × "), " (", class(run), ")\n", sep = "" ) } # Extract data from a specific run task1_run <- multi_experiment@runs[["task1"]] task1_data <- series(task1_run, i = 1:3) # First 3 voxels # Show the activation pattern we added to task1 matplot(task1_data, type = "l", main = "Task1 Activation Pattern", xlab = "Time Point", ylab = "Signal" ) ``` ### Creating Summary Data ```{r create-summary} # Sometimes you only need averaged data per region, not individual voxels # The new interface makes this easy with the 'summary' parameter summary_file <- tempfile(fileext = ".h5") # Save just the cluster averages (much more compact) write_dataset(runs_data[[1]], # Use the first run file = summary_file, clusters = clusters, summary = TRUE, # This is the key parameter scan_name = "rest_summary" ) # Load and examine the summary data summary_experiment <- read_dataset(summary_file) summary_run <- summary_experiment@runs[[1]] # This gives us a much smaller matrix (time × clusters instead of time × voxels) summary_matrix <- as.matrix(summary_run) cat("Summary data dimensions:", paste(dim(summary_matrix), collapse = " × "), "\n") cat("(rows = time points, columns = clusters)\n") # Plot the average signal from each cluster matplot(summary_matrix, type = "l", lty = 1, col = c("red", "green", "blue"), main = "Average Signal per Brain Region", xlab = "Time Point", ylab = "Average Signal" ) legend("topright", legend = paste("Region", 1:3), col = c("red", "green", "blue"), lty = 1 ) ``` ## Key Functions Summary The new simplified interface provides three main functions: ### `write_dataset()` - Save Your Data ```{r write-examples, eval = FALSE} # Basic usage - save one fMRI run with clusters write_dataset(my_neurovec, file = "data.h5", clusters = my_clusters) # Save multiple runs at once write_dataset(list_of_neurovecs, file = "multi.h5", clusters = my_clusters, scan_names = c("run1", "run2", "run3") ) # Save only cluster averages (more compact) write_dataset(my_neurovec, file = "summary.h5", clusters = my_clusters, summary = TRUE ) # Use median instead of mean for summaries write_dataset(my_neurovec, file = "median.h5", clusters = my_clusters, summary = TRUE, summary_fun = median ) ``` ### `read_dataset()` - Load Your Data ```{r read-examples, eval = FALSE} # Automatic detection (recommended) my_data <- read_dataset("data.h5") # Specify type explicitly if needed my_data <- read_dataset("data.h5", type = "H5ParcellatedMultiScan") # Check what type a file contains file_type <- detect_h5_type("mystery_file.h5") ``` ### `detect_h5_type()` - Identify File Contents ```{r detect-examples, eval = FALSE} # Find out what's in an HDF5 file detect_h5_type("my_file.h5") # Returns: "H5ParcellatedMultiScan", "H5NeuroVec", "LatentNeuroVec", etc. ``` ## Memory Efficiency Tips The H5ParcellatedMultiScan format is designed for efficiency: ```{r efficiency-demo} # Compare file sizes cat( "Original in-memory data size per run: ~", round(object.size(runs_data[[1]]) / 1024, 1), "KB\n" ) cat( "HDF5 compressed file size: ", round(file.size(multi_file) / 1024, 1), "KB\n" ) cat( "Summary-only file size: ", round(file.size(summary_file) / 1024, 1), "KB\n" ) # Memory usage when loading cat( "\nMemory usage of loaded experiment object: ", format(object.size(multi_experiment), units = "KB"), "\n" ) # The actual data is loaded on-demand small_extraction <- series(multi_experiment@runs[[1]], i = 1:2) cat( "Memory usage of small data extraction: ", format(object.size(small_extraction), units = "KB"), "\n" ) ``` ## When to Use H5ParcellatedMultiScan **Good for:** - Multi-run studies with consistent brain parcellation - ROI-based analyses - Large datasets that don't fit in memory - Studies where you primarily analyze regional averages - Storing preprocessed data with known brain organization **Not ideal for:** - Single volumes (use `H5NeuroVol`) - Analyses requiring voxel-level flexibility across the whole brain - Data where the clustering might change between runs ## Cleanup ```{r cleanup} # Always close HDF5 connections when done close(experiment) close(multi_experiment) close(summary_experiment) # Clean up temporary files unlink(c(output_file, multi_file, summary_file)) ``` ## Next Steps - **For unclustered 4D data**: See `H5NeuroVec` documentation - **For dimensionality reduction**: Explore `LatentNeuroVec` - **For labeled brain regions**: Check out `LabeledVolumeSet` - **Package overview**: Run `help(package = "fmristore")`