Paper Summaries

This section provides detailed summaries of the key papers informing our GPU SpMV implementation, including core contributions, mathematical frameworks, and formulas.


1. Bell & Garland (SC ‘09) — Foundational CSR SpMV on GPUs

Reference: Bell, N., & Garland, M. (2009). Efficient Sparse-Matrix–Vector Multiplication on GPUs. SIAM Conference on Data Mining (SDM ‘09) — also published as NVIDIA Technical Report NVR-2008-004.

Core Contribution First systematic implementation of SpMV on CUDA GPUs. Established the canonical thread-to-row mapping for CSR format and demonstrated that GPU SpMV outperforms CPU by 3–10× for large sparse matrices.

CSR Row-Parallel Kernel

Each CUDA thread handles one matrix row:

\[y_i = \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} \text{values}[k] \cdot x_{\text{col\_index}[k]}\]

Thread mapping:

\[\text{thread\_id} = \text{block\_idx} \times \text{block\_dim} + \text{thread\_idx}\]

Memory Coalescing

For CSR layout, adjacent threads access adjacent values and col_index entries:

\[\text{thread}_j \text{ accesses} \rightarrow \text{values}[\text{base} + j]\]

This sequential access pattern triggers coalesced global memory loads on Ampere (128-byte cache lines, 32-byte transactions).

Shared Memory tiling

Hot data is cached in shared memory to reduce global memory traffic:

\[\text{shared\_mem}[i] = \text{values}[i] \quad \text{(loaded per warp)}\]

Shared memory latency: ~1–10 ns vs. global memory: ~100–400 ns.

Warp Utilization

For rows with very few non-zeros, warps are underutilized:

\[\text{active\_threads}_i = \text{row\_ptr}[i+1] - \text{row\_ptr}[i]\]

When \(\text{active\_threads}_i < 32\), some threads in the warp are idle.

Relevance to This Project

Our v1 kernel directly follows this design. The thread-to-row mapping and coalesced memory access patterns are the baseline we optimize against in v2.

Key Formula: Effective Memory Bandwidth

\[B_{\text{eff}} = \frac{2 \cdot \text{nnz} \cdot \sizeof(\text{double})}{t_{\text{kernel}}} = \frac{16 \cdot \text{nnz}}{t_{\text{ms}}} \quad \text{[GB/s]}\]

where \(t_{\text{kernel}}\) is the kernel execution time in seconds.


2. Greathouse & Daga (SC ‘14) — CSR-Adaptive

Reference: Greathouse, J. L., & Daga, M. (2014). Efficient Sparse Matrix-Vector Multiplication on GPUs Using a CSR-Adaptive Storage Format. SC ‘14.

Core Contribution Addressed load imbalance in CSR SpMV for irregular matrices (wide variance in row lengths). Proposed adaptive row grouping to improve warp utilization.

Problem: Warp Divergence

With simple row-to-thread mapping:

\[\text{row\_len}_i = \text{row\_ptr}[i+1] - \text{row\_ptr}[i]\]

A warp processing 32 consecutive rows may encounter:

\[\text{row\_len} \in [1, 1000+] \quad \text{(huge variance)}\]

Long rows cause threads to stall, wasting GPU resources.

CSR-Adaptive Row Grouping

Group rows by workload into warps of similar rows:

\[\text{warp\_size} = 32 \quad \text{(one warp per group)}\]

Group construction:

  1. Compute row lengths \(\text{row\_len}_i\)

  2. Sort rows by length (or use histogram-based binning)

  3. Assign 32 consecutive rows of similar length to each warp

The offset array row_ptr is augmented with warp-start markers.

Key Formula: Load Balancing Efficiency

\[\eta_{\text{load}} = \frac{\sum_{i} \text{row\_len}_i}{\text{num\_warps} \times \max_i \text{row\_len}_i}\]

For uniform row lengths, \(\eta_{\text{load}} \approx 1.0\). For power-law distributions, naive CSR gives \(\eta_{\text{load}} \ll 1\).

Relevance to This Project

Our v2 kernel incorporates adaptive row grouping to handle the irregular SuiteSparse matrices in our evaluation dataset. This is the primary optimization for the “irregular” matrices in our benchmark.


3. Liu & Vinter (ICS ‘15) — CSR5

Reference: Liu, W., & Vinter, B. (2015). CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ‘15.

Core Contribution Proposed CSR5, a storage format and algorithm that achieves load balancing without preprocessing the row structure. Uses a segmented prefix sum approach.

CSR5 Data Layout

CSR5 adds two auxiliary arrays:

  • tile_ptr: pointer to the start of each tile (group of rows)

  • tile_desc: per-tile metadata (offset, length, etc.)

\[\mathbf{A} \rightarrow \text{tiles} \quad \text{(power-of-2 row groups)}\]

Tile size is chosen to be a power of 2 for efficient bitwise operations.

Key Algorithm: Segmented Prefix Sum

\[\text{output\_offset}[i] = \sum_{j=0}^{i} \text{nnz}_j\]

But with segment boundaries at tile edges:

\[\text{tile\_ptr}[t] = \min\{k : \sum_{i \in \text{tile}_t} \text{nnz}_i \leq \text{threshold}\}\]

Key Formula: Tile Size Selection

\[\text{tile\_size} = 2^{\lceil \log_2(\sqrt{\text{nnz}}) \rceil}\]

This balances between:

  1. Enough rows per tile for coalesced access

  2. Few enough rows per tile for load balancing

Relevance to This Project

CSR5’s segmented prefix sum influenced our v2 kernel’s dynamic row grouping. The tile-based view provides an alternative to Greathouse’s histogram approach.


4. Chu et al. (HPDC ‘23) — Ampere-Aware Optimization

Reference: Chu, G., et al. (2023). Efficient Algorithm Design of Optimizing SpMV on GPU. HPDC ‘23.

Core Contribution Systematic optimization targeting NVIDIA Ampere architecture (RTX 30 series, A100). Covers warp scheduling, memory access patterns, and occupancy tuning.

Ampere Memory Hierarchy

Level

Latency

Bandwidth

L1 cache

~1–3 cycles

~12,288 GB/s

L2 cache

~40 cycles

~2,000 GB/s

Global mem

~400 cycles

~1,500 GB/s

SpMV is memory-bound, so optimization focus is on:

  1. Reducing global memory accesses

  2. Improving L1/L2 cache hit rates

  3. Using non-coherent loads for read-only data

Non-Coherent Global Loads

For the input vector x (read-only, reused across rows):

x_val = __ldg(&x[col_index[k]]);  // bypasses cache, reduces coherency overhead

The __ldg intrinsic generates ld.global.nc (non-coherent global load).

Key Formula: Occupancy vs. Performance

\[\text{occupancy} = \frac{\text{active\_warps}}{\text{max\_warps\_per\_SM}}\]

Higher occupancy hides memory latency better:

\[\text{latency\_hidden} = \text{occupancy} \times \text{latency}_{\text{mem}}\]

Recommended block sizes for CSR SpMV on Ampere:

\[\text{block\_size} \in [128, 256, 512] \quad \text{(multiples of 32)}\]

Dynamic Warp Scheduling

Ampere’s warp scheduler can issue instructions from multiple warps simultaneously. For irregular CSR:

\[\text{SIMD\_efficiency} = \frac{\text{active\_threads}}{\text{warp\_size} \times \text{warps\_per\_block}}\]

Relevance to This Project

Our v2 kernel directly incorporates Ampere-aware optimizations: non-coherent loads for x, occupancy-tuned block size, and dynamic warp scheduling for load balancing.


5. Merrill & Garland (SC ‘16) — Merge-Based SpMV

Reference: Merrill, D., & Garland, M. (2016). Merge-based Parallel Sparse Matrix-Vector Multiplication. SC ‘16.

Core Contribution Used merge path parallelism (originally for graph algorithms) to improve load balancing for irregular matrices without preprocessing.

Merge Path SpMV

Treat SpMV as a merge of two sorted sequences:

  1. Row non-zero lists (sorted by column index)

  2. Input vector elements (sorted by row index)

The merge path divides work among threads with minimal synchronization.

Key Formula: Merge Path Load Balance

\[\text{work\_per\_thread}_i = \frac{\text{nnz}}{\text{num\_threads}} \pm \sigma\]

where \(\sigma\) is the variance in row lengths.

Compared to naive row partitioning:

\[\frac{\sigma_{\text{merge}}}{\sigma_{\text{row}}}} < 0.1 \quad \text{(for power-law matrices)}\]

Relevance to This Project

Merge path parallelism provides a theoretical upper bound for load balancing efficiency. Our v2 kernel’s adaptive grouping approaches this bound.


6. Niu et al. (IPDPS 2021) — Tiled SpMV

Reference: Niu, J., et al. (2021). Tiled SpMV for Local Structure Exploitation. IPDPS 2021.

Core Contribution Exploited local structure in sparse matrices (dense subregions) using tile-based organization. Tiles with dense internal structure use optimized inner kernels.

Tile Classification

\[\text{tile\_density}_t = \frac{\text{nnz}_t}{\text{rows}_t \times \text{cols}_t}\]

Tiles are classified as:

  • Sparse tile: \(\text{tile\_density}_t < \tau\) (threshold)

  • Dense tile: \(\text{tile\_density}_t \geq \tau\)

Key Formula: Optimal Tile Size

\[\tau = \frac{1}{\text{vector\_length}} \quad \text{(typically } \tau \approx 0.01 \text{)}\]

Dense tiles use vectorized inner products; sparse tiles use standard CSR.

Relevance to This Project

If our SuiteSparse matrices contain dense subregions, tiled SpMV could be applied as a hybrid optimization for our v2 kernel. The tile concept complements CSR5’s tile_ptr approach.


7. Gao et al. (2024) — Systematic Literature Survey

Reference: Gao, J., Liu, B., Ji, W. and Huang, H. (2024). A Systematic Literature Survey of Sparse Matrix-Vector Multiplication. arXiv:2404.06047.

Core Contribution Comprehensive survey of SpMV research (2010–2024), categorizing optimizations by:

  1. Storage format (CSR, COO, ELL, HYB, JDS, SELL-C, CSR5)

  2. Hardware target (GPU, CPU, accelerator, distributed)

  3. Matrix structure (regular, irregular, power-law, block-structured)

Taxonomy of GPU SpMV Optimizations

Category

Techniques

Memory

Coalescing, tiling, prefetching, compression

Compute

Vectorization, warp tiling, unrolling

Load balancing

Row grouping, dynamic scheduling, merge path

Format

Hybrid formats, adaptive format selection

Relevance to This Project

This survey provides the broader context for our design decisions. Our v1/v2 kernel classification aligns with the “basic CSR” → “CSR-Adaptive” trajectory in the literature.


Summary Table: Papers vs. Optimizations

References

  • Bell, N., & Garland, M. (2009). Efficient Sparse-Matrix–Vector Multiplication on GPUs. NVIDIA Technical Report NVR-2008-004 / SIAM SDM ‘09.

  • Greathouse, J. L., & Daga, M. (2014). Efficient Sparse Matrix-Vector Multiplication on GPUs Using a CSR-Adaptive Storage Format. SC ‘14.

  • Liu, W., & Vinter, B. (2015). CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ‘15.

  • Chu, G., et al. (2023). Efficient Algorithm Design of Optimizing SpMV on GPU. HPDC ‘23.

  • Merrill, D., & Garland, M. (2016). Merge-based Parallel Sparse Matrix-Vector Multiplication. SC ‘16.

  • Niu, J., et al. (2021). Tiled SpMV for Local Structure Exploitation. IPDPS 2021.

  • Gao, J., Liu, B., Ji, W. & Huang, H. (2024). A Systematic Literature Survey of Sparse Matrix-Vector Multiplication. arXiv:2404.06047.