Paper Summaries
This section provides detailed summaries of the key papers informing our GPU SpMV implementation, including core contributions, mathematical frameworks, and formulas.
1. Bell & Garland (SC ‘09) — Foundational CSR SpMV on GPUs
Reference: Bell, N., & Garland, M. (2009). Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC '09; an earlier version appeared as NVIDIA Technical Report NVR-2008-004, Efficient Sparse Matrix-Vector Multiplication on CUDA.
Core Contribution First systematic implementation of SpMV on CUDA GPUs. Established the canonical thread-to-row mapping for CSR format and demonstrated that GPU SpMV outperforms CPU by 3–10× for large sparse matrices.
CSR Row-Parallel Kernel
Each CUDA thread handles one matrix row:
Thread mapping: row = blockIdx.x * blockDim.x + threadIdx.x, so thread row accumulates y[row] over the non-zeros from row_ptr[row] to row_ptr[row+1].
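As a CPU reference for this mapping (a sketch, not the paper's code; the function and parameter names are ours), each loop iteration below plays the role of one CUDA thread:

```cpp
#include <vector>

// CPU sketch of the scalar CSR kernel: each "thread" (here, one loop
// iteration over `row`) owns one matrix row, mirroring the GPU mapping
// row = blockIdx.x * blockDim.x + threadIdx.x.
std::vector<double> csr_spmv_row_parallel(
    const std::vector<int>& row_ptr,     // size m+1
    const std::vector<int>& col_index,   // size nnz
    const std::vector<double>& values,   // size nnz
    const std::vector<double>& x) {
  const int m = static_cast<int>(row_ptr.size()) - 1;
  std::vector<double> y(m, 0.0);
  for (int row = 0; row < m; ++row) {    // one GPU thread per row
    double sum = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
      sum += values[k] * x[col_index[k]];
    y[row] = sum;
  }
  return y;
}
```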
Memory Coalescing
In CSR, a warp's loads coalesce when consecutive threads read consecutive values and col_index entries, as in the warp-per-row (vector) kernel; the scalar thread-per-row kernel, where each thread strides through its own row, only approximates this.
Sequential access within a warp triggers coalesced global memory loads on Ampere (128-byte cache lines, 32-byte transactions).
Warp Utilization
For rows with very few non-zeros, warps are underutilized. In the warp-per-row variant, \(\text{active\_threads}_i = \min(32, \text{nnz}_i)\); when \(\text{active\_threads}_i < 32\), the remaining lanes of the warp are idle.
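This utilization estimate can be sketched on the host as follows (helper name is ours; it assumes the warp-per-row variant, where a row with \(\text{nnz}_i\) non-zeros keeps \(\min(32, \text{nnz}_i)\) of the warp's 32 lanes busy):

```cpp
#include <algorithm>
#include <vector>

// Mean fraction of active lanes across all rows for a warp-per-row kernel:
// each row contributes min(32, nnz_i) / 32.
double mean_warp_utilization(const std::vector<int>& row_nnz) {
  if (row_nnz.empty()) return 0.0;
  double total = 0.0;
  for (int nnz : row_nnz)
    total += std::min(32, nnz) / 32.0;   // active_threads_i / 32
  return total / row_nnz.size();
}
```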
Relevance to This Project
Our v1 kernel directly follows this design. The thread-to-row mapping and coalesced memory access patterns are the baseline we optimize against in v2.
Key Formula: Effective Memory Bandwidth
\(BW_{\text{eff}} = \dfrac{\text{bytes}_{\text{read}} + \text{bytes}_{\text{written}}}{t_{\text{kernel}}}\)
where \(t_{\text{kernel}}\) is the kernel execution time in seconds.
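A hedged worked example of this formula, using the usual CSR traffic model (nnz values plus nnz column indices, one row_ptr read and one y write per row; x reads are omitted because they may hit in cache — the model is an assumption, not from the paper):

```cpp
#include <cstddef>

// Effective bandwidth in GB/s: bytes moved by a CSR SpMV divided by the
// measured kernel time. Assumes 8-byte values and 4-byte indices.
double effective_bandwidth_gbs(std::size_t m, std::size_t nnz,
                               double t_kernel_seconds) {
  const std::size_t bytes =
      nnz * (sizeof(double) + sizeof(int))    // values + col_index
      + m * (sizeof(int) + sizeof(double));   // row_ptr reads + y writes
  return bytes / t_kernel_seconds / 1e9;
}
```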
2. Greathouse & Daga (SC ‘14) — CSR-Adaptive
Reference: Greathouse, J. L., & Daga, M. (2014). Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. SC '14.
Core Contribution Addressed load imbalance in CSR SpMV for irregular matrices (wide variance in row lengths). Proposed adaptive row grouping to improve warp utilization.
Problem: Warp Divergence
With simple row-to-thread mapping, a warp processing 32 consecutive rows may encounter row lengths that differ by orders of magnitude (e.g., a 2-entry row next to a 10,000-entry row in a power-law matrix).
Threads that finish short rows sit idle until the thread assigned the longest row completes, wasting GPU resources.
CSR-Adaptive Row Grouping
Group rows by workload so that each warp processes rows of similar length:
Group construction:
Compute row lengths \(\text{row\_len}_i\)
Sort rows by length (or use histogram-based binning)
Assign 32 consecutive rows of similar length to each warp
The offset array row_ptr is augmented with warp-start markers.
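The group construction above can be sketched on the host as follows (a simplified sort-based variant; the function name and the choice to sort rather than bin by histogram are ours):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sort row ids by row length, then cut the sorted order into groups of 32,
// so each warp receives rows of similar length.
std::vector<std::vector<int>> group_rows_by_length(
    const std::vector<int>& row_ptr) {
  const int m = static_cast<int>(row_ptr.size()) - 1;
  std::vector<int> order(m);
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return (row_ptr[a + 1] - row_ptr[a]) < (row_ptr[b + 1] - row_ptr[b]);
  });
  std::vector<std::vector<int>> warps;
  for (int start = 0; start < m; start += 32)
    warps.emplace_back(order.begin() + start,
                       order.begin() + std::min(start + 32, m));
  return warps;
}
```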
Key Formula: Load Balancing Efficiency
\(\eta_{\text{load}} = \dfrac{\operatorname{mean}_w(\text{work}_w)}{\max_w(\text{work}_w)}\), where \(\text{work}_w\) is the number of non-zeros assigned to warp \(w\).
For uniform row lengths, \(\eta_{\text{load}} \approx 1.0\). For power-law distributions, naive CSR gives \(\eta_{\text{load}} \ll 1\).
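The mean-over-max efficiency metric is easy to compute directly (helper name is ours, not from the paper):

```cpp
#include <algorithm>
#include <vector>

// Load-balancing efficiency: mean per-warp work divided by max per-warp
// work, where work = non-zeros assigned to the warp. 1.0 means perfectly
// balanced; values << 1 indicate a few overloaded warps.
double load_balance_efficiency(const std::vector<int>& warp_work) {
  if (warp_work.empty()) return 0.0;
  double total = 0.0;
  int max_work = 0;
  for (int w : warp_work) {
    total += w;
    max_work = std::max(max_work, w);
  }
  if (max_work == 0) return 1.0;   // no work at all: trivially balanced
  return (total / warp_work.size()) / max_work;
}
```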
Relevance to This Project
Our v2 kernel incorporates adaptive row grouping to handle the irregular SuiteSparse matrices in our evaluation dataset. This is the primary optimization for the “irregular” matrices in our benchmark.
3. Liu & Vinter (ICS ‘15) — CSR5
Reference: Liu, W., & Vinter, B. (2015). CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ‘15.
Core Contribution Proposed CSR5, a storage format and algorithm that achieves load balancing without preprocessing the row structure. Uses a segmented prefix sum approach.
CSR5 Data Layout
CSR5 adds two auxiliary arrays:
tile_ptr: pointer to the start of each tile (a fixed-size group of non-zeros)
tile_desc: per-tile metadata (offset, length, etc.)
Tile size is chosen to be a power of 2 for efficient bitwise operations.
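A host-side sketch of building a CSR5-style tile_ptr (the tile_desc metadata is omitted; the function name is ours): the non-zero array is cut into fixed-size tiles, and tile_ptr[t] records which row the first non-zero of tile t falls in.

```cpp
#include <algorithm>
#include <vector>

// For each tile of `tile_size` non-zeros, record the row containing its
// first non-zero: the last row whose start offset is <= the tile's first
// non-zero index.
std::vector<int> build_tile_ptr(const std::vector<int>& row_ptr,
                                int tile_size) {
  const int nnz = row_ptr.back();
  std::vector<int> tile_ptr;
  for (int k = 0; k < nnz; k += tile_size) {
    int row = static_cast<int>(
        std::upper_bound(row_ptr.begin(), row_ptr.end(), k) -
        row_ptr.begin()) - 1;
    tile_ptr.push_back(row);
  }
  return tile_ptr;
}
```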
Key Algorithm: Segmented Prefix Sum
A parallel prefix sum combines the partial products within each tile, but with segment boundaries wherever a row ends, so partial sums never cross rows.
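A serial CPU sketch of the segmented prefix sum semantics (CSR5 performs this within each tile using SIMD/warp primitives; this version only shows what is computed):

```cpp
#include <cstddef>
#include <vector>

// Segmented inclusive prefix sum: the running sum resets whenever a segment
// flag marks the start of a new row, so sums never cross row boundaries.
std::vector<double> segmented_prefix_sum(const std::vector<double>& vals,
                                         const std::vector<bool>& seg_start) {
  std::vector<double> out(vals.size());
  double running = 0.0;
  for (std::size_t k = 0; k < vals.size(); ++k) {
    if (seg_start[k]) running = 0.0;   // new row: reset the accumulator
    running += vals[k];
    out[k] = running;
  }
  return out;
}
```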
Tile Size Selection
The tile size (a power of two, as above) balances two competing pressures:
Enough rows per tile for coalesced access
Few enough rows per tile for load balancing
Relevance to This Project
CSR5’s segmented prefix sum influenced our v2 kernel’s dynamic row grouping. The tile-based view provides an alternative to Greathouse’s histogram approach.
4. Chu et al. (HPDC ‘23) — Ampere-Aware Optimization
Reference: Chu, G., et al. (2023). Efficient Algorithm Design of Optimizing SpMV on GPU. HPDC ‘23.
Core Contribution Systematic optimization targeting NVIDIA Ampere architecture (RTX 30 series, A100). Covers warp scheduling, memory access patterns, and occupancy tuning.
Ampere Memory Hierarchy
| Level | Latency | Bandwidth |
|---|---|---|
| L1 cache | ~1–3 cycles | ~12,288 GB/s |
| L2 cache | ~40 cycles | ~2,000 GB/s |
| Global mem | ~400 cycles | ~1,500 GB/s |
SpMV is memory-bound, so optimization focus is on:
Reducing global memory accesses
Improving L1/L2 cache hit rates
Using non-coherent loads for read-only data
Non-Coherent Global Loads
For the input vector x (read-only, reused across rows):
x_val = __ldg(&x[col_index[k]]); // routed through the read-only data cache, avoiding coherency overhead
The __ldg intrinsic generates ld.global.nc (non-coherent global load).
Key Formula: Occupancy vs. Performance
\(\text{occupancy} = \dfrac{\text{active warps per SM}}{\text{maximum warps per SM}}\)
Higher occupancy hides memory latency better by giving the scheduler more warps to switch between:
Recommended block sizes for CSR SpMV on Ampere:
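A hedged occupancy estimator considering only thread-count limits (registers and shared memory are ignored; the per-SM limits below are assumptions matching a GA10x Ampere SM: 1536 resident threads, 48 resident warps, 16 resident blocks):

```cpp
#include <algorithm>

// Occupancy = active warps per SM / maximum warps per SM, where the number
// of resident blocks is capped by both the thread limit and the block limit.
double estimate_occupancy(int block_size,
                          int max_threads_per_sm = 1536,
                          int max_warps_per_sm = 48,
                          int max_blocks_per_sm = 16) {
  const int warps_per_block = (block_size + 31) / 32;
  const int blocks = std::min(max_threads_per_sm / block_size,
                              max_blocks_per_sm);
  const int active_warps = std::min(blocks * warps_per_block,
                                    max_warps_per_sm);
  return static_cast<double>(active_warps) / max_warps_per_sm;
}
```

Note how small blocks run into the resident-block cap: 32-thread blocks leave two thirds of the warp slots empty under these assumed limits.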
Dynamic Warp Scheduling
Ampere’s warp schedulers can issue instructions from any resident warp with ready operands, so for irregular CSR, warps stalled on long rows are overlapped with warps that still have work to issue.
Relevance to This Project
Our v2 kernel directly incorporates Ampere-aware optimizations:
non-coherent loads for x, occupancy-tuned block size, and dynamic warp
scheduling for load balancing.
5. Merrill & Garland (SC ‘16) — Merge-Based SpMV
Reference: Merrill, D., & Garland, M. (2016). Merge-based Parallel Sparse Matrix-Vector Multiplication. SC ‘16.
Core Contribution Used merge path parallelism (originally for graph algorithms) to improve load balancing for irregular matrices without preprocessing.
Merge Path SpMV
Treat CSR SpMV as the merge of two sorted sequences:
The row end offsets row_ptr[1..m] (one entry per row)
The indices 0, 1, …, nnz − 1 of the non-zero values
The merge path divides work among threads with minimal synchronization.
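The split search at the heart of merge path can be sketched as follows (names are ours; the merged list of row end offsets and non-zero indices has length m + nnz, and each thread binary-searches its diagonal d to find how many rows and non-zeros precede its slice):

```cpp
#include <algorithm>
#include <vector>

struct Split { int row; int nz; };   // rows and non-zeros consumed so far

// Find the split of diagonal d in the merge of row end offsets
// (row_ptr[1..m]) with the non-zero indices 0..nnz-1: the largest number of
// rows i such that row_ptr[i] <= d - i.
Split merge_path_search(const std::vector<int>& row_ptr, int d) {
  const int m = static_cast<int>(row_ptr.size()) - 1;
  const int nnz = row_ptr.back();
  int lo = std::max(0, d - nnz), hi = std::min(d, m);
  while (lo < hi) {
    int mid = (lo + hi + 1) / 2;
    if (row_ptr[mid] <= d - mid) lo = mid; else hi = mid - 1;
  }
  return {lo, d - lo};
}
```

Assigning thread t the diagonals [t·c, (t+1)·c) with c = ⌈(m + nnz)/p⌉ gives every thread the same number of merge items regardless of how skewed the row lengths are.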
Key Formula: Merge Path Load Balance
Each of \(p\) threads receives \(\lceil (m + \text{nnz})/p \rceil\) merge items, independent of the row-length variance \(\sigma\).
Compared to naive row partitioning, whose worst-case per-thread work grows with \(\sigma\), merge path bounds per-thread work exactly.
Relevance to This Project
Merge path parallelism provides a theoretical upper bound for load balancing efficiency. Our v2 kernel’s adaptive grouping approaches this bound.
6. Niu et al. (IPDPS 2021) — Tiled SpMV
Reference: Niu, Y., et al. (2021). TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs. IPDPS 2021.
Core Contribution Exploited local structure in sparse matrices (dense subregions) using tile-based organization. Tiles with dense internal structure use optimized inner kernels.
Tile Classification
Tiles are classified as:
Sparse tile: \(\text{tile\_density}_t < \tau\) (threshold)
Dense tile: \(\text{tile\_density}_t \geq \tau\)
Per-Tile Kernel Selection
Dense tiles use vectorized dense inner products; sparse tiles fall back to standard CSR.
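The classification rule can be sketched directly (names are ours; the threshold \(\tau\) is matrix-dependent and chosen by the implementation, an assumption here):

```cpp
enum class TileKind { Sparse, Dense };

// A tile covering rows x cols entries with nnz_t stored non-zeros is Dense
// when its density nnz_t / (rows * cols) reaches the threshold tau, and
// would be dispatched to a vectorized dense micro-kernel; otherwise it is
// handled by a CSR-style sparse kernel.
TileKind classify_tile(int nnz_t, int rows, int cols, double tau) {
  const double density = static_cast<double>(nnz_t) / (rows * cols);
  return density >= tau ? TileKind::Dense : TileKind::Sparse;
}
```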
Relevance to This Project
If our SuiteSparse matrices contain dense subregions, tiled SpMV could be applied as a hybrid optimization for our v2 kernel. The tile concept complements CSR5’s tile_ptr approach.
7. Gao et al. (2024) — Systematic Literature Survey
Reference: Gao, J., Liu, B., Ji, W. and Huang, H. (2024). A Systematic Literature Survey of Sparse Matrix-Vector Multiplication. arXiv:2404.06047.
Core Contribution Comprehensive survey of SpMV research (2010–2024), categorizing optimizations by:
Storage format (CSR, COO, ELL, HYB, JDS, SELL-C, CSR5)
Hardware target (GPU, CPU, accelerator, distributed)
Matrix structure (regular, irregular, power-law, block-structured)
Taxonomy of GPU SpMV Optimizations
| Category | Techniques |
|---|---|
| Memory | Coalescing, tiling, prefetching, compression |
| Compute | Vectorization, warp tiling, unrolling |
| Load balancing | Row grouping, dynamic scheduling, merge path |
| Format | Hybrid formats, adaptive format selection |
Relevance to This Project
This survey provides the broader context for our design decisions. Our v1/v2 kernel classification aligns with the “basic CSR” → “CSR-Adaptive” trajectory in the literature.
Summary Table: Papers vs. Optimizations

| Paper | Technique summarized | Where it appears in this project |
|---|---|---|
| Bell & Garland (2009) | Thread-per-row CSR, coalesced access | v1 baseline kernel |
| Greathouse & Daga (2014) | Adaptive row grouping | v2 kernel |
| Liu & Vinter (2015) | Segmented prefix sum, tile layout | v2 dynamic row grouping |
| Chu et al. (2023) | Non-coherent loads, occupancy tuning | v2 Ampere optimizations |
| Merrill & Garland (2016) | Merge path partitioning | Load-balance bound for v2 |
| Niu et al. (2021) | Tiled dense/sparse hybrid kernels | Possible v2 extension |
| Gao et al. (2024) | Survey and taxonomy | Design context |
References
Bell, N., & Garland, M. (2009). Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. SC '09; earlier version as NVIDIA Technical Report NVR-2008-004, Efficient Sparse Matrix-Vector Multiplication on CUDA.
Greathouse, J. L., & Daga, M. (2014). Efficient Sparse Matrix-Vector Multiplication on GPUs Using the CSR Storage Format. SC '14.
Liu, W., & Vinter, B. (2015). CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ‘15.
Chu, G., et al. (2023). Efficient Algorithm Design of Optimizing SpMV on GPU. HPDC ‘23.
Merrill, D., & Garland, M. (2016). Merge-based Parallel Sparse Matrix-Vector Multiplication. SC ‘16.
Niu, Y., et al. (2021). TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs. IPDPS 2021.
Gao, J., Liu, B., Ji, W. & Huang, H. (2024). A Systematic Literature Survey of Sparse Matrix-Vector Multiplication. arXiv:2404.06047.