GPU Kernels

This section documents our GPU SpMV kernel implementations, including the algorithm descriptions, mathematical frameworks, and paper references.


1. v1 — Basic CSR Row-Parallel Kernel

Paper Reference: Bell, N., & Garland, M. (2009). Efficient Sparse-Matrix-Vector Multiplication on GPUs. NVIDIA Technical Report NVR-2008-004 / SC ‘09.

Algorithm Description

The v1 kernel implements the canonical thread-per-row mapping from Bell & Garland ‘09. Each CUDA thread processes exactly one matrix row.

Thread Mapping

\[\text{thread\_id} = \text{block\_idx} \times \text{block\_dim} + \text{thread\_idx}\]

Row \(i\) is assigned to thread \(i\). The thread computes:

\[y_i = \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} \text{values}[k] \cdot x_{\text{col\_index}[k]}\]
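For validation, the per-row sum above can be mirrored on the host; a minimal C++ reference (a hypothetical helper, not part of the kernels) to compare against kernel output:

```cpp
#include <vector>

// Host-side CSR SpMV reference: y[i] = sum over row i of values[k] * x[col_index[k]].
// Mirrors the per-row summation each CUDA thread performs in the v1 kernel.
std::vector<double> spmv_csr_host(const std::vector<int>& row_ptr,
                                  const std::vector<int>& col_index,
                                  const std::vector<double>& values,
                                  const std::vector<double>& x) {
    int num_rows = (int)row_ptr.size() - 1;
    std::vector<double> y(num_rows, 0.0);
    for (int i = 0; i < num_rows; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += values[k] * x[col_index[k]];
        y[i] = sum;
    }
    return y;
}
```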

CSR Kernel Algorithm

__global__ void spmv_csr_kernel(const int* row_ptr,
                                const int* col_index,
                                const double* values,
                                const double* x,
                                double* y,
                                int num_rows) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_rows) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i+1]; ++k) {
            sum += values[k] * x[col_index[k]];
        }
        y[i] = sum;
    }
}

Memory Access Patterns

Coalesced access to values and col_index:

\[\text{thread}_j \text{ accesses} \rightarrow \text{values}[\text{base} + j]\]

Adjacent threads access adjacent memory locations, triggering coalesced global memory loads on Ampere (128-byte cache lines, 32-byte transactions).

Gather access to x:

\[x\_index = \text{col\_index}[k]\]

Accesses depend on column index distribution. Random distributions cause uncoalesced reads.
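The effect can be modeled on the host by counting how many distinct 128-byte cache lines one warp touches during the gather; a sketch (the helper name is illustrative):

```cpp
#include <set>
#include <vector>

// Model of gather coalescing: count the distinct 128-byte cache lines
// (16 doubles each) that one warp of 32 threads touches when reading
// x[col_index[k]]. Fewer lines per warp means better coalescing.
// (Host-side model, not a measurement.)
int cache_lines_touched(const std::vector<int>& warp_col_indices) {
    std::set<int> lines;
    for (int idx : warp_col_indices)
        lines.insert(idx / 16);  // 16 doubles per 128-byte line
    return (int)lines.size();
}
```

Sequential indices 0..31 touch only 2 lines, while a stride-16 pattern touches 32 lines, one full transaction per thread.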

Effective Memory Bandwidth

\[B_{\text{eff}} = \frac{2 \cdot \text{nnz} \cdot \text{sizeof}(\text{double})}{t_{\text{kernel}}} = \frac{16 \cdot \text{nnz}}{10^{6} \cdot t_{\text{ms}}} \quad \text{[GB/s]}\]

where \(t_{\text{kernel}}\) is the kernel execution time in seconds and \(t_{\text{ms}}\) is the same time in milliseconds. The factor 2 counts the values element and the gathered x element per non-zero; index traffic and the y write are ignored in this simplified model.
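A small helper makes the unit handling explicit; a sketch under the same simplified 16-bytes-per-nnz traffic model:

```cpp
#include <cstdint>

// Effective bandwidth for double-precision CSR SpMV: each non-zero moves one
// 8-byte matrix value plus (via the gather) one 8-byte x element, i.e.
// 16 bytes per nnz. Index traffic and y writes are ignored.
double effective_bandwidth_gbs(std::int64_t nnz, double t_ms) {
    double bytes = 16.0 * (double)nnz;  // 2 * nnz * sizeof(double)
    double seconds = t_ms * 1e-3;       // kernel time in seconds
    return bytes / seconds / 1e9;       // GB/s
}
```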

Limitations

  1. Load imbalance: Warp divergence when rows have vastly different lengths

    \[\text{row\_len}_i = \text{row\_ptr}[i+1] - \text{row\_ptr}[i]\]

    A warp processing 32 consecutive rows may encounter row lengths from 1 to 1000+.

  2. Low warp utilization: When neighboring rows have fewer non-zeros than the warp's longest row, the threads handling the short rows sit idle until the longest row finishes.

  3. No shared memory caching: Every access to x goes to global memory.

Kernel Launch Configuration

int block_size = 256;  // multi-warp block
int grid_size = (num_rows + block_size - 1) / block_size;
spmv_csr_kernel<<<grid_size, block_size>>>(...);

Recommended block sizes for Ampere:

\[\text{block\_size} \in \{128, 256, 512\} \quad \text{(multiples of 32)}\]

Relevance

The v1 kernel serves as the baseline implementation. It demonstrates the fundamental challenges in GPU SpMV (load imbalance, memory access patterns) that our v2 kernel addresses through optimizations.


2. v2 — Optimized Kernel

Paper References:

  • Greathouse, J. L., & Daga, M. (2014). Efficient Sparse Matrix-Vector Multiplication on GPUs Using a CSR-Adaptive Storage Format. SC ‘14.

  • Liu, W., & Vinter, B. (2015). CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ‘15.

  • Chu, G., et al. (2023). Efficient Algorithm Design of Optimizing SpMV on GPU. HPDC ‘23.

Overview

The v2 kernel incorporates multiple optimizations targeting the Ampere architecture. It combines adaptive row grouping for load balancing, shared memory tiling for the input vector, and non-coherent loads for efficient memory access.

Key Optimizations

  1. Adaptive Row Grouping (Greathouse & Daga ‘14)

  2. Shared Memory Tiling (Bell & Garland ‘09)

  3. Non-Coherent Loads via __ldg (Chu et al. ‘23)

  4. Tile-Based Load Balancing (Liu & Vinter ‘15)

Adaptive Row Grouping

Problem: Warp divergence from variable row lengths.

\[\text{row\_len}_i = \text{row\_ptr}[i+1] - \text{row\_ptr}[i]\]

With simple row-to-thread mapping, a warp may process rows with lengths varying from 1 to 1000+, causing severe load imbalance.

Solution: Group rows by workload into warps of similar rows:

\[\text{warp\_size} = 32 \quad \text{(one warp per group)}\]

Group construction:

  1. Compute row lengths \(\text{row\_len}_i\)

  2. Sort rows by length (or use histogram-based binning)

  3. Assign 32 consecutive rows of similar length to each warp
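The steps above can be sketched as host-side preprocessing (a simplified sort-based variant; a real implementation would also carry the permutation into the kernel so that y is written to the correct rows):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Adaptive row grouping sketch: order rows by length so that each group of
// 32 consecutive (reordered) rows -- one warp's worth -- has similar work.
// Returns the row permutation; warp w handles rows perm[32*w .. 32*w+31].
std::vector<int> group_rows_by_length(const std::vector<int>& row_ptr) {
    int num_rows = (int)row_ptr.size() - 1;
    std::vector<int> perm(num_rows);
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        return (row_ptr[a + 1] - row_ptr[a]) < (row_ptr[b + 1] - row_ptr[b]);
    });
    return perm;
}
```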

Load Balancing Efficiency:

\[\eta_{\text{load}} = \frac{\sum_{i} \text{row\_len}_i}{\text{num\_warps} \times 32 \times \max_i \text{row\_len}_i}\]

For uniform row lengths, \(\eta_{\text{load}} \approx 1.0\). For power-law distributions, naive CSR gives \(\eta_{\text{load}} \ll 1\).
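The metric is easy to evaluate for a given row-length distribution; a sketch in which the denominator counts issued warp-iterations (num_warps × 32 × max row length):

```cpp
#include <algorithm>
#include <vector>

// Load-balancing efficiency: useful work (total non-zeros) divided by the
// warp-iterations issued when every warp runs as long as the longest row.
double load_efficiency(const std::vector<int>& row_len, int warp_size = 32) {
    long long total = 0;
    int max_len = 0;
    for (int len : row_len) {
        total += len;
        max_len = std::max(max_len, len);
    }
    long long num_warps =
        ((long long)row_len.size() + warp_size - 1) / warp_size;
    return (double)total / (double)(num_warps * warp_size * (long long)max_len);
}
```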

Shared Memory Tiling

Problem: Repeated accesses to the input vector x go to global memory.

Solution: Cache x in shared memory. Each block loads a contiguous chunk:

\[\text{shared\_mem}[i] = x[\text{block\_id} \times \text{SHARED\_ELEMENTS} + i]\]

Memory latency comparison:

Level         Latency
L1 cache      ~1–3 cycles
L2 cache      ~40 cycles
Global mem    ~400 cycles
Shared mem    ~1–10 ns

Shared memory latency is ~1–10 ns vs. global memory ~100–400 ns.

Non-Coherent Loads (__ldg)

For x elements outside the shared memory range, use the __ldg intrinsic:

double x_val = __ldg(&x[col_index[k]]);

This generates ld.global.nc (non-coherent global load), which:

  • Is served by the read-only (texture) data cache path, avoiding coherence overhead for data that is never written

  • Still uses L2 cache for reuse

  • Ideal for read-only data accessed multiple times

Tile-Based Load Balancing

Inspired by Liu & Vinter’s CSR5, the v2 kernel uses tiles for dynamic work distribution:

\[\mathbf{A} \rightarrow \text{tiles} \quad \text{(power-of-2 row groups)}\]

Tile size is chosen to balance:

  1. Enough rows per tile for coalesced access

  2. Few enough rows per tile for load balancing

\[\text{tile\_size} = 2^{\lceil \log_2(\sqrt{\text{nnz}}) \rceil}\]
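The formula amounts to "the next power of two at or above \(\sqrt{\text{nnz}}\)"; a sketch:

```cpp
#include <cstdint>

// tile_size = 2^ceil(log2(sqrt(nnz))): the smallest power of two t with
// t >= sqrt(nnz), balancing coalescing (larger tiles) against load
// balancing (smaller tiles).
int tile_size_for(std::int64_t nnz) {
    int t = 1;
    while ((std::int64_t)t * t < nnz)  // t < sqrt(nnz)  <=>  t*t < nnz
        t <<= 1;
    return t;
}
```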

Occupancy Tuning (Ampere)

Occupancy formula:

\[\text{occupancy} = \frac{\text{active\_warps}}{\text{max\_warps\_per\_SM}}\]

Higher occupancy hides memory latency:

\[\text{latency\_hidden} = \text{occupancy} \times \text{latency}_{\text{mem}}\]
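A simplified occupancy estimate considering only the shared-memory limit (registers and block-count limits are ignored; the per-SM figures are illustrative assumptions, and real values vary by chip and are reported by cudaGetDeviceProperties):

```cpp
#include <algorithm>

// Occupancy = active_warps / max_warps_per_SM, where the number of resident
// blocks is capped by how many copies of the block's shared memory fit
// in the SM's shared memory. Default limits are illustrative Ampere-class
// assumptions, not queried hardware values.
double occupancy_estimate(int block_size, int smem_per_block_bytes,
                          int smem_per_sm_bytes = 64 * 1024,
                          int max_warps_per_sm = 48) {
    int warps_per_block = (block_size + 31) / 32;
    int resident_blocks = smem_per_block_bytes > 0
                              ? smem_per_sm_bytes / smem_per_block_bytes
                              : max_warps_per_sm / warps_per_block;
    int active_warps =
        std::min(resident_blocks * warps_per_block, max_warps_per_sm);
    return (double)active_warps / (double)max_warps_per_sm;
}
```

With 256-thread blocks and 32 KB of shared memory per block, only two blocks fit per SM, capping occupancy at 16/48 warps under these assumptions.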

Ampere memory hierarchy:

Level         Latency        Bandwidth
L1 cache      ~1–3 cycles    ~12,288 GB/s
L2 cache      ~40 cycles     ~2,000 GB/s
Global mem    ~400 cycles    ~1,500 GB/s

Shared Memory Configuration

Default shared memory per block:

\[\text{SHARED\_MEM\_SIZE} = 32 \ \text{KB} = 4096 \ \text{doubles}\]

Ampere limit: 48 KB per block by default; larger dynamic allocations require opting in via cudaFuncAttributeMaxDynamicSharedMemorySize, with the maximum depending on the chip.

Algorithm Summary

The v2 kernel processes tiles with:

  1. Load balancing: Rows grouped by similar nnz per warp

  2. Memory optimization: x cached in shared memory, __ldg fallback

  3. Coalesced access: Sequential access to values and col_index

  4. Occupancy tuning: Block size selected for maximum occupancy

Relevance

The v2 kernel addresses the limitations of v1 through:

  • Greathouse & Daga ‘14: Eliminates warp divergence via adaptive grouping

  • Liu & Vinter ‘15: Tile-based load balancing with low preprocessing overhead

  • Chu et al. ‘23: Efficient memory access via __ldg and shared memory


3. Performance Results

Performance Summary

Key Metrics

Effective Bandwidth:

\[B_{\text{eff}} = \frac{16 \cdot \text{nnz}}{10^{6} \cdot t_{\text{ms}}} \quad \text{[GB/s]}\]

GFLOP/s (2 FLOPs per non-zero):

\[\text{GFLOP/s} = \frac{2 \cdot \text{nnz}}{t_{\text{s}}} \times 10^{-9}\]
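A helper applying the 2-FLOPs-per-nnz convention; a sketch:

```cpp
#include <cstdint>

// Throughput in GFLOP/s: 2 FLOPs (one multiply, one add) per stored
// non-zero, divided by kernel time in seconds.
double gflops(std::int64_t nnz, double t_seconds) {
    return 2.0 * (double)nnz / t_seconds * 1e-9;
}
```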

Expected Improvements

For regular matrices (low row-length variance): - v2 ~1.2–1.5× faster than v1 (shared memory benefit)

For irregular matrices (power-law distribution): - v2 ~2–4× faster than v1 (adaptive row grouping benefit)

References

  • Bell, N., & Garland, M. (2009). Efficient Sparse-Matrix-Vector Multiplication on GPUs. NVIDIA Technical Report NVR-2008-004 / SC ‘09.

  • Greathouse, J. L., & Daga, M. (2014). Efficient Sparse Matrix-Vector Multiplication on GPUs Using a CSR-Adaptive Storage Format. SC ‘14.

  • Liu, W., & Vinter, B. (2015). CSR5: An Efficient Storage Format for Cross-Platform Sparse Matrix-Vector Multiplication. ICS ‘15.

  • Chu, G., et al. (2023). Efficient Algorithm Design of Optimizing SpMV on GPU. HPDC ‘23.