CPU SpMV Implementations

Overview

Sparse Matrix-Vector Multiplication (SpMV) is the fundamental kernel in iterative linear solvers and eigenvalue methods. Given a sparse matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\), an input dense vector \(\mathbf{x} \in \mathbb{R}^{n}\), and an output dense vector \(\mathbf{y} \in \mathbb{R}^{m}\), SpMV computes:

\[\mathbf{y} = \mathbf{A} \cdot \mathbf{x}\]

Element-wise, this is:

\[y_i = \sum_{j=0}^{n-1} a_{ij} \cdot x_j \quad \forall i \in [0, m)\]

However, for a sparse matrix where \(\text{nnz} \ll m \times n\), we only iterate over non-zero entries.

SpMV in CSR Format

For a CSR-format matrix (see CSR (Compressed Sparse Row) Format in sparse_matrix.rst), the SpMV operation becomes:

\[y_i = \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} \text{values}[k] \cdot x_{\text{col\_index}[k]}\]

This is the canonical SpMV algorithm: for each row \(i\), accumulate the dot product of the row’s non-zeros with the corresponding \(x\) entries.
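
To make the indexing concrete, here is a self-contained sketch of the formula above on a hand-built \(3 \times 3\) example (plain std::vector stands in for the library's CSR type; the names row_ptr, col_index, and values mirror the layout described in sparse_matrix.rst):

```cpp
#include <cstdint>
#include <vector>

// CSR SpMV exactly as in the formula above: for each row i, accumulate
// values[k] * x[col_index[k]] over k in [row_ptr[i], row_ptr[i+1]).
std::vector<double> spmv_csr(const std::vector<int64_t>& row_ptr,
                             const std::vector<int64_t>& col_index,
                             const std::vector<double>& values,
                             const std::vector<double>& x) {
    std::vector<double> y(row_ptr.size() - 1, 0.0);
    for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
        double sum = 0.0;
        for (int64_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += values[k] * x[col_index[k]];
        y[i] = sum;
    }
    return y;
}
```

For the matrix with rows (10, 0, 2), (0, 3, 0), (4, 0, 5), the CSR arrays are row_ptr = {0, 2, 3, 5}, col_index = {0, 2, 1, 0, 2}, values = {10, 2, 3, 4, 5}; with x = (1, 2, 3) the result is y = (16, 6, 19).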

Computational Complexity

The time complexity of SpMV is:

\[T_{\text{SpMV}} = \Theta(\text{nnz})\]

This is optimal: every non-zero must be visited at least once.

Arithmetic Intensity

Arithmetic intensity (FLOPs per byte of memory traffic) characterizes the compute vs. memory balance:

For each non-zero, we perform:

  • 1 multiplication: \(v \cdot x_j\)

  • 1 addition: accumulate into \(y_i\)

Total: 2 FLOPs per non-zero

Memory traffic (for CSR on CPU, streaming through once):

\[\text{reads} = \underbrace{\text{nnz} \cdot 8}_{\text{values}} + \underbrace{\text{nnz} \cdot 8}_{\text{col\_index}} + \underbrace{\text{nnz} \cdot 8}_{x\text{ (one read per non-zero; often cached)}} = 24 \cdot \text{nnz} \ \text{bytes}\]

\[\text{writes} = \underbrace{m \cdot 8}_{y\text{ (output)}} \ \text{bytes} \quad (\text{negligible relative to reads when } \text{nnz} \gg m)\]

For typical sparse matrices with balanced rows:

\[\text{bytes per nnz} \approx \frac{24 \cdot \text{nnz}}{\text{nnz}} = 24 \ \text{B/nnz}\]

\[\text{AI} = \frac{2 \cdot \text{nnz}}{24 \cdot \text{nnz}} = \frac{1}{12} \approx 0.083 \ \text{FLOPs/byte}\]

This is memory-bound (\(\text{AI} \ll 1\)), not compute-bound. Optimization effort should therefore focus on maximizing memory bandwidth utilization, not on reducing FLOP count.
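
The read-traffic accounting above can be captured in a small helper (a sketch; the function name is illustrative, not part of the library API):

```cpp
#include <cstdint>

// Arithmetic intensity of CSR SpMV under the model above:
// 2 FLOPs per non-zero vs. 24 bytes of reads per non-zero
// (8 B value + 8 B column index + 8 B read of x).
double spmv_arithmetic_intensity(int64_t nnz) {
    double flops = 2.0 * nnz;   // one multiply + one add per non-zero
    double bytes = 24.0 * nnz;  // values + col_index + x
    return flops / bytes;       // = 1/12, about 0.083 FLOPs/byte
}
```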

Serial Implementation (spmv_cpu_serial)

The serial implementation uses a single thread with straightforward row-wise iteration:

void spmv_cpu_serial(const SparseMatrix& A, const DenseVector& x, DenseVector& y) {
    y.resize(A.rows);
    for (int64_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (int64_t k = A.row_ptr[i]; k < A.row_ptr[i+1]; ++k) {
            sum += A.values[k] * x.data[A.col_index[k]];
        }
        y.data[i] = sum;
    }
}

Correctness Argument

Theorem: The serial implementation computes \(\mathbf{y} = \mathbf{A} \cdot \mathbf{x}\) exactly.

Proof: For each row \(i\):

  1. The inner loop iterates over \(k \in [\text{row\_ptr}[i], \text{row\_ptr}[i+1])\)

  2. For each \(k\), it adds \(\text{values}[k] \cdot x_{\text{col\_index}[k]}\)

  3. By the CSR invariant, this is precisely:

    \[ \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} a_{i, \text{col\_index}[k]} \cdot x_{\text{col\_index}[k]} = \sum_{j=0}^{n-1} a_{ij} \cdot x_j\]

    since \(a_{ij} = 0\) for all \(j \notin \{\text{col\_index}[k]\}\) in the range.

  4. The result is written to y[i].

Since the argument applies independently to each row \(i = 0, \ldots, m-1\), we have \(y_i = (\mathbf{A} \cdot \mathbf{x})_i\) for all \(i\). ∎

IEEE-754 Determinism

The serial implementation is deterministic because:

  1. The accumulation order is fixed: left-to-right within each row

  2. IEEE-754 floating-point addition is deterministic

  3. No data races (single thread)

Lemma: The accumulation order within a row is exactly the order of entries in the CSR values array, which is file order (Matrix Market) preserved through COO→CSR conversion.

OpenMP Parallel Implementation (spmv_cpu_omp)

The parallel implementation uses OpenMP to distribute rows across CPU threads:

void spmv_cpu_omp(const SparseMatrix& A, const DenseVector& x, DenseVector& y) {
    y.resize(A.rows);
    #pragma omp parallel for schedule(static)
    for (int64_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (int64_t k = A.row_ptr[i]; k < A.row_ptr[i+1]; ++k) {
            sum += A.values[k] * x.data[A.col_index[k]];
        }
        y.data[i] = sum;
    }
}

Parallelization Strategy

Row-level parallelism: Each row \(i\) is independent, so we distribute rows across threads using static scheduling.

Why Static Schedule?

The work per row varies: \(\text{row\_ptr}[i+1] - \text{row\_ptr}[i]\) is the number of non-zeros in row \(i\). For random matrices, row lengths are often well approximated by a Poisson distribution.

Static scheduling is nevertheless the right choice for a simple streaming kernel like SpMV:

  1. Load imbalance is acceptable: The variation in row lengths averages out over the thousands of rows typical in sparse matrices

  2. No scheduling overhead: static scheduling has near-zero overhead

  3. Cache locality: Consecutive iterations access consecutive memory, enabling hardware prefetching

Dynamic scheduling would introduce more overhead (work queue management, atomic operations) than it would save for this memory-bound kernel.
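
A sketch of how the default schedule(static) chunking typically maps rows to threads (the ceil-division chunk size below is a common implementation choice; the OpenMP specification only requires chunks of approximately equal size, at most one per thread):

```cpp
#include <cstdint>

// Which thread owns row i under a typical schedule(static) partition:
// thread t processes one contiguous block of about rows/num_threads rows.
int64_t owner_of_row(int64_t i, int64_t rows, int64_t num_threads) {
    int64_t chunk = (rows + num_threads - 1) / num_threads;  // ceil(rows/threads)
    return i / chunk;
}
```

Consecutive rows land on the same thread, which is what gives the cache-locality and prefetching benefit noted above.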

Why Not a Reduction Clause?

A naive OpenMP reduction:

#pragma omp parallel for reduction(+ : y.data[:A.rows])  // WRONG

A reduction combines per-thread partial sums in an unspecified order across threads. For floating-point addition:

\[((a + b) + c) + d \neq (a + b) + (c + d) \quad \text{in IEEE-754}\]

Different thread interleavings produce different rounding errors, resulting in bitwise non-deterministic output.
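
The non-associativity claim is easy to verify directly; the classic 0.1/0.2/0.3 example changes its last bit when the sum is regrouped:

```cpp
// Left-to-right order vs. a regrouped (two-partial-sums) order over the
// same terms: IEEE-754 double addition rounds differently in each case.
bool grouping_changes_result() {
    double left  = (0.1 + 0.2) + 0.3;  // 0.6000000000000001
    double right = 0.1 + (0.2 + 0.3);  // 0.6
    return left != right;
}
```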

Our solution: private scalar accumulator per row:

double sum = 0.0;  // private to each thread
// ... accumulate into sum ...
y.data[i] = sum;   // single write; each index is owned by exactly one thread

Since each y[i] is written by exactly one thread (row \(i\) belongs to exactly one thread under static scheduling), there are no data races.

Correctness Argument

Theorem: spmv_cpu_omp produces bitwise identical output to spmv_cpu_serial.

Proof: We show that for every row \(i\), the computation is identical:

  1. Row \(i\) is assigned to exactly one thread under static scheduling

  2. That thread computes:

    \[\sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} \text{values}[k] \cdot x_{\text{col\_index}[k]}\]

    in exactly the same order as the serial version (sequential within thread)

  3. The result is written to y[i] once, with no concurrent writes

  4. No other thread reads or writes y[i]

Therefore, \(y_i^{\text{serial}} = y_i^{\text{omp}}\) for all \(i\). ∎

This bitwise equality is critical for the correctness test infrastructure.

Vector Utility Functions

void fill_zero(DenseVector &v)

Sets all entries of a vector to 0.0.

\[v_i = 0.0 \quad \forall i \in [0, v.\text{size})\]

void fill_constant(DenseVector &v, double val)

Sets all entries of a vector to a given value.

\[v_i = \text{val} \quad \forall i \in [0, v.\text{size})\]

double infnorm(const DenseVector &a, const DenseVector &b)

Computes the L-infinity norm of the difference vector \(\mathbf{a} - \mathbf{b}\):

\[\|\mathbf{a} - \mathbf{b}\|_\infty = \max_{i} |a_i - b_i|\]

Used to compare two SpMV outputs for correctness.

Tolerance: We consider two results “equal” if the L-inf error is below \(10^{-15}\) for double-precision arithmetic. This accounts for rounding differences in accumulation order while catching real bugs.

Parameters:
  • a – First vector (size \(N\))

  • b – Second vector (size \(N\)); must have same size as a

Returns:

\(\max_i |a_i - b_i|\)
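
A minimal sketch of infnorm over plain std::vector<double> (DenseVector is assumed to wrap a similar contiguous buffer):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Max-absolute-difference (L-infinity) norm of a - b.
// Both vectors must have the same size.
double infnorm(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double m = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}
```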

API Reference

void spmv_cpu_serial(const SparseMatrix &A, const DenseVector &x, DenseVector &y)

Sequential (single-threaded) CSR SpMV.

The canonical reference implementation. Runs on one CPU core. Used as the golden reference for all correctness tests.

Performance: \(\Theta(\text{nnz})\) work, memory-bandwidth-bound. Typical throughput: 5–15 GB/s on a modern CPU (DDR4 or DDR5).

Parameters:
  • A – Input matrix in CSR format, size \(m \times n\)

  • x – Input dense vector, size \(n\)

Param[out] y:

Output dense vector, size \(m\). Resized automatically.

void spmv_cpu_omp(const SparseMatrix &A, const DenseVector &x, DenseVector &y)

OpenMP-parallel CSR SpMV.

Parallel row-wise SpMV using OpenMP. Rows are distributed across threads in contiguous chunks under schedule(static).

Performance: Speedup scales with core count until memory bandwidth saturates. Typical: 3.5–4× speedup on a 4-core CPU.

Parameters:
  • A – Input matrix in CSR format, size \(m \times n\)

  • x – Input dense vector, size \(n\)

Param[out] y:

Output dense vector, size \(m\). Resized automatically.
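
The golden-reference comparison pattern used by the correctness tests can be sketched self-contained (plain vectors and a local Csr struct stand in for SparseMatrix/DenseVector; without -fopenmp the pragma is simply ignored and both calls run serially):

```cpp
#include <cstdint>
#include <vector>

struct Csr {
    int64_t rows;
    std::vector<int64_t> row_ptr, col_index;
    std::vector<double> values;
};

// Same kernel, optionally parallel. Because each row's accumulation order is
// identical in both modes, serial and parallel outputs must match exactly.
std::vector<double> spmv(const Csr& A, const std::vector<double>& x, bool use_threads) {
    std::vector<double> y(A.rows);
    #pragma omp parallel for schedule(static) if(use_threads)
    for (int64_t i = 0; i < A.rows; ++i) {
        double sum = 0.0;
        for (int64_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.values[k] * x[A.col_index[k]];
        y[i] = sum;
    }
    return y;
}

bool exactly_equal(const std::vector<double>& a, const std::vector<double>& b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) return false;
    return true;
}
```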

Performance Characterization

Cache Behavior

SpMV is a streaming operation with poor temporal locality:

  1. values and col_index: traversed once, in order (good spatial locality)

  2. x.data: entry \(x_j\) is read once per non-zero in column \(j\), so reuse depends on the sparsity pattern and cache capacity

  3. y.data: written once per row

For a matrix with \(\overline{\text{nnz}}\) non-zeros per row (the row-length average):

\[\text{bytes per row processed} = 2 \cdot \overline{\text{nnz}} \cdot 8 \ \text{B (values + col\_index)} + \overline{\text{nnz}} \cdot 8 \ \text{B (reads of } x\text{)} + 8 \ \text{B (write of } y\text{)}\]

Modern CPUs can sustain ~50 GB/s bandwidth, limiting SpMV throughput to:

\[\text{throughput} \leq \frac{50 \ \text{GB/s}}{24 \ \text{B/nnz}} \approx 2 \ \text{GNNZ/s}\]

where “GNNZ/s” means billions of non-zeros per second.
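
This bound can be stated as a one-liner (a sketch; 24 B/nnz is the read-traffic figure derived earlier):

```cpp
// Bandwidth roofline for SpMV: sustained bandwidth (bytes/s) divided by
// bytes of traffic per non-zero gives the maximum non-zeros per second.
double spmv_throughput_bound(double bytes_per_second) {
    const double bytes_per_nnz = 24.0;  // values + col_index + x reads
    return bytes_per_second / bytes_per_nnz;  // nnz per second
}
```

At 50 GB/s this gives about 2.1e9 non-zeros per second, matching the ~2 GNNZ/s figure above.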

SIMD Vectorization

The inner loop is amenable to SIMD (AVX2/AVX-512) vectorization:

\[\text{AVX-512 vector width} = 8 \ \text{doubles per register}\]

The key is to vectorize the multiplications and use horizontal addition for the reduction:

// Scalar
sum += values[k] * x[col_index[k]];

// SIMD concept (not actual code)
vec_load values[0:7]
vec_gather x[col_index[0:7]]  // indexed loads require a gather
vec_mul
vec_hadd  // horizontal add
sum += result

This can provide 4–8× speedup on the compute portion, but memory bandwidth remains the bottleneck.

Multithreading Roofline

For a 4-core CPU with 50 GB/s memory bandwidth:

Single-core sustained bandwidth: ~12-15 GB/s  (~0.5-0.6 GNNZ/s)
4-core sustained bandwidth:      ~40-50 GB/s  (~1.7-2.1 GNNZ/s, near roofline)

The parallel efficiency is high because there is no shared mutable state in the hot loop.

References

  • Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM. Chapter 5: Basic Iterative Methods.

  • Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multi-Core Architectures. Communications of the ACM, 52(4), 65–76.

  • Im, E., & Yelick, K. (2001). Optimizing Sparse Matrix Vector Multiplication on SMPs. Lecture Notes in Computer Science, 1175–1187.

  • Gahvari, H., et al. (2006). Modeling the Performance of Sparse Matrix Vector Multiplication on Multicore Systems. CLUSTER, 1–11.