.. CPU SpMV Implementations
   Scientific documentation with formulas
   =======================================

CPU SpMV Implementations
========================

Overview
--------

**Sparse Matrix-Vector Multiplication (SpMV)** is the fundamental kernel in
iterative linear solvers and eigenvalue methods. Given a sparse matrix
:math:`\mathbf{A} \in \mathbb{R}^{m \times n}`, an input dense vector
:math:`\mathbf{x} \in \mathbb{R}^{n}`, and an output dense vector
:math:`\mathbf{y} \in \mathbb{R}^{m}`, SpMV computes:

.. math::

   \mathbf{y} = \mathbf{A} \cdot \mathbf{x}

Element-wise, this is:

.. math::

   y_i = \sum_{j=0}^{n-1} a_{ij} \cdot x_j \quad \forall i \in [0, m)

However, for a sparse matrix where :math:`\text{nnz} \ll m \times n`,
we only iterate over non-zero entries.

SpMV in CSR Format
------------------

For a CSR-format matrix (see :ref:`csr-format` in sparse_matrix.rst),
the SpMV operation becomes:

.. math::

   y_i = \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} 
         \text{values}[k] \cdot x_{\text{col\_index}[k]}

This is the **canonical SpMV algorithm**: for each row :math:`i`, accumulate the
dot product of the row's non-zeros with the corresponding :math:`x` entries.

Computational Complexity
~~~~~~~~~~~~~~~~~~~~~~~~

The time complexity of SpMV is:

.. math::

   T_{\text{SpMV}} = \Theta(\text{nnz})

This is optimal: every non-zero must be visited at least once.

Arithmetic Intensity
~~~~~~~~~~~~~~~~~~~~

**Arithmetic intensity** (FLOPs per byte of memory traffic) characterizes
the compute vs. memory balance:

For each non-zero, we perform:
- 1 multiplication: :math:`v \cdot x_j`
- 1 addition: accumulate into :math:`y_i`

Total: **2 FLOPs per non-zero**

Memory traffic (for CSR on CPU, streaming through once):

.. math::

   \text{reads} = \underbrace{\text{nnz} \cdot 8}_{\text{values}}
               + \underbrace{\text{nnz} \cdot 8}_{\text{col\_index}}
               + \underbrace{\text{nnz} \cdot 8}_{x\text{ (read once, cached)}}
               = 24 \cdot \text{nnz} \ \text{bytes}

   \text{writes} = \underbrace{m \cdot 8}_{y\text{ (output)}}
                 \approx 8 \cdot \text{nnz} \ \text{bytes for large } m

For typical sparse matrices with balanced rows:

.. math::

   \text{bytes per nnz} \approx \frac{24 \cdot \text{nnz}}{\text{nnz}} = 24 \ \text{B/nnz}

   \text{AI} = \frac{2 \cdot \text{nnz}}{24 \cdot \text{nnz}} = \frac{1}{12} \approx 0.083
            \ \text{FLOPs/byte}

This is **memory-bound** (AI << 1), not compute-bound. Optimization focus should
be on maximizing memory bandwidth utilization, not FLOP count.

Serial Implementation (spmv_cpu_serial)
----------------------------------------

The serial implementation uses a single thread with straightforward row-wise
iteration:

.. code-block:: cpp

   void spmv_cpu_serial(const SparseMatrix& A, const DenseVector& x, DenseVector& y) {
       y.resize(A.rows);
       for (int64_t i = 0; i < A.rows; ++i) {
           double sum = 0.0;
           for (int64_t k = A.row_ptr[i]; k < A.row_ptr[i+1]; ++k) {
               sum += A.values[k] * x.data[A.col_index[k]];
           }
           y.data[i] = sum;
       }
   }

Correctness Argument
~~~~~~~~~~~~~~~~~~~~~

**Theorem**: The serial implementation computes :math:`\mathbf{y} = \mathbf{A} \cdot \mathbf{x}` exactly.

**Proof**: For each row :math:`i`:

1. The inner loop iterates over :math:`k \in [\text{row\_ptr}[i], \text{row\_ptr}[i+1])`
2. For each :math:`k`, it adds :math:`\text{values}[k] \cdot x_{\text{col\_index}[k]}`
3. By the CSR invariant, this is precisely:
   
   .. math::

      \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} a_{i, \text{col\_index}[k]} \cdot x_{\text{col\_index}[k]}
    = \sum_{j=0}^{n-1} a_{ij} \cdot x_j

   since :math:`a_{ij} = 0` for all :math:`j \notin \{\text{col\_index}[k]\}` in the range.

4. The result is written to ``y[i]``.

By induction over all rows :math:`i = 0, \ldots, m-1`, we have :math:`y_i = (\mathbf{A} \cdot \mathbf{x})_i`
for all :math:`i`. ∎

IEEE-754 Determinism
~~~~~~~~~~~~~~~~~~~~~

The serial implementation is **deterministic** because:

1. The accumulation order is fixed: left-to-right within each row
2. IEEE-754 floating-point addition is deterministic
3. No data races (single thread)

**Lemma**: The accumulation order within a row is exactly the order of entries
in the CSR ``values`` array, which is file order (Matrix Market) preserved through
COO→CSR conversion.

OpenMP Parallel Implementation (spmv_cpu_omp)
----------------------------------------------

The parallel implementation uses OpenMP to distribute rows across CPU threads:

.. code-block:: cpp

   void spmv_cpu_omp(const SparseMatrix& A, const DenseVector& x, DenseVector& y) {
       y.resize(A.rows);
       #pragma omp parallel for schedule(static)
       for (int64_t i = 0; i < A.rows; ++i) {
           double sum = 0.0;
           for (int64_t k = A.row_ptr[i]; k < A.row_ptr[i+1]; ++k) {
               sum += A.values[k] * x.data[A.col_index[k]];
           }
           y.data[i] = sum;
       }
   }

Parallelization Strategy
~~~~~~~~~~~~~~~~~~~~~~~~~

**Row-level parallelism**: Each row :math:`i` is independent, so we distribute
rows across threads using static scheduling.

Why Static Schedule?
^^^^^^^^^^^^^^^^^^^^

The work per row varies: :math:`\text{row\_ptr}[i+1] - \text{row\_ptr}[i]` is the
number of non-zeros in row :math:`i`. For random matrices, this follows a
distribution (often Poisson or multinomial).

However, for the static scheduling of simple vector kernels like SpMV:

1. **Load imbalance is acceptable**: The variation in row lengths averages out
   over the thousands of rows typical in sparse matrices
2. **No scheduling overhead**: ``static`` scheduling has near-zero overhead
3. **Cache locality**: Consecutive iterations access consecutive memory,
   enabling hardware prefetching

Dynamic scheduling would introduce more overhead (work queue management,
atomic operations) than it would save for this memory-bound kernel.

Why Not a Reduction Clause?
^^^^^^^^^^^^^^^^^^^^^^^^^^^

A naive OpenMP reduction:

.. code-block:: cpp

   #pragma omp parallel for reduction(+:y[:A.rows])  // WRONG

The reduction clause accumulates into an **unspecified order** across threads.
For floating-point addition:

.. math::

   ((a + b) + c) + d \neq (a + b) + (c + d) \quad \text{in IEEE-754}

Different thread interleavings produce different rounding errors, resulting in
**bitwise non-deterministic output**.

Our solution: **private scalar accumulator per row**:

.. code-block:: cpp

   double sum = 0.0;  // private to each thread
   // ... accumulate into sum ...
   y.data[i] = sum;   // write once, atomically (but writes are to different indices)

Since each ``y[i]`` is written by exactly one thread (row :math:`i` belongs to
exactly one thread under static scheduling), there are no data races.

Correctness Argument
~~~~~~~~~~~~~~~~~~~~~

**Theorem**: ``spmv_cpu_omp`` produces **bitwise identical** output to ``spmv_cpu_serial``.

**Proof**: We show that for every row :math:`i`, the computation is identical:

1. Row :math:`i` is assigned to exactly one thread under static scheduling
2. That thread computes:

   .. math::

      \sum_{k=\text{row\_ptr}[i]}^{\text{row\_ptr}[i+1]-1} 
      \text{values}[k] \cdot x_{\text{col\_index}[k]}

   in exactly the same order as the serial version (sequential within thread)
3. The result is written to ``y[i]`` once, with no concurrent writes
4. No other thread reads or writes ``y[i]``

Therefore, :math:`y_i^{\text{serial}} = y_i^{\text{omp}}` for all :math:`i`. ∎

This bitwise equality is critical for the correctness test infrastructure.

Vector Utility Functions
------------------------

.. cpp:function:: void fill_zero(DenseVector& v)

   Sets all entries of a vector to 0.0.

   .. math::

      v_i = 0.0 \quad \forall i \in [0, v.\text{size})

.. cpp:function:: void fill_constant(DenseVector& v, double val)

   Sets all entries of a vector to a given value.

   .. math::

      v_i = \text{val} \quad \forall i \in [0, v.\text{size})

.. cpp:function:: double infnorm(const DenseVector& a, const DenseVector& b)

   Computes the **L-infinity norm** of the difference vector :math:`\mathbf{a} - \mathbf{b}`:

   .. math::

      \|\mathbf{a} - \mathbf{b}\|_\infty = \max_{i} |a_i - b_i|

   Used to compare two SpMV outputs for correctness.

   **Tolerance**: We consider two results "equal" if the L-inf error is below
   :math:`10^{-15}` for double-precision arithmetic. This accounts for rounding
   differences in accumulation order while catching real bugs.

   :param a: First vector (size :math:`N`)
   :param b: Second vector (size :math:`N`); must have same size as ``a``
   :return: :math:`\max_i |a_i - b_i|`

API Reference
-------------

.. cpp:function:: void spmv_cpu_serial(const SparseMatrix& A, const DenseVector& x, DenseVector& y)

   Sequential (single-threaded) CSR SpMV.

   The **canonical reference implementation**. Runs on one CPU core.
   Used as the golden reference for all correctness tests.

   **Performance**: :math:`O(\text{nnz})` memory bandwidth-bound operations.
   Typical throughput: 5–15 GB/s on a modern CPU (DDR4 or DDR5).

   :param A: Input matrix in CSR format, size :math:`m \times n`
   :param x: Input dense vector, size :math:`n`
   :param[out] y: Output dense vector, size :math:`m`. Resized automatically.

.. cpp:function:: void spmv_cpu_omp(const SparseMatrix& A, const DenseVector& x, DenseVector& y)

   OpenMP-parallel CSR SpMV.

   Parallel row-wise SpMV using OpenMP. One thread per row (or a contiguous
   chunk of rows for very large matrices).

   **Performance**: Near-linear speedup with core count for memory-bandwidth-bound
   workloads. Typical: 3.5–4× speedup on 4-core CPU.

   :param A: Input matrix in CSR format, size :math:`m \times n`
   :param x: Input dense vector, size :math:`n`
   :param[out] y: Output dense vector, size :math:`m`. Resized automatically.

Performance Characterization
----------------------------

Cache Behavior
~~~~~~~~~~~~~~

SpMV is a **streaming** operation with poor temporal locality:

1. ``values`` and ``col_index``: traversed once, in order (good spatial locality)
2. ``x.data``: read multiple times (once per non-zero in each column accessed)
3. ``y.data``: written once per row

For a matrix with :math:`\bar{n nz}` non-zeros per row (the row-length average):

.. math::

   \text{bytes per row processed} = 2 \cdot \bar{n nz} \cdot 8 \ \text{B (values + col\_index)}
                                  + \bar{n nz} \cdot 8 \ \text{B (reads of x)}
                                  + 8 \ \text{B (write of y)}

Modern CPUs can sustain ~50 GB/s bandwidth, limiting SpMV throughput to:

.. math::

   \text{throughput} \leq \frac{50 \ \text{GB/s}}{24 \ \text{B/nnz}} \approx 2 \ \text{GNNZ/s}

where "GNNZ/s" means billions of non-zeros per second.

SIMD Vectorization
~~~~~~~~~~~~~~~~~~

The inner loop is amenable to SIMD (AVX2/AVX-512) vectorization:

.. math::

   \text{AVX-512 width} = 8 \ \text{doubles per cycle}

The key is to vectorize the **multiplications** and use horizontal addition
for the reduction:

.. code-block:: text

   // Scalar
   sum += values[k] * x[col_index[k]];

   // SIMD concept (not actual code)
   vec_load values[0:7]
   vec_load x[col_index[0:7]]
   vec_mul
   vec_hadd  // horizontal add
   sum += result

This can provide 4–8× speedup on the compute portion, but memory bandwidth
remains the bottleneck.

Multithreading Roofline
~~~~~~~~~~~~~~~~~~~~~~~

For a 4-core CPU with 50 GB/s memory bandwidth:

.. code::

   Single-core throughput: ~12-15 GNNZ/s
   4-core throughput:      ~40-50 GNNZ/s (near roofline)

The parallel efficiency is high because there is no shared mutable state
in the hot loop.

References
----------

* Saad, Y. (2003). *Iterative Methods for Sparse Linear Systems* (2nd ed.). SIAM.
  Chapter 5: Basic Iterative Methods.
* Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful
  Visual Performance Model for Multi-Core Architectures. *Communications of the ACM*, 52(4), 65–76.
* Im, E., & Yelick, K. (2001). Optimizing Sparse Matrix Vector Multiplication on SMPs.
  *Lecture Notes in Computer Science*, 1175–1187.
* Gahvari, H., et al. (2006). Modeling the Performance of Sparse Matrix Vector
  Multiplication on Multicore Systems. *CLUSTER*, 1–11.