.. Timer Utilities Scientific documentation ======================== Timer Utilities =============== Overview -------- Accurate performance measurement is fundamental to scientific computing research. This module provides two timer types with a consistent interface: 1. **CPUTimer**: OpenMP wall-clock timer for CPU code 2. **GPUTimer**: CUDA event-based timer for GPU code Both share the same interface (:cpp:func:`start`, :cpp:func:`stop`, :cpp:func:`elapsed_ms`) but measure time in fundamentally different execution contexts. CPUTimer: OpenMP Wall-Clock Timer --------------------------------- The CPUTimer uses ``omp_get_wtime()``, the OpenMP wall-clock timer function. Mathematical Basis ~~~~~~~~~~~~~~~~~~ The wall-clock time returned by ``omp_get_wtime()`` is defined as: .. math:: t_{\text{wall}} = \text{time since an epoch (typically Jan 1, 1970 UTC)} The ``omp_get_wtime()`` function guarantees: 1. **Monotonicity**: :math:`t_{i+1} \geq t_i` for successive calls 2. **Consistency across threads**: All threads in an OpenMP program see the same time base 3. **Nanosecond resolution**: :math:`\text{resolution} = \Delta t_{\text{min}}` typically :math:`< 1` ns on modern hardware For a code segment: .. math:: \Delta t = \text{omp\_get\_wtime()}_{\text{after}} - \text{omp\_get\_wtime()}_{\text{before}} The reported time is **wall-clock time**, not CPU time. This is critical because: .. math:: \text{CPU time} &= \text{wall time} \times \text{number of active cores} \times \text{utilization} \\ \text{For 4 cores at 100\% utilization:}&\ \text{CPU time} = 4 \times \text{wall time} \\ \text{For 4 cores at 25\% utilization:}&\ \text{CPU time} = 1 \times \text{wall time} Wall-clock time is not affected by: - CPU frequency scaling (P-states, Turbo Boost) - Background processes - Context switches - Other threads in the system CPUTimer Design ~~~~~~~~~~~~~~~ The CPUTimer is deliberately simple: .. code-block:: cpp struct CPUTimer { double start_time = 0.0; void start(); // start_time = omp_get_wtime() void stop(); // no-op double elapsed_ms() const; // (omp_get_wtime() - start_time) * 1000 }; The ``stop()`` function is a no-op because ``elapsed_ms()`` computes the interval on-the-fly: .. math:: t_{\text{elapsed}} = (\text{omp\_get\_wtime()}_{\text{now}} - \text{start\_time}) \times 1000 This design allows calling ``elapsed_ms()`` multiple times after a single ``start()`` to measure cumulative time. GPUTimer: CUDA Event-Based Timer -------------------------------- GPU timing requires a fundamentally different approach due to **asynchronous execution**. The Asynchronous Execution Problem ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CUDA kernels are launched asynchronously: .. code-block:: cpp t.start(); // records event E1 on stream cudaLaunchKernel(...); // returns IMMEDIATELY t.stop(); // records event E2 on stream // at this point, kernel may NOT have finished! double ms = t.elapsed_ms(); // WRONG if called here Host code continues executing while the kernel runs on the GPU. Therefore, host-side timing (even with ``omp_get_wtime()``) measures: .. math:: t_{\text{host}} = t_{\text{kernel}} + t_{\text{launch}} + t_{\text{queue}} + t_{\text{transfer}} where :math:`t_{\text{kernel}}` is the actual GPU execution time. CUDA Event Synchronization ~~~~~~~~~~~~~~~~~~~~~~~~~~ CUDA events provide a mechanism to measure actual kernel time: 1. Events are recorded on the CUDA **stream** in program order 2. ``cudaEventElapsedTime()`` synchronizes internally: .. math:: t_{\text{kernel}} = \text{cudaEventElapsedTime}(E_{\text{start}}, E_{\text{stop}}) 3. The user must call ``cudaStreamSynchronize()`` before reading the result (though ``cudaEventElapsedTime()`` handles this internally) Correct Usage Pattern ~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: cpp GPUTimer t; t.start(); // records start event on stream cudaLaunchKernel(...); // returns immediately t.stop(); // records stop event on stream cudaStreamSynchronize(stream); // WAIT for completion double ms = t.elapsed_ms(); // now correct GPUTimer Implementation Details ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ CUDA events are created with: .. code-block:: cpp cudaEventCreate(&cuda_start_event); // opaque handle cudaEventCreate(&cuda_stop_event); The ``void*`` storage avoids a public dependency on ```` in the header, following the **pimpl idiom**. The ``start()`` and ``stop()`` functions record events but do not synchronize: .. code-block:: cpp void GPUTimer::start() { cudaEventRecord(cuda_start_event, 0); // stream 0 (default) } The ``elapsed_ms()`` function calls ``cudaEventElapsedTime()``: .. math:: \text{elapsed\_ms} = \text{cudaEventElapsedTime}(\text{start}, \text{stop}) \times 1000 Phase 1: Stub Implementation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In Phase 1, CUDA headers are not available. The GPUTimer is a no-op stub: .. code-block:: cpp double GPUTimer::elapsed_ms() const { return 0.0; } Full implementation arrives in Phase 2 when the build system links against ``cuda_runtime``. API Reference ------------- .. cpp:type:: struct CPUTimer OpenMP wall-clock timer. Accurate across threads and unaffected by CPU frequency scaling. .. cpp:func:: void start() Record the current wall-clock time as the start point. Calls ``omp_get_wtime()`` and stores the result in ``start_time``. .. cpp:func:: void stop() No-op. Provided for API symmetry with ``GPUTimer``. CPUTimer computes elapsed on-the-fly in ``elapsed_ms()``. .. cpp:func:: double elapsed_ms() const Returns the time elapsed since ``start()`` was last called, in milliseconds. .. math:: t_{\text{ms}} = (\text{omp\_get\_wtime()}_{\text{now}} - \text{start\_time}) \times 1000 Repeated calls without a preceding ``start()`` return time since the previous ``start()``. .. cpp:type:: struct GPUTimer CUDA event-based GPU timer. Measures GPU kernel execution time on a CUDA stream. Phase 1: This is a no-op stub. Full implementation arrives in Phase 2 when CUDA headers are available. .. cpp:func:: GPUTimer() Constructs the timer. Creates the underlying ``cudaEvent_t`` handles. RAII: the timer is always valid after construction. .. cpp:func:: ~GPUTimer() Destroys the underlying CUDA events. .. cpp:func:: void start() Record the start event on the current CUDA stream. After this call, any kernel launched on the same stream will be ordered after this event. .. cpp:func:: void stop() Record the stop event on the current CUDA stream. All kernels launched before this call will be ordered before it. .. cpp:func:: double elapsed_ms() const Returns elapsed time in milliseconds between the most recent ``start()`` and ``stop()`` calls. **Important**: ``cudaStreamSynchronize()`` must have been called since ``stop()`` before reading this value. .. math:: t_{\text{ms}} = \text{cudaEventElapsedTime}(E_{\text{start}}, E_{\text{stop}}) \times 1000 Performance Measurement Best Practices -------------------------------------- 1. **Warm-up runs**: Discard the first 1-2 iterations to avoid cold-start effects 2. **Multiple samples**: Report the mean and standard deviation over 5-10 runs 3. **Steady state**: Ensure the system is in a stable state (no background processes) 4. **Synchronization**: Always synchronize GPU before timing (``cudaStreamSynchronize``) 5. **Memory transfer**: Time memory transfers separately from kernel execution 6. **Clock gating**: Be aware that GPU clock rates vary with workload (Thermal Throttling) References ---------- * OpenMP Architecture Review Board. *OpenMP Application Programming Interface*, Version 5.0. https://www.openmp.org/specifications/ * NVIDIA Corporation. *CUDA Toolkit Documentation*. https://docs.nvidia.com/cuda/ * Intel Developer Zone. *Improving Performance through OpenMP* (application note).