Timer Utilities

Overview

Accurate performance measurement is fundamental to scientific computing research. This module provides two timer types with a consistent interface:

  1. CPUTimer: OpenMP wall-clock timer for CPU code

  2. GPUTimer: CUDA event-based timer for GPU code

Both share the same interface (start(), stop(), elapsed_ms()) but measure time in fundamentally different execution contexts.

CPUTimer: OpenMP Wall-Clock Timer

The CPUTimer uses omp_get_wtime(), the OpenMP wall-clock timer function.

Mathematical Basis

The wall-clock time returned by omp_get_wtime() is defined as:

\[t_{\text{wall}} = \text{seconds since an arbitrary, implementation-defined point in the past}\]

The OpenMP specification leaves the reference point unspecified (it need not be the Unix epoch), so only differences between calls are meaningful.

The omp_get_wtime() function provides:

  1. Monotonicity: \(t_{i+1} \geq t_i\) for successive calls on the same thread

  2. Per-thread time: the OpenMP specification defines the times as per-thread and does not require them to be globally consistent across threads, although most implementations use a single shared system clock

  3. Resolution: queryable via omp_get_wtick(); typically on the order of nanoseconds to microseconds on modern hardware

For a code segment:

\[\Delta t = \text{omp\_get\_wtime()}_{\text{after}} - \text{omp\_get\_wtime()}_{\text{before}}\]

The reported time is wall-clock time, not CPU time. This is critical because:

\[\text{CPU time} = \text{wall time} \times \text{number of active cores} \times \text{utilization}\]

For example, 4 cores at 100% utilization give \(\text{CPU time} = 4 \times \text{wall time}\), while 4 cores at 25% utilization give \(\text{CPU time} = 1 \times \text{wall time}\).

The wall clock itself advances at a constant rate, unaffected by:

- CPU frequency scaling (P-states, Turbo Boost)

- Which core or thread performs the measurement

However, a measured wall-clock interval includes time the code spends descheduled: background processes, context switches, and other threads in the system inflate the measurement rather than being excluded from it.

CPUTimer Design

The CPUTimer is deliberately simple:

struct CPUTimer {
    double start_time = 0.0;
    void start();      // start_time = omp_get_wtime()
    void stop();       // no-op
    double elapsed_ms() const; // (omp_get_wtime() - start_time) * 1000
};

The stop() function is a no-op because elapsed_ms() computes the interval on-the-fly:

\[t_{\text{elapsed}} = (\text{omp\_get\_wtime()}_{\text{now}} - \text{start\_time}) \times 1000\]

This design allows calling elapsed_ms() multiple times after a single start() to measure cumulative time.

GPUTimer: CUDA Event-Based Timer

GPU timing requires a fundamentally different approach due to asynchronous execution.

The Asynchronous Execution Problem

CUDA kernels are launched asynchronously:

t.start();                      // records event E1 on stream
cudaLaunchKernel(...);          // returns IMMEDIATELY
t.stop();                       // records event E2 on stream
// at this point, kernel may NOT have finished!
double ms = t.elapsed_ms();     // WRONG if called here

Host code continues executing while the kernel runs on the GPU. Without synchronization, a host-side timer captures only the launch overhead; with synchronization, host-side timing (even with omp_get_wtime()) measures:

\[t_{\text{host}} = t_{\text{kernel}} + t_{\text{launch}} + t_{\text{queue}} + t_{\text{transfer}}\]

where \(t_{\text{kernel}}\) is the actual GPU execution time.

CUDA Event Synchronization

CUDA events provide a mechanism to measure actual kernel time:

  1. Events are recorded on the CUDA stream in program order

  2. Once both events have completed on the GPU, their timestamps give the kernel time:

    \[t_{\text{kernel}} = \text{cudaEventElapsedTime}(E_{\text{start}}, E_{\text{stop}})\]
  3. cudaEventElapsedTime() does not synchronize: it fails with cudaErrorNotReady if either event has not yet completed. The user must call cudaEventSynchronize() on the stop event (or cudaStreamSynchronize() on the stream) before reading the result

Correct Usage Pattern

GPUTimer t;
t.start();                    // records start event on stream
cudaLaunchKernel(...);       // returns immediately
t.stop();                    // records stop event on stream
cudaStreamSynchronize(stream); // WAIT for completion
double ms = t.elapsed_ms();  // now correct

GPUTimer Implementation Details

CUDA events are created with:

cudaEventCreate(&cuda_start_event);  // opaque handle
cudaEventCreate(&cuda_stop_event);

The void* storage avoids a public dependency on <cuda_runtime.h> in the header, following the pimpl idiom.

The start() and stop() functions record events but do not synchronize:

void GPUTimer::start() {
    cudaEventRecord(cuda_start_event, 0);  // stream 0 (default)
}

The elapsed_ms() function calls cudaEventElapsedTime(), which already reports milliseconds, so no unit conversion is needed:

\[\text{elapsed\_ms} = \text{cudaEventElapsedTime}(\text{start}, \text{stop})\]
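Putting the pieces together, one possible Phase 2 implementation (a sketch only: the member names and the decision to synchronize on the stop event inside elapsed_ms() are assumptions, and error handling is elided):

```cuda
#include <cuda_runtime.h>

struct GPUTimer {
    void* cuda_start_event = nullptr;  // opaque cudaEvent_t handles (pimpl-style)
    void* cuda_stop_event  = nullptr;

    GPUTimer() {
        cudaEventCreate(reinterpret_cast<cudaEvent_t*>(&cuda_start_event));
        cudaEventCreate(reinterpret_cast<cudaEvent_t*>(&cuda_stop_event));
    }
    ~GPUTimer() {
        cudaEventDestroy(static_cast<cudaEvent_t>(cuda_start_event));
        cudaEventDestroy(static_cast<cudaEvent_t>(cuda_stop_event));
    }
    void start() { cudaEventRecord(static_cast<cudaEvent_t>(cuda_start_event), 0); }
    void stop()  { cudaEventRecord(static_cast<cudaEvent_t>(cuda_stop_event), 0); }

    double elapsed_ms() const {
        // Block until the stop event has completed; without this,
        // cudaEventElapsedTime would fail with cudaErrorNotReady.
        cudaEventSynchronize(static_cast<cudaEvent_t>(cuda_stop_event));
        float ms = 0.0f;  // cudaEventElapsedTime reports milliseconds directly
        cudaEventElapsedTime(&ms,
                             static_cast<cudaEvent_t>(cuda_start_event),
                             static_cast<cudaEvent_t>(cuda_stop_event));
        return static_cast<double>(ms);
    }
};
```

Synchronizing inside elapsed_ms() makes the timer safe even if the caller forgets the explicit cudaStreamSynchronize() shown in the usage pattern above.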

Phase 1: Stub Implementation

In Phase 1, CUDA headers are not available. The GPUTimer is a no-op stub:

double GPUTimer::elapsed_ms() const { return 0.0; }

Full implementation arrives in Phase 2 when the build system links against cuda_runtime.

API Reference

struct CPUTimer

OpenMP wall-clock timer. Accurate across threads and unaffected by CPU frequency scaling.

struct GPUTimer

CUDA event-based GPU timer. Measures GPU kernel execution time on a CUDA stream.

Phase 1: This is a no-op stub. Full implementation arrives in Phase 2 when CUDA headers are available.

Performance Measurement Best Practices

  1. Warm-up runs: Discard the first 1-2 iterations to avoid cold-start effects

  2. Multiple samples: Report the mean and standard deviation over 5-10 runs

  3. Steady state: Ensure the system is in a stable state (no background processes)

  4. Synchronization: Always synchronize GPU before timing (cudaStreamSynchronize)

  5. Memory transfer: Time memory transfers separately from kernel execution

  6. Clock variability: Be aware that GPU clock rates vary with workload and temperature (boost clocks, thermal throttling)

References