Timer Utilities
Overview
Accurate performance measurement is fundamental to scientific computing research. This module provides two timer types with a consistent interface:
- CPUTimer: OpenMP wall-clock timer for CPU code
- GPUTimer: CUDA event-based timer for GPU code

Both share the same interface (start(), stop(), elapsed_ms()) but measure time in fundamentally different execution contexts.
CPUTimer: OpenMP Wall-Clock Timer
The CPUTimer uses omp_get_wtime(), the OpenMP wall-clock timer function.
Mathematical Basis
The wall-clock time returned by omp_get_wtime() is the elapsed time in seconds since an arbitrary, fixed reference point:

\[t = t_{\text{now}} - t_{\text{ref}}\]

where \(t_{\text{ref}}\) is fixed for the lifetime of the program. The omp_get_wtime() function guarantees:
- Monotonicity: \(t_{i+1} \geq t_i\) for successive calls
- Consistency across threads: in practice, all threads in an OpenMP program see the same time base
- High resolution: the tick size \(\Delta t_{\text{min}}\) is reported by omp_get_wtick() and is typically on the order of nanoseconds on modern hardware
For a code segment bracketed by start() and a later reading:

\[t_{\text{elapsed}} = t_{\text{stop}} - t_{\text{start}}\]
The reported time is wall-clock time, not CPU time. This distinction is critical:

- Unlike raw cycle counters, wall-clock time is not distorted by CPU frequency scaling (P-states, Turbo Boost).
- Unlike CPU time, wall-clock time does include delays caused by background processes, context switches, and contention with other threads in the system.
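The wall-clock vs. CPU-time distinction can be demonstrated with a short standard-library-only sketch (std::chrono and std::clock stand in here for the OpenMP and OS facilities; names like time_a_sleep are illustrative, not part of the module):

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct TimePair { double wall_ms; double cpu_ms; };

// Time a 100 ms sleep with both clocks. Sleeping consumes essentially no
// CPU, but wall-clock time keeps advancing throughout. (std::clock measures
// processor time on POSIX systems.)
inline TimePair time_a_sleep() {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    double wall_ms = std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - wall_start).count();
    double cpu_ms = 1000.0 * double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    return {wall_ms, cpu_ms};
}
```

On a POSIX system the wall-clock reading will be at least 100 ms, while the CPU-time reading stays near zero.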
CPUTimer Design
The CPUTimer is deliberately simple:
```cpp
struct CPUTimer {
    double start_time = 0.0;

    void start();               // start_time = omp_get_wtime()
    void stop();                // no-op
    double elapsed_ms() const;  // (omp_get_wtime() - start_time) * 1000
};
```
The stop() function is a no-op because elapsed_ms() computes the interval on the fly:

\[\text{elapsed\_ms} = (\text{omp\_get\_wtime}() - \text{start\_time}) \times 1000\]

This design allows calling elapsed_ms() multiple times after a single start() to measure cumulative time.
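A minimal sketch of this design, using std::chrono::steady_clock as a stand-in for omp_get_wtime() so it compiles without OpenMP (both are monotonic wall-clock sources):

```cpp
#include <chrono>

// Sketch of the CPUTimer design described above; std::chrono::steady_clock
// substitutes for omp_get_wtime().
struct CPUTimer {
    std::chrono::steady_clock::time_point start_time{};

    void start() { start_time = std::chrono::steady_clock::now(); }
    void stop()  {}  // no-op: the interval is computed on demand

    double elapsed_ms() const {
        return std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start_time).count();
    }
};
```

Because elapsed_ms() re-reads the clock each time, successive calls after one start() yield non-decreasing, cumulative readings.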
GPUTimer: CUDA Event-Based Timer
GPU timing requires a fundamentally different approach due to asynchronous execution.
The Asynchronous Execution Problem
CUDA kernels are launched asynchronously:
```cpp
t.start();                   // records event E1 on stream
cudaLaunchKernel(...);       // returns IMMEDIATELY
t.stop();                    // records event E2 on stream
// at this point, the kernel may NOT have finished!
double ms = t.elapsed_ms();  // WRONG if called here
```
Host code continues executing while the kernel runs on the GPU. Therefore, host-side timing (even with omp_get_wtime()) measures only the launch overhead:

\[t_{\text{measured}} \approx t_{\text{launch}} \ll t_{\text{kernel}}\]

where \(t_{\text{kernel}}\) is the actual GPU execution time.
CUDA Event Synchronization
CUDA events provide a mechanism to measure actual kernel time:
- Events are recorded on the CUDA stream in program order.
- The elapsed time between two completed events is queried with cudaEventElapsedTime():

\[t_{\text{kernel}} = \text{cudaEventElapsedTime}(E_{\text{start}}, E_{\text{stop}})\]

- The user must synchronize (e.g. cudaStreamSynchronize(), or cudaEventSynchronize() on the stop event) before reading the result; cudaEventElapsedTime() does not wait, and returns cudaErrorNotReady if either event has not yet completed.
Correct Usage Pattern
```cpp
GPUTimer t;
t.start();                      // records start event on stream
cudaLaunchKernel(...);          // returns immediately
t.stop();                       // records stop event on stream
cudaStreamSynchronize(stream);  // WAIT for completion
double ms = t.elapsed_ms();     // now correct
```
GPUTimer Implementation Details
CUDA events are created with:

```cpp
cudaEventCreate(&cuda_start_event);  // opaque handle
cudaEventCreate(&cuda_stop_event);
```
The void* storage avoids a public dependency on <cuda_runtime.h> in
the header, following the pimpl idiom.
The start() and stop() functions record events but do not synchronize:

```cpp
void GPUTimer::start() {
    cudaEventRecord(cuda_start_event, 0);  // stream 0 (default)
}

void GPUTimer::stop() {
    cudaEventRecord(cuda_stop_event, 0);   // also just records; no wait
}
```
The elapsed_ms() function calls cudaEventElapsedTime():
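The body of elapsed_ms() appears to have been lost here; a sketch of what it would look like, assuming the event handles are stored as void* as described above:

```cuda
// Sketch only: casts the void* handles back to cudaEvent_t.
// cudaEventElapsedTime returns the interval in milliseconds as a float;
// the caller must have synchronized first, or it reports cudaErrorNotReady.
double GPUTimer::elapsed_ms() const {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms,
                         static_cast<cudaEvent_t>(cuda_start_event),
                         static_cast<cudaEvent_t>(cuda_stop_event));
    return static_cast<double>(ms);
}
```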
Phase 1: Stub Implementation
In Phase 1, CUDA headers are not available. The GPUTimer is a no-op stub:
```cpp
double GPUTimer::elapsed_ms() const { return 0.0; }
```
Full implementation arrives in Phase 2 when the build system links against
cuda_runtime.
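A self-contained sketch of what the Phase 1 stub could look like, assuming the void* handle storage described above (member names follow the snippets shown earlier; the exact header layout is an assumption):

```cpp
// Phase 1 stub: compiles without <cuda_runtime.h>. The void* members
// reserve space for the CUDA event handles that Phase 2 will populate.
struct GPUTimer {
    void* cuda_start_event = nullptr;
    void* cuda_stop_event  = nullptr;

    void start() {}                            // no-op until Phase 2
    void stop()  {}                            // no-op until Phase 2
    double elapsed_ms() const { return 0.0; }  // stub value from the text
};
```

This keeps the public interface identical across phases, so calling code compiles unchanged once the CUDA-backed implementation lands.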
API Reference
- struct CPUTimer
OpenMP wall-clock timer. Accurate across threads and unaffected by CPU frequency scaling.
- struct GPUTimer
CUDA event-based GPU timer. Measures GPU kernel execution time on a CUDA stream.
Phase 1: This is a no-op stub. Full implementation arrives in Phase 2 when CUDA headers are available.
Performance Measurement Best Practices
- Warm-up runs: discard the first 1-2 iterations to avoid cold-start effects
- Multiple samples: report the mean and standard deviation over 5-10 runs
- Steady state: ensure the system is in a stable state (no background processes)
- Synchronization: always synchronize the GPU before timing (cudaStreamSynchronize())
- Memory transfers: time memory transfers separately from kernel execution
- Clock gating: be aware that GPU clock rates vary with workload (thermal throttling)
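The warm-up and multiple-sample practices can be sketched as a small host-side harness (std::chrono stands in for the timers above; the benchmark helper and its parameters are illustrative, not part of this module):

```cpp
#include <chrono>
#include <cmath>
#include <vector>

struct Stats { double mean_ms; double stddev_ms; };

// Discard warm-up iterations, then report mean and (population) standard
// deviation over several timed runs of the workload.
template <typename F>
Stats benchmark(F&& workload, int warmup = 2, int samples = 5) {
    for (int i = 0; i < warmup; ++i) workload();  // discard cold-start runs

    std::vector<double> times;
    for (int i = 0; i < samples; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        workload();
        times.push_back(std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count());
    }

    double mean = 0.0;
    for (double t : times) mean += t;
    mean /= samples;

    double var = 0.0;
    for (double t : times) var += (t - mean) * (t - mean);
    var /= samples;

    return {mean, std::sqrt(var)};
}
```

For GPU workloads, the lambda passed in would launch the kernel and synchronize before returning, so each sample reflects completed device work.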
References
OpenMP Architecture Review Board. OpenMP Application Programming Interface, Version 5.0. https://www.openmp.org/specifications/
NVIDIA Corporation. CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/
Intel Developer Zone. Improving Performance through OpenMP (application note).