



Delivering science and technology to protect our nation and promote world stability



#### Using Statistical Methods to Validate Hardware Performance Monitors

Brian J Gravelle; Mentor: Dave Nystrom; HPC-ENV; Aug. 13, 2020



# What are Hardware Performance Monitors?

- Hardware performance monitors (HPMs) are used for performance analysis
  - Also known as hardware counters
  - Bridge the gap between application and hardware
  - Cover numerous features of the hardware
  - Count events that occur while an application runs



# How are they used?

- Many ways
- Usually ad hoc
- Specific issues are investigated
  - Instructions per cycle (IPC)
  - Cache miss rates
  - FLOP counts
  - Memory Bandwidth
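
As a minimal sketch of how the metrics above are typically derived from raw counts: the dictionary keys (`instructions`, `cycles`, `l1d_misses`, `l1d_accesses`, `flops`, `dram_lines`), the 64-byte cache line, and the sample values are all hypothetical placeholders for illustration, not real counter names or measurements.

```python
def derived_metrics(counts, elapsed_s, line_bytes=64):
    """Turn raw event counts into IPC, miss rate, FLOP rate, and bandwidth.

    `counts` uses hypothetical keys; real counter names differ per vendor.
    `line_bytes` assumes one DRAM transfer moves one 64-byte cache line.
    """
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "l1d_miss_rate": counts["l1d_misses"] / counts["l1d_accesses"],
        "gflops": counts["flops"] / elapsed_s / 1e9,
        "mem_bw_gbs": counts["dram_lines"] * line_bytes / elapsed_s / 1e9,
    }


# Made-up counts for a 2 ms region, purely to show the arithmetic.
sample_counts = {
    "instructions": 2_000_000,
    "cycles": 1_000_000,
    "l1d_misses": 50_000,
    "l1d_accesses": 1_000_000,
    "flops": 4_000_000,
    "dram_lines": 125_000,
}
metrics = derived_metrics(sample_counts, elapsed_s=0.002)
```

Each metric needs a different set of events, which is part of why this kind of analysis tends to be ad hoc.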



# **Top-Down Method**

- An alternative, organized method for using HPMs




## **Top-Down Method**

- Top-Down was designed for Intel Xeon systems
- But there are many other vendors
- Systems of interest
  - AMD Rome
  - ARM/Marvell Thunder
  - ARM/Fujitsu A64FX
- Will Top-Down work on these systems?



# **Portability of HPMs**

- Portability of applications is increasingly important
  - Consistent metrics across systems would help
- Challenges
  - Vendors have different counter designs
  - Different system architectures
  - Inconsistent correctness



# **Solutions**

- Benchmarks with known problems
  - These must be problems common to all or most systems
  - Verify the counters used to identify problems
- These benchmarks provide two things
  - Common language for developers to use across systems
  - Methods to validate measurements
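
One way to picture the validation step, as a sketch only: if a benchmark's event count is known analytically, a counter can be checked by comparing its measured value to that expected value. The function names and the 5% tolerance below are assumptions chosen for illustration, not part of the presented method.

```python
def relative_error(measured, expected):
    """Relative deviation of a measured count from its known value."""
    return abs(measured - expected) / expected


def counter_is_valid(measured, expected, tolerance=0.05):
    """Accept a counter if it is within `tolerance` of the expected count.

    The 5% default is an arbitrary illustrative threshold.
    """
    return relative_error(measured, expected) <= tolerance


# Made-up example: a kernel known to perform 1,000,000 floating-point adds.
expected_flops = 1_000_000
close_reading = counter_is_valid(1_020_000, expected_flops)   # 2% off
wrong_reading = counter_is_valid(1_200_000, expected_flops)   # 20% off
```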



## Benchmark

    for elem in array
        load element
        sum = 0
        for FLOPS per load
            sum += elem
        store sum to elem
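
The pseudocode can be sketched in Python to make its structure and expected event counts explicit. This is illustrative only: interpreter overhead would swamp the hardware counters, so an actual measurement would presumably use a compiled kernel. The names `run_benchmark`, `expected_counts`, and `flops_per_load` are hypothetical.

```python
def run_benchmark(array, flops_per_load):
    """One load, `flops_per_load` adds, and one store per element."""
    for i in range(len(array)):
        elem = array[i]                  # load element
        total = 0.0
        for _ in range(flops_per_load):
            total += elem                # one floating-point add
        array[i] = total                 # store sum to elem
    return array


def expected_counts(n_elems, flops_per_load):
    """Event counts the kernel should produce, known by construction."""
    return {
        "loads": n_elems,
        "stores": n_elems,
        "flops": n_elems * flops_per_load,
    }


result = run_benchmark([1.0, 2.0], flops_per_load=3)
counts = expected_counts(n_elems=4, flops_per_load=8)
```

Raising `flops_per_load` increases the arithmetic done per memory access, which should shift the kernel away from being limited by memory.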





## Results

#### Educated guessing of counters was unsuccessful



## Results

- How do we find the counters we need?
- Why not try them all?
  - About 50 to 60 of interest on Rome
  - Compare each counter's results to the memory boundness measured on Skylake
  - Use linear regression to score the fit
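
A minimal sketch of this "try them all" approach, assuming each counter is fitted against the reference metric with a simple one-variable linear regression and scored by R squared. The counter names and sample values are made up for illustration and are not measurements.

```python
def r_squared(x, y):
    """R squared of a least-squares line fit of y on x.

    For a one-variable fit with an intercept this equals the squared
    Pearson correlation. A constant input is scored as 0.0.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def rank_counters(counter_samples, reference):
    """Score every counter against the reference, best fit first."""
    scores = {name: r_squared(vals, reference)
              for name, vals in counter_samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up data: one value per benchmark configuration.
reference = [0.1, 0.3, 0.5, 0.7, 0.9]   # stand-in for reference memory boundness
counter_samples = {
    "counter_a": [10, 30, 50, 70, 90],  # tracks the reference exactly
    "counter_b": [5, 5, 5, 5, 5],       # never changes
    "counter_c": [3, 1, 4, 1, 5],       # loosely related
}
ranking = rank_counters(counter_samples, reference)
```

Counters near the top of the ranking are candidates for further validation; a high R squared alone does not show that a counter measures the intended event.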



## Results

#### Linear regression to test all of the counters

Figure: R squared for different counters relative to Skylake memory boundness. Counters shown:

- PERF\_COUNT\_HW\_CACHE\_L1D\_ACCESS
- DISPATCH\_RESOURCE\_STALL\_CYCLES\_1\_LOAD\_QUEUE
- LS\_DISPATCH\_LD\_ST\_DISPATCH
- RETIRED\_INSTRUCTIONS
- PERF\_COUNT\_HW\_STALLED\_CYCLES\_BACKEND
- RETIRED\_BRANCH\_INSTRUCTIONS
- RETIRED\_TAKEN\_BRANCH\_INSTRUCTIONS



## **Future Work**

- Improve benchmark and fitting
- Add benchmarks for other performance issues
- Explore other architectures
- Apply to real application





Over 70 years at the forefront of supercomputing

Contact: Brian J Gravelle, gravelle@lanl.gov


