

#### **Extending Performance Monitoring Profile Guided Optimization Capabilities**

#### Michael Chynoweth - Sr. Principal Engineer Intel Corporation

Contributors: Joe Olivas, Chris Chrulski, Patrick Konsor, Rajshree Chabukswar, Stas Bratanov, Hideki Saito, Angie Schmid, Sneha Gohad, Robert Cox, Zia Ansari, Ahmad Yasin, Lama Saba, Dorit Nuzman



### Agenda

- Today, Profile Guided Optimizations (PGO) mostly impact the code/text section
  - Extending the analysis behind the text section optimizations
- Who's Interested?
- Next generation of PGO will utilize more events
  - Allow focus on the right bottleneck
- Examples of automatic profile guided optimizations with compiler
  - Decision on whether to fix a uarch bottleneck
  - Loop optimizations
  - Data reordering

#### Top Down: Our Processor is Just An Assembly Line



- Abstracts our architectures into 4 categories
  - Front End Bound
  - Back End Bound
  - Bad Speculation
  - Retiring
- Focus our efforts on the right bottlenecks




#### **Top Down Helps Define the Primary Bottleneck**

## Everything is Driven by Top Down Optimizations

| Cost         | Performance Monitoring Events Calculation                       |
|--------------|-----------------------------------------------------------------|
| <b>38.8%</b> | NO_ALLOC_CYCLES.NOT_DELIVERED/CPU_CLK_UNHALTED.CORE             |
| 26.3%        | INST_LINE_FETCH_COST+PREDECODE_WRONG_COST                       |
| 7.2%         | FETCH_STALL.ICACHE_FILL_PENDING_CYCLES*1/CPU_CLK_UNHALTED.CORE  |
| 19.1%        | DECODE_RESTRICTION.PDCACHE_WRONG*3/CPU_CLK_UNHALTED.CORE        |
| 8.5%         | PAGE_WALKS.I_SIDE_CYCLES*1/CPU_CLK_UNHALTED.CORE                |
| 44.1%        | 1-RETIRING-FRONT_END_BOUND-BAD_SPECULATION                      |
| 12.0%        | MEM_UOPS_RETIRED.L2_MISS_LOADS_PS*230/CPU_CLK_UNHALTED.CORE     |
| 9.0%         | PAGE_WALKS.D_SIDE_CYCLES*1/CPU_CLK_UNHALTED.CORE                |
| 3.6%         | NO_ALLOC_CYCLES.MISPREDICTS*1/CPU_CLK_UNHALTED.CORE             |
| 5.70%        | BR_MISP_RETIRED_ALL_BRANCHES_PS*10/CPU_CLK_UNHALTED.CORE        |
| 13.5%        | UOPS_RETIRED.ALL*0.5/CPU_CLK_UNHALTED.CORE                      |

Fixed issues in red (will cover later)

Performance Monitoring Tells Where We are Bound and By How Much
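
The top-level formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not the tooling behind the slides: the event names and scale factors are taken from the table, while the function names and the raw counts are assumptions, with the counts chosen only so that the output reproduces the slide's percentages.

```python
def top_down(counts):
    """Return the four top-level Top Down fractions from raw event counts."""
    clk = counts["CPU_CLK_UNHALTED.CORE"]
    front_end = counts["NO_ALLOC_CYCLES.NOT_DELIVERED"] / clk
    bad_spec = counts["NO_ALLOC_CYCLES.MISPREDICTS"] * 1 / clk
    retiring = counts["UOPS_RETIRED.ALL"] * 0.5 / clk
    # Back End Bound is whatever the other three categories do not explain.
    back_end = 1.0 - retiring - front_end - bad_spec
    return {
        "front_end_bound": front_end,
        "back_end_bound": back_end,
        "bad_speculation": bad_spec,
        "retiring": retiring,
    }


def primary_bottleneck(breakdown):
    """Pick the largest non-retiring category (retiring is useful work)."""
    stalls = {k: v for k, v in breakdown.items() if k != "retiring"}
    return max(stalls, key=stalls.get)


# Illustrative counts only (not measured data).
sample_counts = {
    "CPU_CLK_UNHALTED.CORE": 1_000_000,
    "NO_ALLOC_CYCLES.NOT_DELIVERED": 388_000,
    "NO_ALLOC_CYCLES.MISPREDICTS": 36_000,
    "UOPS_RETIRED.ALL": 270_000,
}

breakdown = top_down(sample_counts)
for name, fraction in breakdown.items():
    print(f"{name:16s} {fraction:6.1%}")
print("primary bottleneck:", primary_bottleneck(breakdown))
```

A compiler consuming such a breakdown could use the primary bottleneck to choose between conflicting optimizations.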

#### **PGO Example Basic Block Reordering**



%FWD\_TAKEN\_JCC = (FWD\_TAKEN\_JCC-FWD\_TAKEN\_JCC\_LESSTHAN\_10BYTES)\*100/ALL\_CONDITIONAL
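
As a small sketch, the metric above can be computed directly from three branch counts. The function name and the sample counts are hypothetical; only the formula comes from the slide.

```python
def pct_fwd_taken_jcc(fwd_taken_jcc, fwd_taken_jcc_lessthan_10bytes, all_conditional):
    """Percentage of conditional branches that are forward-taken and jump
    10 bytes or more (short forward jumps are subtracted out)."""
    if all_conditional == 0:
        return 0.0
    return (fwd_taken_jcc - fwd_taken_jcc_lessthan_10bytes) * 100 / all_conditional


# Hypothetical counts: 300 forward-taken conditional branches, 100 of them
# jumping fewer than 10 bytes, out of 1000 conditional branches in total.
example_pct = pct_fwd_taken_jcc(300, 100, 1000)
print(f"%FWD_TAKEN_JCC = {example_pct:.1f}")
```

A high value suggests that hot paths frequently jump over cold code, which is the situation basic block reordering is meant to improve.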


# LBR Already Gives Us Overall Statistics Allowing Prediction of Opportunity

Predicted using LBR

| Statistic                                | Value |
|------------------------------------------|-------|
| PotentialInstructionCacheSavedPercentage | 11.6% |
| BranchWith4kTraversalPercentage          | 36.3% |

| Statistic               | NoPGO | PGO  |
|-------------------------|-------|------|
| TotalBytesExecuted      | 69k   | 62k  |
| TotalCacheLinesExecuted | 1738  | 1373 |
| TotalCacheLinesBytes    | 109k  | 86k  |
| CacheLineEfficiency     | 64%   | 72%  |
| TotalPagesExecuted      | 182   | 93   |
| PageEfficiency          | 10%   | 17%  |
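
The footprint statistics above can be derived from the set of executed code bytes. The sketch below assumes 64-byte cache lines and 4 KiB pages, and assumes that efficiency means executed bytes divided by the bytes of every line (or page) touched. Those definitions are inferred from the table (for example, 1738 lines of 64 bytes is about 109k), not stated in the slides, and the input ranges are hypothetical.

```python
CACHE_LINE = 64   # bytes, assumed line size
PAGE = 4096       # bytes, assumed page size


def footprint(ranges):
    """ranges: iterable of half-open (start, end) address ranges of executed bytes."""
    executed = set()
    for start, end in ranges:
        executed.update(range(start, end))
    lines = {addr // CACHE_LINE for addr in executed}
    pages = {addr // PAGE for addr in executed}
    total = len(executed)
    line_bytes = len(lines) * CACHE_LINE
    page_bytes = len(pages) * PAGE
    return {
        "TotalBytesExecuted": total,
        "TotalCacheLinesExecuted": len(lines),
        "TotalCacheLinesBytes": line_bytes,
        "CacheLineEfficiency": total / line_bytes if line_bytes else 0.0,
        "TotalPagesExecuted": len(pages),
        "PageEfficiency": total / page_bytes if page_bytes else 0.0,
    }


# Hypothetical executed ranges, e.g. basic blocks reconstructed from LBR samples.
stats = footprint([(0, 32), (64, 80), (4096, 4128)])
for key, value in stats.items():
    print(key, value)
```

Reordering hot blocks next to each other shrinks the number of lines and pages touched while the executed byte count stays roughly constant, which is why both efficiencies rise in the PGO column.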

| Statistic            | No PGO | PGO | No PGO/PGO |
|----------------------|--------|-----|------------|
| Utilization          | 39%    | 33% | 1.18       |
| Front End Bound Cost | 43%    | 32% |            |



# Taking Profile Guided Optimizations to Next Level

- Utilize all of performance monitoring capabilities for PGO
- Code reorganization (Already being stressed)
  - Basic block + Function reordering, Function splitting, Inlining/partial inlining
- Data profiling
  - Data structure + Data section reordering + False sharing avoidance
  - Function parameters
  - Loop pointer aliasing
  - Intelligent allocators
- Drive optimizations based on where bound in the pipeline
  - Often optimizations conflict
    - Example = "optimize for speed" and "optimize for size"
  - Loop vectorization
  - Fixing individual code generation issues



### Top Down Helps Determine Usage of Compiler Workaround for Slow LEA (LLVM Compiler)

| Issue Type | Assembly                    |
|------------|-----------------------------|
| SLOW_LEA   | lea rax,ptr [r9+rax*1-fff1] |

| Statistics                             | SlowLEA | SlowLEA Patch | SlowLEA/SlowLEAPatch |
|----------------------------------------|---------|---------------|----------------------|
| Benchmark Cycles Per Instruction (CPI) | 0.60    | 0.59          | 1.03                 |
| Benchmark Front End Bound Cost         | 9.4%    | 10.2%         | 0.92                 |
| Benchmark Core Bound                   | 22.1%   | 17.2%         | 1.28                 |
| Benchmark Slow LEA                     | 5.7%    | 2.4%          | 2.38                 |

- Front end bottleneck increases
- Core bound cost due to slow LEA decreases
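
One way to picture the decision is a simple rule that applies the workaround only when the slow LEA cost is significant and the workload is not already front end bound, since the patch trades core bound cost for front end cost. The sketch below is purely illustrative: the function name and both thresholds are hypothetical and are not taken from the slides or from LLVM.

```python
def should_apply_slow_lea_patch(front_end_bound, slow_lea_cost,
                                max_front_end_bound=0.30, min_slow_lea_cost=0.02):
    """Hypothetical heuristic: patch slow LEAs only when they cost enough
    and the added front end pressure is unlikely to dominate.

    All arguments are fractions of total cycles; thresholds are assumptions.
    """
    if slow_lea_cost < min_slow_lea_cost:
        return False  # too little to gain
    if front_end_bound > max_front_end_bound:
        return False  # extra instructions would hurt an already bound front end
    return True


# Benchmark from the table: 9.4% front end bound, 5.7% slow LEA cost.
decision_benchmark = should_apply_slow_lea_patch(0.094, 0.057)
# A workload that is 38.8% front end bound with the same slow LEA cost.
decision_front_end_heavy = should_apply_slow_lea_patch(0.388, 0.057)
print(decision_benchmark, decision_front_end_heavy)
```

In practice the thresholds would have to be tuned per microarchitecture and validated against measured CPI.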

# How Can Performance Monitoring PGO Help Optimize a Loop?

- Picked a couple of example loops from benchmarks to create proofs of concept
- Loops were unique in that we could force them to auto-vectorize with pragmas
  - Gave us 2.6% speedup on the benchmark (on ICC or LLVM)
- What information could performance monitoring provide for PGO?
  - % cost of loop within process
    - Determines how aggressively to attempt vectorization
  - Average trip count of loop
  - Typical values in the loop
    - Example: a shift value in the loop is always zero
  - Pointer aliasing and data alignment
  - Total time in all vectorizable loops in the process
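
As an example of one item in this list, the average trip count can be estimated from the outcomes of a loop's back-edge branch. This is a minimal sketch that assumes a bottom-tested loop and a complete, hypothetical trace of taken/not-taken outcomes. Real LBR data records only taken branches, so exits would have to be inferred from the surrounding records.

```python
def average_trip_count(back_edge_outcomes):
    """back_edge_outcomes: iterable of booleans for one loop's back-edge branch,
    True when the branch was taken (the loop continues), False on loop exit.

    Returns iterations per loop entry, or None if no exit was observed.
    """
    iterations = 0
    exits = 0
    for taken in back_edge_outcomes:
        iterations += 1      # each back-edge execution ends one iteration
        if not taken:
            exits += 1       # a not-taken back edge leaves the loop
    if exits == 0:
        return None
    return iterations / exits


# Hypothetical trace: one loop entry running 4 iterations, another running 2.
trace = [True, True, True, False, True, False]
avg_trips = average_trip_count(trace)
print("average trip count:", avg_trips)
```

A low average trip count is a hint that wide vectorization may not pay for its setup and remainder-loop overhead.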

#### **Choosing Which Level of Vectorization to Utilize**




### Top Down and Data Reordering



#### Conclusions

- Today, Profile Guided Optimizations (PGO) mostly impact the code/text section
  - Easier than impacting other vectors
- Next generation of PGO will utilize more events and capabilities
  - Determine where the instruction pipeline is bound
  - Address the appropriate bottleneck
  - Currently taking advantage of a small portion of opportunity
- Started an effort to tackle this opportunity
  - Covered uarch optimization, loop optimizations and data reorganization



