# On the analysis of virtual platform generated traces

#### Frédéric Pétrot and Marcos Cunha

Laboratoire TIMA / System Level Synthesis Group Université Grenoble Alpes







# Outline

- Context
- Traces
- Analysis
- Experimentations
- Conclusion












# New architectures, new challenges

- "The processor is the NAND gate of the future", dixit Chris Rowen
- Not quite there yet, but getting close, ...



Tilera (Tile-Mx)

- 100 processors on a chip
- 64-bit ARM
- Chipwide hardware cache coherency
- Power Consumption  $\approx 100 \text{ W}$

Kalray (MPPA - Bostan)

- 256 processors on chip (16 clusters of 16 PE)
- 64-bit 3-issue VLIW
- Caches but no hardware cache coherency at all
- Power Consumption  $\approx$  25 W




# Some SW bugs in *Multi/Many core* architectures

## Architecture specific bugs

- Hardware/software integration mismatches
- Poor understanding or incorrect use of specific mechanisms by the application/OS developer
- Typical example: access to a cached variable that may have been modified

## Functional bugs

- Due (mainly) to parallel execution
- Potentially sporadic, since they depend on the execution order (non-determinism)


# Trace based debug and analysis






# Traces

- Set of events giving a view of the system behavior
- Usually generated per component
- Additional relations necessary

| Event ID | Component ID | Type of Event | Cycle Number | Data            |
|----------|--------------|---------------|--------------|-----------------|
| 1        | CPU_1        | INSTRUCTION   | 1235678      | PC=0x000000A0   |
| 2        | CPU_2        | INSTRUCTION   | 1235679      | PC=0x000000B0   |
| 3        | MEMORY_1     | READ          | 1235680      | ADDR=0xDEADBEEF |
| 4        | MEMORY_1     | READ          | 1235781      | ADDR=0xDEADBEEF |
|          |              |               |              |                 |
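As an illustration only, the rows of the table above can be held in a small record type and grouped per component. The names `TraceEvent`, `EVENTS` and `by_component` are hypothetical and do not come from the presented tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TraceEvent:
    """One row of the trace table (field names are illustrative)."""
    event_id: int
    component: str
    kind: str       # "INSTRUCTION", "READ", ...
    cycle: int
    data: str


# The four rows of the table above.
EVENTS = [
    TraceEvent(1, "CPU_1", "INSTRUCTION", 1235678, "PC=0x000000A0"),
    TraceEvent(2, "CPU_2", "INSTRUCTION", 1235679, "PC=0x000000B0"),
    TraceEvent(3, "MEMORY_1", "READ", 1235680, "ADDR=0xDEADBEEF"),
    TraceEvent(4, "MEMORY_1", "READ", 1235781, "ADDR=0xDEADBEEF"),
]


def by_component(events: List[TraceEvent]) -> Dict[str, List[TraceEvent]]:
    """Group events per component, keeping their original order."""
    groups: Dict[str, List[TraceEvent]] = defaultdict(list)
    for event in events:
        groups[event.component].append(event)
    return dict(groups)


per_component = by_component(EVENTS)
```

Grouping alone says nothing about how the two `MEMORY_1` reads relate to the CPU instructions, which is why additional relations between events are needed.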




# Trace definition

- Goal:
  - Capture traces that represent the behavior of the parallel system
- Definition: $T = (E, <, \leftarrow, <)$
  - $T$: trace
  - $E$: events
  - Relations:
    - $<$: strict total order of events within a given component
    - $\leftarrow$: causality between events belonging to different components
    - $<$: system total order based on a *shared* component
- Important feature:
  - No timestamping
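A minimal sketch of how such a trace could be held in memory, assuming events are identified by short strings. The class `Trace`, its field names, and the convention that a causality pair is stored as (cause, effect) are assumptions made for this example, not part of the definition above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Trace:
    # E: the set of event identifiers
    events: Set[str] = field(default_factory=set)
    # Per-component strict total order: component -> events in order
    component_order: Dict[str, List[str]] = field(default_factory=dict)
    # Causality between events of different components, stored here as
    # (cause, effect) pairs (an assumed convention)
    causality: Set[Tuple[str, str]] = field(default_factory=set)
    # System total order: events in the order seen by the shared component
    system_order: List[str] = field(default_factory=list)

    def before_in_component(self, component: str, a: str, b: str) -> bool:
        seq = self.component_order[component]
        return seq.index(a) < seq.index(b)

    def before_in_system(self, a: str, b: str) -> bool:
        return self.system_order.index(a) < self.system_order.index(b)


# Two processors whose accesses reach one shared memory.
trace = Trace(
    events={"L1", "S2", "A1", "A2"},
    component_order={"CPU_1": ["L1"], "CPU_2": ["S2"], "MEMORY": ["A1", "A2"]},
    causality={("L1", "A1"), ("S2", "A2")},
    system_order=["A1", "A2"],
)
```

Every ordering query is answered from positions in lists, never from a cycle count, which matches the "no timestamping" feature.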

# Trace representation

[Figure: trace representation along a time axis]

# Forward and Rewind operations

- Goal
  - Assign a total order to causality chains that do not involve the shared component
- Operations
  - Rewind ($\ll$): assign the total order of the previous shared event
  - Forward ($\gg$): assign the total order of the next shared event
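The two operations can be sketched on a single chain of events, assuming each event carries either the total-order index given by the shared component or `None` when it never reached that component. The function names and this list encoding are hypothetical.

```python
from typing import List, Optional


def rewind(orders: List[Optional[int]]) -> List[Optional[int]]:
    """Give each event the order of the closest shared event at or before it."""
    result: List[Optional[int]] = []
    last: Optional[int] = None
    for order in orders:
        if order is not None:
            last = order
        result.append(last)
    return result


def forward(orders: List[Optional[int]]) -> List[Optional[int]]:
    """Give each event the order of the closest shared event at or after it."""
    # Same scan as rewind, run from the end of the chain.
    return rewind(orders[::-1])[::-1]


chain = [None, 3, None, None, 7, None]
rewound = rewind(chain)      # [None, 3, 3, 3, 7, 7]
forwarded = forward(chain)   # [3, 3, 7, 7, 7, None]
```

Events before the first shared event stay unordered under rewind, and events after the last one stay unordered under forward, so the two assignments can disagree.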




# Trace based cache-coherence analysis

- Goal:
  - Detect cache coherence issues in SW cache coherence protocols
- Example:
  - Write-through
    - Reads: from cache or memory
    - Writes: always into memory, and also into the cache on a hit
- Formalize the problem as a graph analysis per memory block
- Express a set of rules that check for violations
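To make the write-through behavior concrete, here is a toy model with private caches and no hardware coherence. The class `WriteThroughSystem` and its `invalidate` method (standing in for whatever action a software coherence protocol would take) are assumptions for illustration only.

```python
from typing import Dict, List


class WriteThroughSystem:
    """Toy model: one memory, one private write-through cache per processor."""

    def __init__(self, n_procs: int) -> None:
        self.memory: Dict[int, int] = {}
        self.caches: List[Dict[int, int]] = [dict() for _ in range(n_procs)]

    def read(self, proc: int, addr: int) -> int:
        cache = self.caches[proc]
        if addr not in cache:                  # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]                     # hit: served by the cache

    def write(self, proc: int, addr: int, value: int) -> None:
        self.memory[addr] = value              # always into memory
        if addr in self.caches[proc]:          # also into the cache on a hit
            self.caches[proc][addr] = value

    def invalidate(self, proc: int, addr: int) -> None:
        """Hypothetical software coherence action: drop the cached copy."""
        self.caches[proc].pop(addr, None)


system = WriteThroughSystem(n_procs=2)
system.write(0, 0x10, 1)
first = system.read(1, 0x10)    # miss, loads 1 from memory
system.write(0, 0x10, 2)        # memory updated, cache of processor 1 untouched
stale = system.read(1, 0x10)    # hit, still returns the old value
system.invalidate(1, 0x10)
fresh = system.read(1, 0x10)    # miss again, returns the new value
```

The second read by processor 1 is the kind of access the analysis is meant to flag: it is served by the cache after another processor has written the block.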

# Write-through rule

- Checks whether a cache accesses stale ("decayed") data
- Example:
  - 3 processors (id = 1,2,3)
  - 1 memory (id = 4)

1. $S_i < L_j$
2. $\nexists S_k$ such that $S_i < S_k < L_j$ with $k \in C$
3. $\nexists L_j^*$ such that $S_i < L_j^* < L_j$ with $L_j^* \neq L_j$
4. $\nexists L_j$ such that $L_j \leftarrow A_l$
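The four conditions can be checked directly on the ordered events of one memory block. This sketch relies on several readings that are assumptions: $S$ is a store, $L$ a load, $A$ a memory access, the fourth condition means "the load caused no memory access", $C$ is the set of processors, and a processor is not checked against its own store. The names `Access`, `violates` and `find_violations` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Access:
    kind: str             # "S" (store) or "L" (load)
    proc: int             # processor id
    reached_memory: bool  # True if the access caused a memory access


def violates(events: List[Access], i: int, j: int) -> bool:
    """Check the four conditions for the pair (events[i], events[j])."""
    store, load = events[i], events[j]
    # Condition 1: a store ordered before a load.
    if not (store.kind == "S" and load.kind == "L" and i < j):
        return False
    # Assumption: a processor's own store cannot make its own cache stale.
    if store.proc == load.proc:
        return False
    between = events[i + 1:j]
    # Condition 2: no other store between the store and the load.
    if any(e.kind == "S" for e in between):
        return False
    # Condition 3: no other load by the same processor in between.
    if any(e.kind == "L" and e.proc == load.proc for e in between):
        return False
    # Condition 4 (assumed reading): the load did not reach memory.
    return not load.reached_memory


def find_violations(events: List[Access]) -> List[Tuple[int, int]]:
    return [(i, j) for j in range(len(events)) for i in range(j)
            if violates(events, i, j)]


# Processor 3 loads (miss), processor 1 stores, processor 3 loads from its cache.
block_events = [
    Access("L", 3, True),
    Access("S", 1, True),
    Access("L", 3, False),
]
violations = find_violations(block_events)   # [(1, 2)]
```

This version favors a one-to-one match with the conditions over speed; it tests every pair of events and is therefore quadratic in the length of the block's event list.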

$$\begin{array}{c} \textbf{L}_{3.1} \rightarrow \textbf{S}_{1.2} \rightarrow \textbf{L}_{1.2} \\ \textbf{A}_{4.1} \quad \textbf{A}_{4.2} \end{array}$$

Violation detected!




# False positives

- Cause: total-order assignments made by rewind ($\ll$) and forward ($\gg$)
- Removal
  - If a problem is detected, apply the opposite operation
  - If the order is identical, the problem is confirmed
  - Otherwise, we don't know
- Limitation
  - False positives remain possible
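One possible reading of the removal step, as a sketch: a violation reported under one order assignment is re-examined under the opposite assignment, and it is confirmed only if the two events keep the same relative order. The function `classify` and the dictionary encoding of the orders are assumptions.

```python
from typing import Dict


def classify(order_used: Dict[str, int],
             order_opposite: Dict[str, int],
             store: str, load: str) -> str:
    """Re-check a reported (store, load) violation under the opposite operation."""
    before_used = order_used[store] < order_used[load]
    before_opposite = order_opposite[store] < order_opposite[load]
    # Same relative order under both assignments: the problem is confirmed.
    return "confirmed" if before_used == before_opposite else "unknown"


# Order indices assigned to the same two events by each operation.
with_rewind = {"S": 3, "L": 5}
with_forward_same = {"S": 3, "L": 7}
with_forward_flipped = {"S": 3, "L": 2}

confirmed = classify(with_rewind, with_forward_same, "S", "L")
undecided = classify(with_rewind, with_forward_flipped, "S", "L")
```

The "unknown" outcome is where false positives can survive, since neither assignment is known to reflect the real execution order.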






# Virtual prototype

- Hardware
  - Rabbits simulator with enhanced trace capture
  - Processors : up to 16 Cortex-A9
- Software
  - Parallel MJPEG/Splash-2 (all pthread based)



# Trace generation

Simulation slowdown due to trace generation



# Trace analysis

Detect and correct cache coherence violations










# Cache coherence analysis: Violations

- Number of violations detected and corrected per program (initial hypothesis: hardware coherent shared memory)

- Analysis time is $O(k \cdot |E|)$ with $k \ll |E|$
- Analysis is done online: peak memory usage is limited
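To illustrate what an online analysis with bounded memory can look like, the sketch below consumes events one at a time and keeps only a small state per memory block. It applies a simplified stale-load check (a load served by the cache after another processor's store), which is an assumption and not necessarily the analysis measured here; the meaning of $k$ is not modeled. The name `stream_violations` and the event tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# One event: (block address, "S" or "L", processor id, reached memory?)
Event = Tuple[int, str, int, bool]


def stream_violations(events: Iterable[Event]) -> Iterator[Tuple[int, int, int]]:
    """Yield (block, storing processor, loading processor) for each stale load."""
    last_store: Dict[int, int] = {}          # block -> processor of latest store
    loaded_since: Dict[int, Set[int]] = {}   # block -> processors that loaded since
    for block, kind, proc, reached_memory in events:
        if kind == "S":
            last_store[block] = proc
            loaded_since[block] = set()
        elif block in last_store:
            first_load = proc not in loaded_since[block]
            if proc != last_store[block] and first_load and not reached_memory:
                yield (block, last_store[block], proc)
            loaded_since[block].add(proc)


stream = iter([
    (0x40, "L", 3, True),
    (0x40, "S", 1, True),
    (0x80, "L", 2, True),
    (0x40, "L", 3, False),   # served by the cache after the store: reported
])
found = list(stream_violations(stream))
```

Each event is handled once and then discarded, so the state grows with the number of distinct blocks and processors rather than with the number of events.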



# Conclusion

Virtual platform (VP) produced execution traces:

- Require lots of resources
  - Take time to generate
  - Need huge disk space to store
  - Need time and memory to analyse
- But are very useful
  - Provide traces with relations between events
  - Greatly simplify analysis: the NP-hard consistency model violation problem becomes linear with read/write mapping
  - Permit online analysis for some problems