# High End MPSOC: The Personal Super Computer

MPSOC 2007 Conference in "Yumebutai", Awaji Island, Hyogo, Japan, 25 - 29 June 2007

Tryggve Fossum, CPU Architect, Intel

#### **Disclaimer**

THIS REPORT IS PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT OR BY THE SALE OF INTEL PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.

Intel may make changes to specifications and product descriptions at any time, without notice. This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information. Intel retains the right to make changes to its test specifications at any time, without notice.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, call (U.S.) 1-800-628-8686 or 1-916-356-3104.

Data has been simulated and is provided for informational purposes only. Data was derived using simulations run on an architecture simulator. Any difference in system hardware or software design or configuration may affect actual performance.

Pentium® and Xeon™ are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. \*Other names and brands may be claimed as the property of others.

Copyright © 2007, Intel Corporation

## **Agenda**

- Motivation for Chip Level Multiprocessing (CMP)
- Success: Moore's Law
- Challenges:
  - Processor Core Design
  - Memory Access
  - Cache Behavior
  - Applications
  - Reliability
  - Power

#### Single Stream, Moore's Law, and CMP

CMP Performance: Performance $\sim k \times \text{Transistor Count}$, as long as there are no Uncore limitations!
Performance Gap driving us to Multi-core

Single Stream Performance: Perf $\sim \sqrt{\text{Transistor Count}}$

- Historically fairly accurate
- Relates Moore's Law to Performance: $\sqrt{2T} \times \sqrt{2} = 2\sqrt{T}$
- Transistor speedup due to technology shrink: 0.7 (delay scales by about 0.7 per generation, a speedup of about $1/0.7 \approx \sqrt{2}$, which is the second factor in the product above)

[Figure: plot over Transistor Count, annotated "Interesting Core Design Area: Slopes are similar"]

## **Teraflops Research Chip**

100 Million Transistors ● 80 Tiles ● 275 mm<sup>2</sup>

#### First tera-scale programmable silicon

- Teraflops performance
- Tile design approach
- On-die mesh network
- Novel clocking
- Power-aware capability
- Supports 3D-memory

Not designed for IA or product

#### **Tiled Design & Mesh Network**

**Repeated Tile Method:**

- Compute + router
- Modular, scalable
- Small design teams
- Short design cycle

**Mesh Interconnect:**

- "Network-on-a-Chip"
- Cores networked in a grid allow very high bandwidth communication within and between cores
- 5-port, 80 GB/s\* routers
- Low latency (1.25 ns\*)
- Future: connect IA and/or special purpose cores

[Figure: one tile (compute element plus router) with links to its north, east, and west neighbors and to future stacked memory]

\* When operating at a nominal speed of 4 GHz

## **Fine Grain Power Management**

- Novel, modular clocking scheme saves power over global clock
- New instructions to make any core sleep or wake as apps demand
- Chip voltage & freq. control (0.7-1.3 V, 0-5.8 GHz)

Power saved per tile block when sleeping:

- Data Memory: 57% less power
- Instruction Memory: 56% less power
- FP Engine 1: 90% less power
- FP Engine 2: 90% less power
- Router: 10% less power (stays on to pass traffic)

21 sleep regions per tile (not all shown)

Industry leading energy-efficiency of 16 Gigaflops/Watt

# **Multi Core System Benefits**

- Performance scaling:
  - On die interconnect:
    - Higher Bandwidth: TB/sec vs. GB/sec
    - Shorter Latency: nanoseconds vs. 100 ns
  - Fast Communication with Shared cache
  - Better cache Hit rate
  - Fast Synchronization: Locks and Barriers
  - Reduced false sharing
  - Memory
- Simplifies System Design
- Reduces NUMA effects
- Simplifies performance tuning
- Simplifies application development
- Enables Fine Grained Parallelism

On-die performance can grow almost linearly with core count!

#### **Parallel Bioinformatics Workloads**

- Structure Learning:
  - GeneNet: Hill Climbing, Bayesian network learning
  - SNP: Hill Climbing, Bayesian network learning
  - SEMPHY: Structural Expectation Maximization algorithm
- Optimization:
  - PLSA: Dynamic Programming
- Recognition:
  - SVM-RFE: Feature Selection
- OpenMP workloads developed by Intel Corporation
  - Now part of Northwestern University, NU-MineBench Suite
  - http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html
  - Also made available at: http://www.ece.umd.edu/biobench/

From: [Jaleel, Mattina,...HPCA 2006]

#### **Control CMP Activity for Power**

- Intel can now pack more transistors on a die than it can reasonably power and cool at max voltage & frequency
  - Recall: Dynamic Power = $V_{DD}^2 \times \text{Cap} \times \text{freq}$
  - Traditional Methods: Voltage scaling, Clock Gating
- Wide variance between worst case and typical demands on power supply and cooling system
  - Max current flow
  - di/dt swings at several frequencies
  - Total power dissipation
- Goal: Maximize performance, accounting for physical constraints
  - Controlling activity limits di/dt, max current draw and temp.
  - Take advantage of the activity constraints to run at a higher freq.

#### **Energy Per Instruction (EPI)**

EPI looking better for Intel Architecture (IA). Just in time!

(From study by Ed Grochowski, Intel, MTL. IDF Spring 2006 White Paper ftp://download.intel.com/technology/EEP/epi-trends.pdf)

Energy Efficiency is key to CMP scaling

## **Summary**

- Silicon integration continues to be a driving force
- Multi Core is an exciting opportunity to increase performance and simplify system design
  - Great on-die scaling, high bandwidth, short latency, power and area efficiency
- Will stress chip resources
  - Don't over-subscribe available power and bandwidth
- Intel is working to:
  - Design balanced Multi Core Processor Chips
    - Power, Die area, Bandwidth, Caches
  - Analyze core sizes and functionality for different markets
  - Help solve the software scaling problems
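
The scaling rules on the "Single Stream, Moore's Law, and CMP" slide and the dynamic power relation on the "Control CMP Activity for Power" slide can be tried out numerically. The following is a minimal sketch and not part of the original talk. The function names are hypothetical, the capacitance and frequency values are placeholders chosen only for illustration, and the CMP rule assumes no uncore limitations, as the slide states.

```python
import math


def single_stream_speedup(transistor_ratio: float) -> float:
    """Slide rule of thumb: single-stream perf grows as sqrt(transistor count)."""
    return math.sqrt(transistor_ratio)


def cmp_speedup(transistor_ratio: float) -> float:
    """Slide rule of thumb: CMP perf grows as k * transistor count.

    The constant k cancels in a ratio. Uncore limits are ignored (assumption).
    """
    return transistor_ratio


def generation_speedup(delay_scale: float = 0.7) -> float:
    """Single-stream gain for one process generation.

    Doubling the transistor count contributes sqrt(2); faster transistors
    contribute 1 / delay_scale, which is close to sqrt(2) for the slide's 0.7.
    """
    return single_stream_speedup(2.0) / delay_scale


def dynamic_power(vdd: float, cap: float, freq: float) -> float:
    """Dynamic power = VDD^2 * Cap * freq, as recalled on the power slide."""
    return vdd ** 2 * cap * freq


if __name__ == "__main__":
    # Compare the two rules of thumb as the transistor budget grows.
    for ratio in (1, 2, 4, 8, 16):
        print(
            f"{ratio:>2}x transistors: "
            f"single-stream {single_stream_speedup(ratio):.2f}x, "
            f"CMP {cmp_speedup(ratio):.2f}x"
        )
    print(f"per-generation single-stream gain: {generation_speedup():.2f}x")

    # Placeholder capacitance and frequency (assumed); only the ratio matters.
    cap, freq = 1e-9, 3e9
    ratio = dynamic_power(0.7, cap, freq) / dynamic_power(1.3, cap, freq)
    print(f"dynamic power at 0.7 V relative to 1.3 V: {ratio:.2f}")
```

The voltage endpoints are the 0.7-1.3 V range quoted for the research chip. Holding frequency fixed while lowering voltage is a simplification, since in practice the two are scaled together.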