
Bill McColl
Huawei Zurich Research Center, Switzerland
Supermeshes: A New Architecture for AI Datacenters
Abstract
Supermeshes have communication as the central element, rather than computation. This reflects scalable AI today, where most apps are data-centric, memory-bound and communication-bound, and require repeated rounds of massive data exchanges. We need powerful new networks and routing methods that can address this challenge, providing balanced architectures with the extremely low latency and high bandwidth required for optimal utilization of the costly compute nodes and memory systems.
Most networks for parallel computing have historically been uniform, and have assumed that the cost and speed of links were independent of distance. However, in designing AI architectures today, from individual SoCs and chiplets, to wafers, to personal AI devices, to servers, to racks, to large clusters, to massive AI compute factories and clouds, the links may differ by many orders of magnitude in cost, speed and distance. Supermesh architectures are hierarchical: they are dense at short reach, but can be much more sparse at long reach. They are also virtualizable and decomposable into sub-Supermeshes, with full network isolation between them, ensuring zero congestion between apps, for total predictability, and enabling ultra-flexible optimized sharing of resources.
Biography
Bill McColl is the Director of the Computing Systems Lab at Huawei’s Zurich Research Center, where he leads research on architecture, software, AI and algorithms. He is also a Fellow of Wadham College, Oxford University. Previously he was Professor of Computer Science, Head of Research in Parallel Computing, and Chairman of the Faculty of Computer Science at Oxford. He established and led Oxford Parallel, a major center for research on industrial and business applications of HPC at the university. Much of his previous research was focused on the Bulk Synchronous Parallel (BSP) approach to parallel architecture, software and algorithms. BSP is now used throughout industry for massively parallel HPC, graph computing, machine learning and other areas of AI. His current research is focused on new architectures for AI datacenters.