Takahide Yoshikawa
Fujitsu Ltd., Japan
How do we debug and verify errors on 100,000-node systems? Insights through our supercomputer system projects
Abstract
Fujitsu has developed the world’s fastest supercomputer systems, including the K computer in 2012 and the supercomputer Fugaku in 2020. Such systems are very large and complex, with about 100,000 nodes containing over 7 million cores. If a hardware or software bug causes an error after such a large-scale system is put into operation, identifying and correcting the root cause is very costly and time-consuming. For example, in a system with 100,000 nodes, just 1 MB of trace data per node amounts to about 100 GB of data, which is difficult even to analyze. To prevent this, it is essential to guarantee the operation of the large system within smaller verification environments (such as one module, one node, a few nodes, or a few shelves) using various verification technologies (such as verification support functions in the CPU, firmware, and OS, as well as dedicated test equipment for post-silicon validation). In this presentation, I will introduce various technologies for guaranteeing the stable operation of large-scale systems, based on our experience with the Fugaku project.
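The trace-volume estimate in the abstract can be checked with a short calculation. The sketch below is illustrative only: it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and a node count of exactly 100,000, and the function name `total_trace_bytes` is hypothetical, not part of any Fujitsu tool.

```python
MB = 10**6  # assumption: decimal megabyte
GB = 10**9  # assumption: decimal gigabyte


def total_trace_bytes(nodes: int, bytes_per_node: int) -> int:
    """Total trace volume if every node emits the same amount of data."""
    return nodes * bytes_per_node


nodes = 100_000  # round figure taken from the abstract
for per_node_mb in (1, 10, 100):
    total = total_trace_bytes(nodes, per_node_mb * MB)
    print(f"{per_node_mb} MB per node -> {total / GB:.0f} GB in total")
```

Under these assumptions, 1 MB per node already gives 100 GB, and the total grows linearly with the per-node trace size, which is why catching bugs in smaller verification environments matters.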
Biography
Takahide Yoshikawa is a Project Director of the Next Architecture Project, Fujitsu Research, at Fujitsu Ltd. He received his B.E., M.E., and Ph.D. degrees from the University of Tokyo in 1994, 1996, and 2002, respectively, and he is a Senior Member of IEEE. He has been involved in various server system projects, such as the K computer and Fugaku. In the K computer project, he proposed and implemented the entire verification, validation, and test system for its interconnect, Tofu. In the Fugaku project, he led the verification and validation of the CPU. He is currently researching architectures for future high-performance computing systems.
Download
takahide-yoshikawa.pdf