Andreas Herkersdorf
TU Munich, Germany
Near Memory Accelerators for Efficient Inter-Tile Communication in Distributed-Shared-Memory Architectures
Abstract
Data access latencies and interconnect bandwidth bottlenecks frequently represent major limiting factors for the computational effectiveness of many-core processor architectures. Processing in memory (PIM) and near memory accelerators (NMA) either integrate processing functions into the memory arrays or position processing resources for specific data manipulations as close as possible to the data memory. The benefits are higher processing-to-data locality, shorter access latencies and, in consequence, improved computational efficiency. Most PIM and NMA approaches reported in the literature are applied to shared-memory architectures. This presentation introduces two near memory accelerators for efficient inter-task communication and graph data structure copy operations, enabling scalable application processing in distributed-shared-memory many-core platforms. Although physically distributed memories and processing nodes reduce access hot-spots and latencies, the data-to-task locality issue remains when applications are spread across multiple compute tiles. These architectures need efficient mechanisms for inter-tile communication (ITC), thread synchronization and data transport. Common communication patterns of parallel applications, libraries, and operating systems require the transfer of arbitrary data to remote tiles for subsequent processing. We propose a software-defined, hardware-managed queue concept that enables efficient, low-latency inter-tile communication by facilitating multi-producer multi-consumer queues with arbitrarily sized and structured application-specific queue elements. Queues are maintained in tile-local SRAM memories and managed by an NMA unit referred to as the queue handler. The queue handler initiates intra- and inter-tile DMA transfers and takes care of memory and queue management. Queues can be flexibly and dynamically created at runtime by software (software-defined).
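The queue concept can be illustrated with a small software model. The sketch below is not the queue handler hardware or its interface; it is a minimal Python stand-in, assuming a fixed-size byte buffer in place of tile-local SRAM, a length prefix per element to allow arbitrarily sized elements, and a lock in place of the serialization the hardware unit would provide to multiple producers and consumers. The class name `QueueHandlerModel` and all method names are hypothetical.

```python
import struct
import threading


class QueueHandlerModel:
    """Software stand-in for a queue held in a fixed-size tile-local buffer.

    All names are hypothetical. The lock models the serialization that a
    hardware queue handler would provide to multiple producers and consumers.
    """

    HEADER = struct.Struct("<I")  # 4-byte length prefix stored before each element

    def __init__(self, capacity_bytes):
        self.buf = bytearray(capacity_bytes)  # models the SRAM region of the queue
        self.capacity = capacity_bytes
        self.head = 0   # read offset
        self.tail = 0   # write offset
        self.used = 0   # bytes currently occupied
        self.lock = threading.Lock()

    def _write(self, data):
        # Byte-wise copy with wrap-around; stands in for a DMA transfer.
        for b in data:
            self.buf[self.tail] = b
            self.tail = (self.tail + 1) % self.capacity
        self.used += len(data)

    def _read(self, n):
        out = bytearray(n)
        for i in range(n):
            out[i] = self.buf[self.head]
            self.head = (self.head + 1) % self.capacity
        self.used -= n
        return bytes(out)

    def enqueue(self, payload):
        """Copy one element into the queue; return False if it does not fit."""
        need = self.HEADER.size + len(payload)
        with self.lock:
            if need > self.capacity - self.used:
                return False
            self._write(self.HEADER.pack(len(payload)))
            self._write(payload)
            return True

    def dequeue(self):
        """Return the oldest element, or None if the queue is empty."""
        with self.lock:
            if self.used == 0:
                return None
            (length,) = self.HEADER.unpack(self._read(self.HEADER.size))
            return self._read(length)
```

In this model, creating a queue at runtime corresponds to constructing a `QueueHandlerModel` with the desired capacity; producers call `enqueue` with byte strings of any length that fits, and consumers call `dequeue` to receive elements in FIFO order.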
As an example use case, we integrated the concept into the MPI library. The evaluation with NAS benchmarks shows a reduction in execution time by up to 48% for the communication-intensive IS kernel in a 4x4 tile design on an FPGA platform with a total of 80 LEON3 cores. We further developed an NMA unit that takes care of copying arbitrary graph data structures to enable an alternative form of efficient inter-tile and inter-thread communication. The CPUs are relieved of the costly and memory-intensive graph transfer and address pointer transformation operations by outsourcing them to the graph-copy NMA unit, thus saving CPU time, NoC bandwidth and power consumption. The integration into system software benefits a wide range of applications.
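The work that the graph-copy unit takes off the CPUs, transferring a graph and transforming its address pointers, can be sketched in software. The example below is only an illustration under stated assumptions, not the accelerator's algorithm: memories are modeled as Python dictionaries mapping an address to a `(payload, pointers)` pair, the destination allocator hands out one consecutive address per node, and the function name `copy_graph` and its parameters are hypothetical.

```python
from collections import deque


def copy_graph(src_mem, root_addr, dst_mem, dst_base):
    """Copy the graph reachable from root_addr into dst_mem.

    src_mem and dst_mem map an address to (payload, [pointer addresses]).
    The address scheme, the one-slot-per-node allocator and all names are
    illustrative assumptions. Returns the root's address in dst_mem.
    """
    addr_map = {root_addr: dst_base}  # source address -> destination address
    next_free = dst_base + 1
    work = deque([root_addr])
    order = []

    # Pass 1: discover reachable nodes and assign destination addresses.
    while work:
        addr = work.popleft()
        order.append(addr)
        _, pointers = src_mem[addr]
        for p in pointers:
            if p not in addr_map:  # visit each node once, so cycles terminate
                addr_map[p] = next_free
                next_free += 1
                work.append(p)

    # Pass 2: write each node with its pointers translated to the new addresses.
    for addr in order:
        payload, pointers = src_mem[addr]
        dst_mem[addr_map[addr]] = (payload, [addr_map[p] for p in pointers])

    return addr_map[root_addr]
```

The address map is what makes the copy usable on the remote side: every pointer stored in the destination refers to a destination address, so the receiving thread can traverse the graph without touching the source tile's memory.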
Biography
Andreas Herkersdorf is a professor in the Department of Electrical and Computer Engineering and also adjunct to the Department of Informatics at Technical University of Munich (TUM). He received the Dipl.-Ing. degree from TUM in 1987 and the Dr. degree from ETH Zurich, Switzerland, in 1991, both in electrical engineering. Between 1988 and 2003, he held technical and management positions with the IBM Research Laboratory in Rueschlikon, Switzerland. Since 2003, Dr. Herkersdorf has been director of the Chair for Integrated Systems at TUM. He is a senior member of the IEEE, a member of the DFG (German Research Foundation) Review Board, and serves as editor for Springer and Elsevier journals for design automation and communications electronics. His research interests include application-specific multi-processor architectures, IP network processing, Network on Chip, system level SoC modeling and design space exploration methods, and self-adaptive fault-tolerant computing.