My thesis is a good introduction to where we are heading in computer architecture. It might be a paradise for those who deal with data-intensive applications.
PhD candidate Erik Vermij will defend his PhD research on Tuesday 4th July with his thesis Moving workloads to a better place - Optimizing computer architectures for data-intensive applications. He worked on novel computer systems to better deal with an ever-growing amount of data, initially for radio astronomy only, but later broadened his focus to also include generic graph and sparse-matrix methods. Erik was supervised by Professor Bertels (Computer Engineering) and Dr. Hagleitner (IBM).
Summary of thesis
The performance of supercomputers is no longer growing at the rate it once did. Several years ago a break with historical trends appeared, first at the lower end of worldwide supercomputer installations, but by now it affects a significant number of average-performance systems. Power consumption is becoming the most significant problem in computer system design. The traditional power-reduction trends no longer apply to current semiconductor technology, and the performance of general-purpose devices is limited by their power consumption. Server and system design is in turn limited by the allowable power consumption, which is bounded for reasons of cost and practical cooling methods. To further increase performance, the use of specialized devices, in specialized server designs optimized for a certain class of workloads, is gaining momentum. Data movement has been shown to be a significant drain on energy, and is furthermore a performance bottleneck when data is moved over an interconnect with limited bandwidth. With data becoming an increasingly important asset for governments, companies, and individuals, the development of systems optimized at the device and server level for data-intensive workloads is necessary. In this work, we explore some of the fundamentals required for such a system, as well as key use cases.
To highlight the relevance of the work for a real-world project, we analyze the feasibility of realizing a next-generation radio telescope, the Square Kilometre Array (SKA). We analyze the compute, bandwidth, and storage requirements of the instrument, and the behavior of various important algorithms on existing products. The SKA can be considered the ultimate big-data challenge, and its requirements and characteristics do not fit current products. By comparing the SKA requirements with historical trends, we show that realizing the instrument at its full capacity will not be achievable without a significant effort in the development of optimized systems.
To make progress towards the successful realization of the SKA, we develop a custom hardware architecture for the Central Signal Processor (CSP) subsystem of the SKA. The CSP is dominated by high input and output bandwidths, large local memories, and significant compute requirements. By means of a custom-developed ASIC connected to novel high-bandwidth memory, the proposed solution has a projected power efficiency of 208 GFLOPS/W, while supporting all CSP kernels in a flexible way. This is an example of how optimized systems can drive down the energy consumption of workloads, and thereby aid the realization of projects with non-conventional requirements.
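As a rough illustration of what such a power-efficiency figure means at system scale, the small calculation below converts a hypothetical sustained compute requirement into a power budget; the 10 PFLOPS figure is purely illustrative and is not taken from the thesis.

```cpp
// Back-of-the-envelope conversion from a power-efficiency figure to a power
// budget. Only the 208 GFLOPS/W efficiency comes from the text above; the
// 10 PFLOPS sustained requirement is a hypothetical example.
#include <cstdio>

int main() {
    const double efficiency_flops_per_watt = 208e9;  // 208 GFLOPS/W
    const double required_flops = 10e15;              // hypothetical 10 PFLOPS sustained
    const double watts = required_flops / efficiency_flops_per_watt;
    std::printf("power budget: %.1f kW\n", watts / 1e3);  // prints about 48.1 kW
    return 0;
}
```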
To improve the efficiency of a variety of workloads, we developed a hardware architecture supporting arbitrary processing capabilities close to the main memory of a CPU. This follows the theme of 'near-data processing', which offers, above all, high bandwidth and reduced data movement. The effort is driven by two main observations: 1) processing capabilities should be workload-optimized, and 2) a focus on data and memory is necessary for modern workloads. The architectural description covers data allocation and placement, coherence between the CPU and the near-data processors (NDPs), virtual memory management, and the accessing of remote data. All aspects related to data management are implemented with existing OS-level NUMA functionality, and require only changes in the firmware of the system. The other three aspects are realized by means of a novel component in the memory system (NDP-Manager, NDP-M) and a novel component attached to the CPU system bus (NDP Access Point, NDP-AP). The NDP-M realizes coherence between CPU and NDP by means of a fine- and coarse-grained directory mechanism, while the NDP-AP filters unnecessary coherence traffic and prevents it from being sent to the NDPs. Address translation is implemented by the NDP-M, whose Translation Lookaside Buffer (TLB) is filled and synchronized via a connection with the NDP-AP. The NDP-AP is furthermore the point where remote data accesses from the NDPs enter the global coherent address space. Several benchmarks, including a graph-traversal workload, show the feasibility of the proposed methods.
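As a minimal sketch of the data-placement side, the snippet below uses only standard libnuma calls to allocate a working set on the NUMA node through which NDP-attached memory would be exposed. The node id, buffer size, and workflow are assumptions for illustration, not the thesis's actual implementation.

```cpp
// Minimal sketch: place a working set in NDP-attached memory exposed as a
// NUMA node, using standard libnuma calls (compile with -lnuma).
// The node id (1) is a hypothetical value chosen for illustration.
#include <numa.h>
#include <cstdint>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    const int ndp_node = 1;               // assumed node id of NDP memory
    const size_t n = 1 << 20;             // illustrative working-set size
    // Allocate the array directly on the NDP-attached memory node, so a
    // near-data processor could work on it with local bandwidth.
    auto* data = static_cast<uint64_t*>(
        numa_alloc_onnode(n * sizeof(uint64_t), ndp_node));
    if (!data) {
        std::fprintf(stderr, "allocation on node %d failed\n", ndp_node);
        return 1;
    }
    for (size_t i = 0; i < n; ++i) data[i] = i;  // CPU initializes the data
    // ... hand the buffer to the NDP for processing ...
    numa_free(data, n * sizeof(uint64_t));
    return 0;
}
```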
The evaluation of the architecture, as well as the evaluation of various types of NDPs, required the development of a novel system simulator. The simulator allows NDPs described in a hardware description language to be evaluated within a simulated memory system. Arbitrary applications using the simulator feed the simulated memory system with loads and stores, and can control the NDPs. It is also possible to evaluate general-purpose NDPs running software threads. The complex system-level interactions concerning coherence and remote data accesses are modeled in detail and provide valuable insights.
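The sketch below illustrates the kind of front-end such a simulator exposes: the application issues loads and stores into a simulated memory system and starts an NDP model. All class and method names here are hypothetical and not the simulator's actual API.

```cpp
// Illustrative simulator front-end in the spirit described above: the
// application drives a simulated memory system with loads and stores and
// controls an NDP stand-in. Names and structure are hypothetical.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

class SimulatedMemory {
public:
    uint64_t load(uint64_t addr) {
        ++loads_;
        auto it = mem_.find(addr);
        return it == mem_.end() ? 0 : it->second;
    }
    void store(uint64_t addr, uint64_t value) {
        ++stores_;
        mem_[addr] = value;
    }
    void print_stats() const {
        std::printf("simulated accesses: %zu loads, %zu stores\n", loads_, stores_);
    }
private:
    std::unordered_map<uint64_t, uint64_t> mem_;
    size_t loads_ = 0, stores_ = 0;
};

// Stand-in for a workload-optimized NDP: sums a region of simulated memory.
class NdpStub {
public:
    explicit NdpStub(SimulatedMemory& mem) : mem_(mem) {}
    void start(uint64_t base, size_t count) {        // "control the NDP"
        result_ = 0;
        for (size_t i = 0; i < count; ++i)
            result_ += mem_.load(base + i * 8);       // NDP-side loads
        done_ = true;
    }
    bool done() const { return done_; }
    uint64_t result() const { return result_; }
private:
    SimulatedMemory& mem_;
    uint64_t result_ = 0;
    bool done_ = false;
};

int main() {
    SimulatedMemory mem;
    NdpStub ndp(mem);
    // The application feeds the simulated memory with stores...
    for (uint64_t i = 0; i < 16; ++i) mem.store(i * 8, i);
    // ...and then offloads a reduction to the NDP.
    ndp.start(0, 16);
    if (ndp.done())
        std::printf("NDP result: %llu\n",
                    static_cast<unsigned long long>(ndp.result()));
    mem.print_stats();
    return 0;
}
```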
Two relevant benchmarks for both high-performance computing and data-intensive workloads are the High-Performance Conjugate Gradient (HPCG) benchmark and the Graph500 benchmark. They implement a distributed multi-grid conjugate gradient solver and a distributed graph breadth-first search, respectively. Both benchmarks are implemented on the proposed architecture containing four NDPs, consisting of very small and power-efficient cores. By exploring both the parameters of the architecture and various software optimizations, we boost the performance of both benchmarks by a factor of 3x compared to a CPU. A key feature is the high-bandwidth and low-latency interconnect between the NDPs, by means of the NDP-AP. The cacheability of remote data at the NDP-AP enables fast access to shared data and is an important aspect for Graph500 performance. The use of user-enhanced coherence boosts performance in two ways. First, guiding the coarse-grained coherence mechanism at the NDP-M eliminates many of the required coherence directory lookups. Second, allowing remote data to be cached in NDP hardware-managed caches improves data locality and performance, at the expense of more programming effort to manually maintain coherence.
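For readers unfamiliar with the Graph500 kernel, the following is a minimal, single-threaded level-synchronous BFS over a CSR graph. It is a generic sketch of the algorithm, not the distributed NDP implementation from the thesis, but it shows the irregular neighbor accesses that make the workload bandwidth-bound.

```cpp
// Minimal level-synchronous BFS over a graph in CSR form, the core kernel
// behind Graph500. The irregular, random-like reads of col_idx and level[]
// are exactly the access pattern that near-data processing targets.
#include <cstdio>
#include <vector>

std::vector<int> bfs(const std::vector<int>& row_ptr,
                     const std::vector<int>& col_idx, int source) {
    std::vector<int> level(row_ptr.size() - 1, -1);
    std::vector<int> frontier{source};
    level[source] = 0;
    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        for (int u : frontier)
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
                int v = col_idx[e];                // irregular neighbor access
                if (level[v] == -1) { level[v] = depth; next.push_back(v); }
            }
        frontier.swap(next);
    }
    return level;
}

int main() {
    // Tiny example graph: edges 0-1, 0-2, 1-3, 2-3 (undirected, CSR form).
    std::vector<int> row_ptr{0, 2, 4, 6, 8};
    std::vector<int> col_idx{1, 2, 0, 3, 0, 3, 1, 2};
    auto level = bfs(row_ptr, col_idx, 0);
    for (int v = 0; v < 4; ++v)
        std::printf("vertex %d at level %d\n", v, level[v]);
    return 0;
}
```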
A typical operation in big-data workloads is the sorting of data sets. Sorting data has, by nature, phases with a lot of data locality and phases with little data locality. This opens up the intriguing possibility of heterogeneous CPU and NDP usage, where the two types of devices handle the high-locality and low-locality phases, respectively. The CPU makes optimal use of its caches, while the NDP makes optimal use of the high bandwidth to main memory. We evaluate this with a workload-optimized merge-sort NDP, and obtain up to a 2.6x speedup compared to a CPU-only implementation. Given the very low power of the workload-optimized NDP, the overall energy-to-solution improvement is up to 2.5x.
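The snippet below sketches the phase split under discussion: cache-sized blocks are sorted first (the high-locality phase the CPU keeps), after which wide merge passes stream through memory with little reuse (the low-locality phase that would be offloaded to the merge-sort NDP). The block size, the data, and the CPU-only merge stand-in are assumptions for illustration.

```cpp
// Sketch of the heterogeneous sort's phase split. Phase 1 sorts cache-sized
// blocks (cache-friendly, kept on the CPU); phase 2 performs streaming merge
// passes with little reuse (the part an NDP would handle). Here both phases
// run on the CPU as a stand-in; sizes are illustrative.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const size_t n = 1 << 20;          // illustrative data-set size
    const size_t block = 1 << 14;      // "cache-sized" block, an assumption
    std::vector<uint32_t> data(n);
    std::mt19937 rng(42);
    for (auto& x : data) x = rng();

    // Phase 1 (CPU, high locality): sort each cache-sized block in place.
    for (size_t i = 0; i < n; i += block)
        std::sort(data.begin() + i, data.begin() + std::min(i + block, n));

    // Phase 2 (NDP in the proposed system, low locality): merge sorted runs,
    // doubling the run width each pass until the whole array is one run.
    std::vector<uint32_t> tmp(n);
    for (size_t width = block; width < n; width *= 2) {
        for (size_t i = 0; i < n; i += 2 * width) {
            auto mid = std::min(i + width, n), end = std::min(i + 2 * width, n);
            std::merge(data.begin() + i, data.begin() + mid,
                       data.begin() + mid, data.begin() + end,
                       tmp.begin() + i);
        }
        data.swap(tmp);
    }
    std::printf("sorted: %s\n",
                std::is_sorted(data.begin(), data.end()) ? "yes" : "no");
    return 0;
}
```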