Innovative Timing Solutions Simplify Design of High-Performance Computing Accelerators

Cloud computing and artificial intelligence (AI) will be key to solving some of the world’s biggest challenges, accelerating scientific discovery, and increasing the pace of innovation in medical research, energy, healthcare and a wide range of other industries. Data scientists can now leverage AI and high-performance computing (HPC) to analyze vast amounts of data, extract insights and solve problems faster than ever. As the need for HPC grows, data centers are increasingly being optimized for HPC workload acceleration. This in turn has spurred demand for specialized computing, networking and storage hardware optimized for low-latency, high-throughput data processing and network connectivity. The same trend has increased the need for high-performance timing solutions to optimize the operation of HPC workload accelerators.

Server Acceleration

Hardware accelerators are used to speed up HPC workloads in data center applications. While graphics processing units (GPUs) have historically been used for this purpose, field-programmable gate arrays (FPGAs) are increasingly a viable alternative. Both solutions combine parallel processing, fast I/O, and high-speed memory interfaces to scale processing performance, enabling servers to efficiently run the neural networks powering search engines, speech recognition, natural-language translation and image processing. GPUs and FPGAs are transitioning to higher-speed 25 Gbps I/O interfaces to more easily scale co-processing between multiple ICs.

As shown in Figure 1, these high-speed I/O interfaces require low-jitter timing references to minimize bit-error rate and improve overall system performance. Low-jitter crystal oscillators (XOs) and clock generators are well suited for GPU/FPGA I/O clocking. High-performance timing devices such as Silicon Labs’ Si510 XO and Si5332 clock generator are ideally suited for this application because they combine low-jitter reference timing, small form factor and built-in power supply noise suppression, minimizing the impact of switched-mode power supply noise on high-speed I/O performance.

Figure 1. Reference timing for FPGA/GPU acceleration cards
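To put these jitter requirements in perspective, the short Python sketch below converts an assumed RMS random jitter figure into its peak-to-peak contribution at a target bit-error rate and compares it with the 40 ps unit interval of a 25 Gbps lane. The 200 fs figure and the 1e-12 BER target are illustrative assumptions, not specifications for the devices mentioned above.

```python
# Back-of-envelope jitter budget for a 25 Gbps serial lane.
# The RMS jitter figure and BER target are illustrative assumptions,
# not specifications of any particular device.
from statistics import NormalDist

line_rate_bps = 25e9                     # 25 Gbps lane
ui_ps = 1e12 / line_rate_bps             # unit interval: 40 ps

target_ber = 1e-12                       # common compliance target
rj_rms_fs = 200.0                        # assumed RMS random jitter (femtoseconds)

# For Gaussian random jitter, the peak-to-peak contribution at a given BER
# is 2*Q*RJ_rms, where Q is the one-sided Gaussian tail quantile for that BER.
q = NormalDist().inv_cdf(1.0 - target_ber)       # about 7.03 for 1e-12
tj_pp_ps = 2.0 * q * rj_rms_fs / 1000.0          # fs -> ps

print(f"Unit interval   : {ui_ps:.1f} ps")
print(f"Q factor @ BER {target_ber:g}: {q:.2f}")
print(f"RJ peak-to-peak : {tj_pp_ps:.2f} ps "
      f"({100.0 * tj_pp_ps / ui_ps:.1f}% of one UI)")
```

With these assumed numbers, a few hundred femtoseconds of RMS jitter already consumes several percent of a 40 ps unit interval, before deterministic jitter and the transmitter/receiver budgets are accounted for, which is why sub-picosecond reference clocks are specified for 25 Gbps I/O.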

Network Interface Cards

Network interface cards (NICs) are used to connect servers and storage resources within a data center network. As the need for bandwidth increases, data centers are transitioning from legacy 10GbE/40GbE fiber networks to higher-speed 25GbE/50GbE/100GbE networks. Not only must NICs coordinate the transfer of large amounts of data at line rate, they are also used to offload specific workloads and applications from software into hardware, helping data centers operate more efficiently. NICs transfer data from PCIe to Ethernet and provide high-speed interfaces to the network. Timing devices like the Si53204 PCIe buffer from Silicon Labs can be used for PCIe clock distribution, and the Si510 XO can provide a low-jitter reference clock for the Ethernet MAC/PHY.

Figure 2. Reference timing for network interface cards

Storage

In storage applications, the industry is rapidly transitioning from hard disk drives based on low-speed SATA (6 Gbps) and SAS (12 Gbps) host interfaces to solid-state drives (SSDs) based on the NVM Express® (NVMe) interface specification. A key benefit of NVMe is that it reduces latency and enables faster memory access, making it an ideal solution for flash memory data transfer. Another benefit is that NVMe uses the popular PCI Express (PCIe) serial interface to interconnect SSDs with servers/CPUs, which already embed PCIe interfaces for high-speed serial data transfer.
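To see why this transition matters, the rough Python calculation below compares the usable payload bandwidth of SATA, SAS and an NVMe SSD attached over four PCIe lanes, accounting only for line-coding overhead. The x4 lane count is an assumed, typical configuration; protocol overhead is ignored.

```python
# Rough payload-bandwidth comparison of storage interfaces, accounting only
# for line-coding overhead. The x4 lane count for NVMe is an assumed, typical
# configuration; protocol overhead is ignored.

interfaces = {
    # name: (raw rate per lane in bit/s, lanes, coding efficiency)
    "SATA 3 (6 Gbps, 8b/10b)":                  (6e9,  1, 8 / 10),
    "SAS 3 (12 Gbps, 8b/10b)":                  (12e9, 1, 8 / 10),
    "NVMe, PCIe Gen 3 x4 (8 GT/s, 128b/130b)":  (8e9,  4, 128 / 130),
    "NVMe, PCIe Gen 4 x4 (16 GT/s, 128b/130b)": (16e9, 4, 128 / 130),
}

for name, (rate, lanes, eff) in interfaces.items():
    payload_gbps = rate * lanes * eff / 1e9
    print(f"{name:<43} ~{payload_gbps:5.1f} Gbps (~{payload_gbps / 8:.2f} GB/s)")
```

Even with protocol overhead ignored, four lanes of PCIe Gen 3 offer roughly three times the payload bandwidth of a 12 Gbps SAS link, which is the headroom NVMe SSD controllers exploit.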

As shown in the figure below, SSD controllers require a high-performance PCIe clock generator to provide reference timing. This clock must support spread spectrum clock generation to reduce EMI and ensure regulatory compliance with emissions standards. In addition, it is critical to select a future-proof clock source that is compatible with the recently ratified PCIe Gen 4 standard while providing backward compatibility with PCIe Gen 1/2/3. The Si52204 is an example of a spread spectrum clock generator that meets PCIe Gen 1/2/3/4 specifications with significant margin.

Figure 3. Reference timing for PCIe/NVMe SSD
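As a rough illustration of spread spectrum clocking, the Python sketch below traces a triangular down-spread profile for a 100 MHz PCIe reference. The 0.5% depth and 33 kHz modulation rate are typical values assumed here for illustration rather than taken from any datasheet.

```python
# Illustration of a triangular "down-spread" profile for a 100 MHz PCIe reference.
# The 0.5% depth and 33 kHz modulation rate are typical, assumed values.

f_nominal = 100e6    # nominal reference frequency (Hz)
spread = 0.005       # 0.5% down-spread: frequency stays between 0.995*f_nom and f_nom
f_mod = 33e3         # modulation rate (Hz)
steps = 8            # samples per modulation period, for a coarse printout

for i in range(steps + 1):
    phase = i / steps                        # position within one modulation period
    tri = 2 * abs(phase - 0.5)               # triangle wave between 0 and 1
    f_inst = f_nominal * (1 - spread * tri)  # instantaneous output frequency
    print(f"t = {phase / f_mod * 1e6:6.2f} us   f = {f_inst / 1e6:8.4f} MHz")
```

Spreading the reference energy across a band of roughly 0.5 MHz lowers the peak spectral density seen by EMI receivers, while downstream PCIe devices track the slow modulation through their PLLs.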

Faster Time-To-Market

Data center hardware is typically refreshed every two to three years. A key benefit of HPC accelerators and NVMe-based SSDs is that they can be deployed rapidly to help data center operators react to shifting market needs and launch new applications and web services faster. Another benefit is scalability. Add-in cards plug into a standard server motherboard using a PCIe connector, immediately providing expanded capabilities to an existing server. The design time for an add-in card can be as short as six months, enabling data center operators to add new capabilities and deploy new web services quickly without requiring a forklift replacement of equipment within a data center rack.

Time-to-market is also a key consideration for timing devices used with HPC accelerators and NVMe-based SSDs. Hardware designers should consider programmable timing solutions that can be individually tailored and optimized to meet their specific performance, power and space requirements.

The Future of High-Performance Computing Accelerators

Over the last several years, there has been a significant proliferation of custom hardware solutions to address HPC and workload processing. This trend is expected to accelerate as new GPU, FPGA and ASIC products come to market that support lower latency, higher-speed I/O, higher-capacity memory interfaces, and faster data transfer among CPUs, memory and accelerator cards.

Recently, PCI-SIG ratified the PCIe Gen 4 standard, which supports CPU-memory-I/O-accelerator interconnect at 16 GT/s per lane. Gen 4 compliant solutions are currently in development, with mass deployment expected to start in 2019. Furthermore, PCI-SIG has just initiated work on PCIe Gen 5, which will enable CPU-memory-I/O-accelerator interconnect at 32 GT/s per lane.
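For context, the short Python calculation below tabulates the per-lane transfer rate and the resulting usable bandwidth of a x16 link for PCIe Gen 1 through Gen 5, using the published line-coding schemes and ignoring packet overhead.

```python
# Usable bandwidth of a x16 PCIe link per generation, accounting only for
# line coding; packet and protocol overhead are ignored.

generations = {
    # generation: (transfer rate in GT/s per lane, coding efficiency)
    "Gen 1": (2.5,  8 / 10),
    "Gen 2": (5.0,  8 / 10),
    "Gen 3": (8.0,  128 / 130),
    "Gen 4": (16.0, 128 / 130),
    "Gen 5": (32.0, 128 / 130),
}

lanes = 16
for gen, (gts, eff) in generations.items():
    per_lane_gbps = gts * eff
    x16_gb_per_s = per_lane_gbps * lanes / 8     # Gbit/s -> GB/s
    print(f"PCIe {gen}: {gts:5.1f} GT/s/lane -> {per_lane_gbps:5.2f} Gbps/lane, "
          f"x16 ~ {x16_gb_per_s:5.1f} GB/s")
```

Each generation doubles the per-lane transfer rate, which is also why reference clock jitter requirements tighten with each generation.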

The industry is not standing still: three competing standards have been defined to provide alternatives to PCIe. One of these new bus/interconnect standards is CCIX (Cache Coherent Interconnect for Accelerators). CCIX leverages the PCIe physical layer but extends the data rate up to 25 Gbps and specifies cache coherency between processors and accelerators. A second standard is OpenCAPI (Open Coherent Accelerator Processor Interface). This expansion bus standard relies on IBM POWER9 BlueLink 25 Gbps I/O for interconnect and supports Nvidia’s NVLink 2.0 protocol to enable coherent memory sharing between processors. The third standard is Gen-Z, a memory fabric that lets any device communicate with other devices as if it were accessing its own local memory, enabling applications to directly access any type of DRAM and NVM.

While it is difficult to predict which of these standards will prevail for future CPU-memory-I/O interconnect, one trend is clear. Future accelerator interconnect technologies will increasingly rely on high-performance timing solutions to optimize high-speed I/O performance. Future timing solutions must have excellent jitter performance to minimize system-level bit-error rates. Standards compliance and proven interoperability with FPGA/GPU suppliers will also be crucial, simplifying the interoperability between multiple standards and devices. Due to ever-increasing space and power constraints, future timing solutions must also be highly integrated, enabling a single component to provide all board-level timing.
