CEO Pat Gelsinger’s re-imagining of Intel includes a greater emphasis on software. To that end, he has installed Greg Lavender as Intel’s CTO and made him the head of all things software by appointing him general manager of the Software and Advanced Technology Group (SATG). On June 1, Joseph Curley, SATG’s Vice President and General Manager of Software Products and Ecosystem, used the community section of the company’s website to announce that Intel had signed an agreement to purchase Codeplay, a supplier of parallel compilers and related tools that developers use to accelerate Big Data, HPC (High Performance Computing), AI (Artificial Intelligence), and ML (Machine Learning) workloads. Codeplay’s compilers generate code for many different CPUs and hardware accelerators. Curley wrote:
“Subject to the closing of the transaction, which we anticipate later this quarter, Codeplay will operate as a subsidiary business as part of Intel’s Software and Advanced Technology Group (SATG). Through the subsidiary structure, we plan to foster Codeplay’s unique entrepreneurial spirit and open ecosystem approach for which it is known and respected in the industry.”
This acquisition will bolster Intel’s efforts to develop one universal parallel programming language called DPC++, Intel’s implementation of the Khronos Group’s SYCL. Developers can program Intel’s growing stable of “XPUs” (CPUs and hardware accelerators) using DPC++, a major component of Intel’s oneAPI Base Toolkit. The toolkit supports multiple hardware architectures through the DPC++ programming language, a set of library APIs, and a low-level hardware interface that fosters cross-architecture programming.
Just a few weeks prior to this announcement, on May 10, Codeplay’s Chief Business Officer Charles Macfarlane gave an hour-long presentation at the Intel Vision event held in Dallas, where he described his company’s work with SYCL, oneAPI, and DPC++ in some technical detail. Macfarlane explained that SYCL’s objectives are comparable to those of Nvidia’s CUDA. Both languages aim to accelerate code execution by running portions of the code called kernels on alternative execution engines. In CUDA’s case, the target accelerators are Nvidia GPUs. For SYCL and DPC++, the choices are substantially wider.
SYCL takes a non-proprietary approach and has built-in mechanisms to permit easy retargeting of code to a variety of execution engines including CPUs, GPUs, and FPGAs. In other words, SYCL code is portable across architectures and across vendors. For example, Codeplay offers SYCL compilers that can target both Nvidia and AMD GPUs. Given the acquisition announcement, it probably won’t be long before Intel’s GPUs are added to this list. SYCL compilers also support CPU architectures from multiple vendors. Consequently, coding in SYCL instead of CUDA allows developers to rapidly evaluate multiple CPUs and acceleration platforms and to pick the best one for their application. It also gives developers a way to reduce the power consumption of their application by picking different accelerators based on their performance/power characteristics.
During his talk, Macfarlane recounted some significant examples that highlighted the effectiveness of oneAPI and DPC++ relative to CUDA. In one example, the Zuse Institute Berlin took code for a tsunami simulation workload called easyWave, which was originally written for Nvidia GPUs using CUDA, and automatically converted that code to DPC++ using Intel’s DPC++ Compatibility Tool (DPCT). The converted code can be retargeted to Intel CPUs, GPUs, and FPGAs by using the appropriate compilers and libraries. With yet another library and the appropriate Codeplay compiler, that SYCL code also can run on Nvidia GPUs. In fact, the Zuse Institute did run that converted DPC++ code on Nvidia GPUs for comparison and found that the performance results were within 4% of the original CUDA results, for machine-converted code with no additional tuning.
A 4% performance loss won’t get many people excited enough to convert from CUDA to DPC++, even if they acknowledge that a little tuning might achieve even better performance, so Macfarlane provided a more convincing example. Codeplay took N-body kernel code written in CUDA for Nvidia GPUs and converted it into SYCL code using DPCT. The N-body kernel is a complicated piece of multidimensional vector mathematics that simulates the motion of multiple particles under the influence of physical forces. Codeplay compiled the resulting SYCL code directly and did not further optimize or tune it. The original CUDA version of the N-body code kernel ran in 10.2 milliseconds on Nvidia GPUs. The converted DPC++ version of the N-body code kernel ran in 8.79 milliseconds on the same Nvidia GPUs. That’s roughly a 14% reduction in run time from machine-translated code, and it may be possible to do even better.
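For readers who want to check the arithmetic behind that figure, here is a minimal sketch in Python. The two run times come from Macfarlane’s example; the function and variable names are illustrative assumptions, not part of any Intel or Codeplay tool.

```python
def runtime_reduction(baseline_ms: float, new_ms: float) -> float:
    """Fraction of the baseline run time that the new version saves."""
    return (baseline_ms - new_ms) / baseline_ms


def speedup(baseline_ms: float, new_ms: float) -> float:
    """How many times faster the new version is than the baseline."""
    return baseline_ms / new_ms


# Run times quoted in the talk for the N-body kernel on the same Nvidia GPUs.
cuda_ms = 10.2    # original CUDA version
dpcpp_ms = 8.79   # machine-converted DPC++ version, no tuning

reduction = runtime_reduction(cuda_ms, dpcpp_ms)
ratio = speedup(cuda_ms, dpcpp_ms)

print(f"run-time reduction: {reduction:.1%}")  # prints "run-time reduction: 13.8%"
print(f"speedup: {ratio:.2f}x")                # prints "speedup: 1.16x"
```

Measured as saved run time, the improvement is about 13.8%, which rounds to the 14% quoted above. Expressed as a speedup ratio, the same numbers give about 1.16x.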
Macfarlane explained that there are two optimization levels available to developers for making DPC++ code run even faster: auto tuning, which selects the “best” algorithm from available libraries, and hand tuning using platform-specific optimization guidelines. There’s yet another optimization tool available to developers when targeting Intel CPUs and accelerators: the VTune Profiler, Intel’s widely used and highly respected performance analysis and power optimization tool. Originally, the VTune Profiler worked only on CPU code, but Intel has extended the tool to cover code targeting GPUs and FPGAs as well and has now integrated VTune into Intel’s oneAPI Base Toolkit.
The open oneAPI platform offers two major benefits: multivendor compatibility and portability across different types of hardware accelerators. Multivendor compatibility means that the same code can run on hardware from AMD, Intel, Nvidia, or any other hardware vendor for which a compatible compiler is available. Portability across hardware accelerators allows developers to achieve better performance by compiling their code for different accelerators, analyzing the performance from each accelerator, and then picking the best result.
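The “compile, measure, pick the best” workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the measurements are assumed to come from the developer’s own benchmark runs, and the only figures used here are the two N-body run times quoted in the talk. The energy calculation assumes an average power draw per target that the developer has measured separately.

```python
from typing import Mapping


def pick_fastest(runtimes_ms: Mapping[str, float]) -> str:
    """Return the name of the target with the shortest measured run time."""
    if not runtimes_ms:
        raise ValueError("no measurements supplied")
    return min(runtimes_ms, key=lambda name: runtimes_ms[name])


def pick_lowest_energy(runtimes_ms: Mapping[str, float],
                       power_watts: Mapping[str, float]) -> str:
    """Return the target that uses the least energy per run.

    Energy per run is approximated as average power times run time
    (watts x milliseconds = millijoules). Both inputs are assumed to be
    the developer's own measurements for the same workload.
    """
    if not runtimes_ms:
        raise ValueError("no measurements supplied")
    energy_mj = {name: runtimes_ms[name] * power_watts[name]
                 for name in runtimes_ms}
    return min(energy_mj, key=lambda name: energy_mj[name])


# The only measurements available from the talk: two builds of the
# N-body kernel on the same Nvidia GPUs. The labels are illustrative.
nbody_runtimes_ms = {
    "CUDA original": 10.2,
    "DPC++ converted": 8.79,
}

best = pick_fastest(nbody_runtimes_ms)
print(best)  # prints "DPC++ converted"
```

The same selection applies when the dictionary keys are different accelerators rather than different builds. The fastest target and the most energy-efficient target need not be the same one, which is why the two criteria are kept as separate functions.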
After Intel acquires Codeplay, it remains to be seen how well the new Intel subsidiary continues to support accelerator hardware from non-Intel vendors. Given Curley’s remarks quoted above and the open nature of oneAPI, it’s quite possible that Codeplay will continue to support multiple hardware vendors. Not only would this be the right thing to do for developers, it also hands Gelsinger an important set of metrics to measure any Intel XPU group that produces accelerator chips. These metrics will help to identify which Intel accelerators need work to keep up with or to exceed the competition’s performance. That’s just the sort of objective, market-driven stick that Gelsinger might want as he drives Intel towards his vision of the company’s future.