Monday, February 25, 2013

OpenCL matrix multiplication

During the past week Adam has been focusing on course work, including the Embedded Parallel Computing course, for which he developed an OpenCL implementation of matrix multiplication. While OpenCL shares many similarities with CUDA, using the bare API requires writing a lot of boilerplate code. For example, configuring a kernel for execution on a GPU device means explicitly fetching the device's ID and creating a compute context and a command queue to which the kernel call is then enqueued; on top of that, each kernel argument has to be set with an individual function call before the kernel can be enqueued. In CUDA, by contrast, it suffices to annotate the kernel call with an execution configuration, which for simple programs amounts to a one-liner:

kernel_name<<<blocksInGrid, threadsInBlock>>>(x, y, z);


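For contrast, the OpenCL host-side setup described above could look roughly like the following sketch against the OpenCL 1.x C API (error checking, data transfers and cleanup are elided, and kernel_name, x, y and z simply mirror the placeholders from the CUDA one-liner):

#include <CL/cl.h>

int main(void)
{
    cl_int err;

    /* Fetch the first platform and a GPU device ID. */
    cl_platform_id platform;
    cl_device_id device;
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Create a compute context and a command queue for the device. */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    /* Compile the kernel from source at runtime. */
    const char *src =
        "__kernel void kernel_name(__global float *x, __global float *y,"
        "                          __global float *z) { /* ... */ }";
    cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "kernel_name", &err);

    /* Every kernel argument is set by an individual call. */
    cl_mem x = clCreateBuffer(context, CL_MEM_READ_ONLY,  1024 * sizeof(float), NULL, &err);
    cl_mem y = clCreateBuffer(context, CL_MEM_READ_ONLY,  1024 * sizeof(float), NULL, &err);
    cl_mem z = clCreateBuffer(context, CL_MEM_WRITE_ONLY, 1024 * sizeof(float), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &x);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &y);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &z);

    /* Enqueue the kernel; the global and local work sizes play the roles of
       CUDA's blocksInGrid * threadsInBlock and threadsInBlock respectively. */
    size_t global = 1024, local = 64;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local,
                                 0, NULL, NULL);
    clFinish(queue);
    return 0;
}

Every one of these calls can fail, so real host code also has to check err after each step, which adds to the verbosity.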
The rest of the application, host as well as kernel code, looks much the same in the two frameworks, and OpenCL does provide some convenience functions which are not present in CUDA, for example get_global_id(0), which does the same thing as CUDA's ubiquitous blockIdx.x * blockDim.x + threadIdx.x incantation (calculating a thread's global identifier in the given dimension). The additional effort of an OpenCL implementation does however translate to portability, meaning that code written for one GPU has a good chance of running on another GPU, or even on other platforms such as multicore CPUs, though with no guarantees about performance. Also, recent developments in libraries and bindings such as SimpleOpenCL or PyOpenCL could give many of the benefits of CUDA (perhaps more) while retaining most of OpenCL's portability.
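To end with something concrete: a naive OpenCL matrix multiplication kernel (a sketch, not necessarily the one written for the course) assigns one work-item per element of the result matrix and uses get_global_id for the row and column indices:

/* Naive matrix multiplication C = A * B for n x n row-major matrices.
   Each work-item computes one element of C. */
__kernel void matmul(__global const float *A,
                     __global const float *B,
                     __global float *C,
                     const int n)
{
    int col = get_global_id(0);   /* global column index of this work-item */
    int row = get_global_id(1);   /* global row index of this work-item    */

    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}

The host would enqueue this with a two-dimensional global work size of n by n work-items, so that get_global_id(0) and get_global_id(1) together enumerate every element of C.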