Introducing CLUtil

March 14, 2016: Reformatting

Edited: May 4, 2013

Edited: September 6, 2013: Change CLUtil links to a tagged version compatible with the code presented here.

OpenCL is a cross-platform parallel programming standard with support for execution on both CPUs and GPUs. The OpenCL package on hackage provides a direct binding to the API with just enough Haskellosity to make invoking those API functions borderline pleasant. That said, there remains a certain amount of boilerplate that is rather offputting.

For context, we are trying to write program fragments that may be executed in parallel to improve efficiency. Multi-core CPUs are good at this, GPUs can be even better. We can bend GLSL to our needs (see my modern OpenGL with GLSL in Haskell tutorial), but are then left to contend with various graphics and GUI considerations that are completely irrelevant to our goals. OpenCL represents a much more familiar looking programming environment (a kind of C with support for small vectors), with less setup overhead than a GPGPU program using GLSL on the backend.

So let's get started: How can we use OpenCL from a Haskell program?

First we'll write a simple OpenCL kernel that adds two vectors, a and b, and stores the element-wise square of the result in c. To make it slightly interesting we'll operate on four floating point numbers at a time.

__kernel void vecAdd(__global float *a, 
                     __global float *b,
                     __global float *c) {
  size_t i = get_global_id(0);
  float4 r = vload4(i,a) + vload4(i,b);
  vstore4(r*r,i,c);
}

I've saved this code in a file called VecEZ.cl.

To recap: we have an OpenCL kernel that takes two read-only input vectors and a write-only output vector. This sort of configuration fits rather well into a functional setting, so it is where we will focus our efforts as we endeavour to make the common things easy. Now we're ready for some Haskell:

import Control.Parallel.CLUtil
import qualified Data.Vector.Storable as V

I've imported my CLUtil library that provides some simple helpers for common Haskell-OpenCL operations. Now let's write a program that initializes OpenCL, loads and compiles an OpenCL kernel stored in a separate file, and run that kernel on some data.

test1 = do s <- ezInit CL_DEVICE_TYPE_CPU
           k <- kernelFromFile s "VecEZ.cl" "vecAdd"
           let v1 = V.fromList [1,2,3,4::Float]
               v2 = V.fromList [5,6,7,8::Float]
           v3 <- runKernel s k v1 v2 (Out 4) (Work1D 1)
           print (v3::Vector Float)

And that's it! The ezInit function intializes an OpenCL device and sets up a context and queue, all of which are returned in a record. The kernelFromFile function reads OpenCL source code from a file, then builds the named kernel using the previously initialized OpenCL state record. Finally, we produce some data in Haskell and call runKernel.

The runKernel action is variadic in an attempt to meet most of your kernel-running needs. Its arguments are an OpenCL state record, a kernel to run, the parameters to be passed to the OpenCL kernel, and the number of global work items (i.e. how many times to invoke the kernel). Recall that our kernel had three parameters, and operated on four elements at once. So, we pass the kernel the two input vectors, a note requesting an output vector of length four, and a specification that our kernel should be run exactly once. We bind the output to another vector, and provide a type annotation on the next line to let runKernel figure out how much memory to allocate for the output vector. This is potentially confusing, hopefully more examples will clarify usage.

Next, let's consider another kernel that demonstrates a bit more variety in the types.

__kernel void funnyBusiness(__global double *a,
                            __global double *b,
                            __global double *c,
                            __global int *s) {
  size_t i = get_global_id(0);
  double4 r = vload4(i,a) + vload4(i,b);
  double4 rsq = r * r;
  vstore4(rsq, i, c);
  s[i] = (int)(rsq.x + rsq.y + rsq.z + rsq.w);
}

This kernel is similar to the first, but uses the double scalar type, and also takes an int pointer as an argument that has an int cast of the horizontal sum of the result vector written to it. The Haskell side looks very similar to the last time,

test2 = do s <- ezInit CL_DEVICE_TYPE_CPU
           k <- kernelFromFile s "VecEZ.cl" "funnyBusiness"
           let v1 = V.enumFromN  0 12 :: Vector Double
               v2 = V.enumFromN 12 12 :: Vector Double
           (v3,v4) <- runKernel s k v1 v2
                                (Out 12) -- v3 is a 12-element output
                                (Out 3)  -- v4 is a 3-element output
                                (Work1D $ 12 `div` 4)
           print $ V.sum (v3::Vector Double)
           print $ V.sum (v4::Vector CInt)

A few things to note:

One last bit of mystery that remains is the Work1D specification. OpenCL supports up to three dimensions of work items, which can ease the handling of multi-dimensional data. If we were adding a matrix, for example, we might have the elements of the matrices stored in Vectors, but the elements are intuitively addressed by a pair of coordinates. A naive OpenCL kernel for adding matrices might go something like this,

__kernel void floaty(__global float *a, 
                     __global float *b,
                     __global float *c) {
  size_t x = get_global_id(0);
  size_t y = get_global_id(1);
  size_t i = y * get_global_size(1) + x;
  c[i] = a[i] + b[i];
}

On the OpenCL side, we're not gaining anything by working with two dimensions of work item indices, but sometimes these coordinates can inform coordinate-dependent logic in the OpenCL program.

The Haskell side is once again very similar, with the exception of the new Work2D specification.

test3 = do s <- ezInit CL_DEVICE_TYPE_GPU
           k <- kernelFromFile s "VecEZ2.cl" "floaty"
           let v1 = V.enumFromN  0 16 :: Vector Float
               v2 = V.enumFromN 16 16 :: Vector Float
           v3 <- runKernel s k v1 v2 (Out 16) (Work2D 4 4)
                 :: IO (Vector Float)
           mapM_ (\i -> print (V.slice i 4 v3)) [0,4..12]

Note that here we explicitly request a GPU to run our program. The GPU in my laptop supports floats, but not doubles, so I put this kernel in a separate file from the others.

Hopefully usage of CLUtil is becoming clear. So let's crank things up and take advantage of the Haskell ecosystem.

A blog post describing quasicrystal figures serves as an interesting test of OpenCL performance. As a first pass, we can port the pixel computations from that post to OpenCL without thinking about optimization…

float wave(float,float,float,float);
float wrap(float);

__kernel void quasiCrystal(int pixels,
                           float scale,
                           float phase,
                           int numAngles,
                           __constant float *angles,
                           __global uchar *img) {
  int x = get_global_id(0);
  int y = get_global_id(1);
  float denom = (float)pixels - 1;
  float u = scale * ((2 * (float)x / denom) - 1);
  float v = scale * ((2 * (float)y / denom) - 1);
  float sum = 0.0f;
  for(int i = 0; i < numAngles; ++i) {
    sum += wave(phase, angles[i], u, v);
  }
  uchar r = (uchar)(255.0f * clamp(wrap(sum), 0.0f, 1.0f));
  vstore4((uchar4)(255,r,128,r), y*pixels + x, img);
}

float wrap(float n) {
  int k = (int)n;
  if(n < 0) k -= 1;
  float v = n - (float)k;
  if(k%2 == 1) v = 1 - v;
  return v;
}

float wave(float phase, float theta, float x, float y) {
  return (cos(cos(theta) * x + sin(theta)*y + phase) + 1) / 2;
}

…then write the control logic in Haskell. We also tie our image generation into the gloss library for a real-time display.

import Control.Parallel.CLUtil
import qualified Data.Vector.Storable as V
import Data.Word (Word8)
import Graphics.Gloss hiding (Vector, scale)
import Graphics.Gloss.Interface.IO.Animate hiding (Vector, scale)

pixels :: Int
pixels = 800

scale :: Float
scale = 128

angles :: Int -> Vector Float
angles n = V.map (* (pi / fromIntegral n)) $ V.enumFromN 0 n

mkPicture :: Vector Word8 -> Picture
mkPicture = flip (bitmapOfForeignPtr pixels pixels) False
          . (\(x,_) -> x)
          . V.unsafeToForeignPtr0

main = do s <- ezInit CL_DEVICE_TYPE_ALL
          k <- kernelFromFile s "QuasiCrystalRGBA.cl" "quasiCrystal"
          let numPix = pixels * pixels
              pixels' = fromIntegral pixels :: CInt
              numAngles = 7
              allAngles = angles numAngles
              frame :: Float -> IO Picture
              frame phase = mkPicture `fmap`
                            runKernel s k pixels' scale phase
                                      numAngles allAngles
                                      (Out (numPix*4)) (Work2D pixels pixels)
          animateIO (InWindow "Quasicrystal" (pixels,pixels) (0,0)) black frame

On my laptop, running on the CPU consumes 300% of a dual core i5 (with hyperthreading), while running on the GPU consumes 85% CPU. While this example is somewhat more complicated, the ability to switch between CPU and GPU computation driving a visualization, all from an interactive interpreter is highly compelling. The animation runs at a reasonable clip on a dual core CPU, so try it out yourself! It should look something like this,

quasicrystal.png

These programs may all be found in the examples directory of the CLUtil package.

P.S. When building executables on your system, you may need to supply GHC with an -lOpenCL option. This is not necessary in Mac OS X, where we use the installed OpenCL framework, but is likely needed elsewhere.