Introducing CLUtil
March 14, 2016: Reformatting
Edited: May 4, 2013
Edited: September 6, 2013: Change CLUtil links to a tagged version compatible with the code presented here.
OpenCL is a cross-platform parallel programming standard with support for execution on both CPUs and GPUs. The OpenCL package on hackage provides a direct binding to the API with just enough Haskellosity to make invoking those API functions borderline pleasant. That said, there remains a certain amount of boilerplate that is rather offputting.
For context, we are trying to write program fragments that may be executed in parallel to improve efficiency. Multi-core CPUs are good at this, GPUs can be even better. We can bend GLSL to our needs (see my modern OpenGL with GLSL in Haskell tutorial), but are then left to contend with various graphics and GUI considerations that are completely irrelevant to our goals. OpenCL represents a much more familiar looking programming environment (a kind of C with support for small vectors), with less setup overhead than a GPGPU program using GLSL on the backend.
So let's get started: How can we use OpenCL from a Haskell program?
First we'll write a simple OpenCL kernel that adds two vectors, a
and b
, and stores the element-wise square of the result in c
. To
make it slightly interesting we'll operate on four floating point
numbers at a time.
__kernel void vecAdd(__global float *a, __global float *b, __global float *c) { size_t i = get_global_id(0); float4 r = vload4(i,a) + vload4(i,b); vstore4(r*r,i,c); }
I've saved this code in a file called VecEZ.cl
.
To recap: we have an OpenCL kernel that takes two read-only input vectors and a write-only output vector. This sort of configuration fits rather well into a functional setting, so it is where we will focus our efforts as we endeavour to make the common things easy. Now we're ready for some Haskell:
import Control.Parallel.CLUtil import qualified Data.Vector.Storable as V
I've imported my CLUtil library that provides some simple helpers for common Haskell-OpenCL operations. Now let's write a program that initializes OpenCL, loads and compiles an OpenCL kernel stored in a separate file, and run that kernel on some data.
test1 = do s <- ezInit CL_DEVICE_TYPE_CPU k <- kernelFromFile s "VecEZ.cl" "vecAdd" let v1 = V.fromList [1,2,3,4::Float] v2 = V.fromList [5,6,7,8::Float] v3 <- runKernel s k v1 v2 (Out 4) (Work1D 1) print (v3::Vector Float)
And that's it! The ezInit
function intializes an OpenCL
device and sets up a context and queue, all of which are returned in a
record. The kernelFromFile
function reads OpenCL source
code from a file, then builds the named kernel using the previously
initialized OpenCL state record. Finally, we produce some data in
Haskell and call runKernel
.
The runKernel
action is variadic in an attempt to meet
most of your kernel-running needs. Its arguments are an OpenCL state
record, a kernel to run, the parameters to be passed to the OpenCL
kernel, and the number of global work items (i.e. how many times to
invoke the kernel). Recall that our kernel had three parameters, and
operated on four elements at once. So, we pass the kernel the two
input vectors, a note requesting an output vector of length four, and
a specification that our kernel should be run exactly once. We bind
the output to another vector, and provide a type annotation on the
next line to let runKernel
figure out how much memory to
allocate for the output vector. This is potentially confusing,
hopefully more examples will clarify usage.
Next, let's consider another kernel that demonstrates a bit more variety in the types.
__kernel void funnyBusiness(__global double *a, __global double *b, __global double *c, __global int *s) { size_t i = get_global_id(0); double4 r = vload4(i,a) + vload4(i,b); double4 rsq = r * r; vstore4(rsq, i, c); s[i] = (int)(rsq.x + rsq.y + rsq.z + rsq.w); }
This kernel is similar to the first, but uses the double
scalar type,
and also takes an int
pointer as an argument that has an int
cast of
the horizontal sum of the result vector written to it. The Haskell side
looks very similar to the last time,
test2 = do s <- ezInit CL_DEVICE_TYPE_CPU k <- kernelFromFile s "VecEZ.cl" "funnyBusiness" let v1 = V.enumFromN 0 12 :: Vector Double v2 = V.enumFromN 12 12 :: Vector Double (v3,v4) <- runKernel s k v1 v2 (Out 12) -- v3 is a 12-element output (Out 3) -- v4 is a 3-element output (Work1D $ 12 `div` 4) print $ V.sum (v3::Vector Double) print $ V.sum (v4::Vector CInt)
A few things to note:
v3
andv4
have different types. TherunKernel
action always returnsVector
s, be they individual as intest1
, or tupled as here. TheCLUtil
library currently supports up to three output vectors, but this is an arbitrary limit that will be raised as the library matures.- Once again we operate on four input elements at a time. Given that our source vectors have twelve elements each, we must run three work items.
- The type annotation on
v4
isCInt
rather thanInt
. This is important as the size of Haskell'sInt
may be 32 or 64 bits, while OpenCL'sint
is always 32 bits.
One last bit of mystery that remains is the Work1D
specification. OpenCL supports up to three dimensions of work items,
which can ease the handling of multi-dimensional data. If we were
adding a matrix, for example, we might have the elements of the
matrices stored in Vector
s, but the elements are intuitively
addressed by a pair of coordinates. A naive OpenCL kernel for adding
matrices might go something like this,
__kernel void floaty(__global float *a, __global float *b, __global float *c) { size_t x = get_global_id(0); size_t y = get_global_id(1); size_t i = y * get_global_size(1) + x; c[i] = a[i] + b[i]; }
On the OpenCL side, we're not gaining anything by working with two dimensions of work item indices, but sometimes these coordinates can inform coordinate-dependent logic in the OpenCL program.
The Haskell side is once again very similar, with the exception of the
new Work2D
specification.
test3 = do s <- ezInit CL_DEVICE_TYPE_GPU k <- kernelFromFile s "VecEZ2.cl" "floaty" let v1 = V.enumFromN 0 16 :: Vector Float v2 = V.enumFromN 16 16 :: Vector Float v3 <- runKernel s k v1 v2 (Out 16) (Work2D 4 4) :: IO (Vector Float) mapM_ (\i -> print (V.slice i 4 v3)) [0,4..12]
Note that here we explicitly request a GPU to run our program. The GPU in my laptop supports floats, but not doubles, so I put this kernel in a separate file from the others.
Hopefully usage of CLUtil
is becoming clear. So let's crank things up
and take advantage of the Haskell ecosystem.
A blog post describing quasicrystal figures serves as an interesting test of OpenCL performance. As a first pass, we can port the pixel computations from that post to OpenCL without thinking about optimization…
float wave(float,float,float,float); float wrap(float); __kernel void quasiCrystal(int pixels, float scale, float phase, int numAngles, __constant float *angles, __global uchar *img) { int x = get_global_id(0); int y = get_global_id(1); float denom = (float)pixels - 1; float u = scale * ((2 * (float)x / denom) - 1); float v = scale * ((2 * (float)y / denom) - 1); float sum = 0.0f; for(int i = 0; i < numAngles; ++i) { sum += wave(phase, angles[i], u, v); } uchar r = (uchar)(255.0f * clamp(wrap(sum), 0.0f, 1.0f)); vstore4((uchar4)(255,r,128,r), y*pixels + x, img); } float wrap(float n) { int k = (int)n; if(n < 0) k -= 1; float v = n - (float)k; if(k%2 == 1) v = 1 - v; return v; } float wave(float phase, float theta, float x, float y) { return (cos(cos(theta) * x + sin(theta)*y + phase) + 1) / 2; }
…then write the control logic in Haskell. We also tie our image generation into the gloss library for a real-time display.
import Control.Parallel.CLUtil import qualified Data.Vector.Storable as V import Data.Word (Word8) import Graphics.Gloss hiding (Vector, scale) import Graphics.Gloss.Interface.IO.Animate hiding (Vector, scale) pixels :: Int pixels = 800 scale :: Float scale = 128 angles :: Int -> Vector Float angles n = V.map (* (pi / fromIntegral n)) $ V.enumFromN 0 n mkPicture :: Vector Word8 -> Picture mkPicture = flip (bitmapOfForeignPtr pixels pixels) False . (\(x,_) -> x) . V.unsafeToForeignPtr0 main = do s <- ezInit CL_DEVICE_TYPE_ALL k <- kernelFromFile s "QuasiCrystalRGBA.cl" "quasiCrystal" let numPix = pixels * pixels pixels' = fromIntegral pixels :: CInt numAngles = 7 allAngles = angles numAngles frame :: Float -> IO Picture frame phase = mkPicture `fmap` runKernel s k pixels' scale phase numAngles allAngles (Out (numPix*4)) (Work2D pixels pixels) animateIO (InWindow "Quasicrystal" (pixels,pixels) (0,0)) black frame
On my laptop, running on the CPU consumes 300% of a dual core i5 (with hyperthreading), while running on the GPU consumes 85% CPU. While this example is somewhat more complicated, the ability to switch between CPU and GPU computation driving a visualization, all from an interactive interpreter is highly compelling. The animation runs at a reasonable clip on a dual core CPU, so try it out yourself! It should look something like this,
These programs may all be found in the examples directory of the CLUtil package.
P.S. When building executables on your system, you may need to supply
GHC with an -lOpenCL
option. This is not necessary in Mac OS X, where
we use the installed OpenCL framework, but is likely needed elsewhere.