Tuesday, February 2, 2010

CUDA vs OpenCL vs DirectCompute

There are three different GPGPU APIs nowadays. All of them do the same job on the same hardware, but the results they produce are completely different.

I have implemented a simple bitonic sort algorithm with each of these APIs, and the results are terrible.

GeForce GTX 260 CUDA:
quickSort Time: 0.042 FPS: 23.8498
CUBitonic Time: 0.012 FPS: 80.6452

GeForce GTX 260 OpenCL:
quickSort Time: 0.038 FPS: 26.3442
CLBitonic Time: 0.014 FPS: 71.5461

GeForce GTX 260 DirectCompute:
quickSort Time: 0.039 FPS: 25.8766
D3D11Bitonic Time: 0.031 FPS: 31.8350

Radeon HD 5850 DirectCompute:
quickSort Time: 0.036 FPS: 27.8854
D3D11Bitonic Time: 0.017 FPS: 58.3669

Radeon HD 4850 DirectCompute:
quickSort Time: 0.031 FPS: 32.1885
D3D11Bitonic Time: 0.059 FPS: 16.9869

The number of elements is 259072. The sorted element type is uint2.
The quickSort timings are included only as a CPU reference.
The OpenCL implementation doesn't work with AMD's current GPU OpenCL implementation.
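
For readers unfamiliar with bitonic sort, here is a minimal CPU sketch of the compare-exchange network that a GPU implementation parallelizes. It is not the code that was benchmarked above, only an illustration under a few assumptions: the UInt2 struct is a hypothetical stand-in for the GPU uint2 type, the ordering is lexicographic on (x, y) because the post does not say which component is the key, and the input is padded to the next power of two with maximum-value sentinels (259072 is not a power of two, and how the original kernels handle that is not stated).

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for the GPU uint2 type: two 32-bit unsigned values.
struct UInt2 {
    std::uint32_t x;
    std::uint32_t y;
};

// Assumed ordering: lexicographic on (x, y).
inline bool lessThan(const UInt2& a, const UInt2& b) {
    return a.x != b.x ? a.x < b.x : a.y < b.y;
}

// Sorts data in ascending order with a bitonic network.
void bitonicSort(std::vector<UInt2>& data) {
    const std::size_t n = data.size();
    if (n < 2) return;

    // The network needs a power-of-two length, so pad with sentinels
    // that compare greater than or equal to every real element.
    std::size_t padded = 1;
    while (padded < n) padded <<= 1;
    const std::uint32_t maxValue = std::numeric_limits<std::uint32_t>::max();
    data.resize(padded, UInt2{maxValue, maxValue});

    // k is the length of the bitonic sequences being merged,
    // j is the distance between the two elements of each compare-exchange.
    for (std::size_t k = 2; k <= padded; k <<= 1) {
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {
            // On a GPU, each iteration of this loop would be one thread.
            for (std::size_t i = 0; i < padded; ++i) {
                const std::size_t partner = i ^ j;
                if (partner <= i) continue;  // handle each pair once

                const bool ascending = (i & k) == 0;
                const bool outOfOrder = ascending
                    ? lessThan(data[partner], data[i])
                    : lessThan(data[i], data[partner]);
                if (outOfOrder) std::swap(data[i], data[partner]);
            }
        }
    }

    // Sentinels end up at the tail; drop them.
    data.resize(n);
}
```

The two outer loops are inherently sequential, while the inner loop over i has no dependencies between iterations. That inner loop is the part a GPU kernel spreads across threads, and the part where shared memory helps while j stays within one block.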


  1. Are you using shared memory in your implementation?
    I must say the timings look very odd... The 4850 is the one with the worst specs for this kind of computation; for sure the 5850 is much, much faster...

  2. This implementation uses shared memory and the same kernels for all APIs. The bitonic block size is 512, and the transposition block size is 16x16.

  3. What are the OpenCL results on the ATI cards?

  4. The result is: "Display driver stopped responding and has recovered".

  5. Which API do you like working in most?

  6. Can I please have the algorithm or code for the quicksort in CUDA? I have to compare the performance of quicksort on various GPUs in my lab.
    Thanks in advance.

  7. Could you share the source code of these projects?