There are currently three GPGPU APIs: CUDA, OpenCL, and DirectCompute. All of them do the same job on the same hardware, yet the results they produce are completely different.
I implemented a simple bitonic sort on each of these APIs, and the results are terrible.
GeForce GTX 260 CUDA:
quickSort Time: 0.042 FPS: 23.8498
CUBitonic Time: 0.012 FPS: 80.6452
GeForce GTX 260 OpenCL:
quickSort Time: 0.038 FPS: 26.3442
CLBitonic Time: 0.014 FPS: 71.5461
GeForce GTX 260 DirectCompute:
quickSort Time: 0.039 FPS: 25.8766
D3D11Bitonic Time: 0.031 FPS: 31.8350
Radeon HD 5850 DirectCompute:
quickSort Time: 0.036 FPS: 27.8854
D3D11Bitonic Time: 0.017 FPS: 58.3669
Radeon HD 4850 DirectCompute:
quickSort Time: 0.031 FPS: 32.1885
D3D11Bitonic Time: 0.059 FPS: 16.9869
The number of elements is 259072. Each sorted element is a uint2.
The quickSort timings are CPU results, included only for comparison.
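To make the comparison easier to read, the ratio of the two timings in each pair gives the GPU speedup over the CPU baseline. This is only a small sketch: the numbers are copied from the results above, Time is taken to be in seconds (the FPS column is roughly 1/Time), and the `timings` table and `speedup` helper are names made up for this example.

```python
# Timings copied from the results above, assumed to be seconds per sort:
# (quickSort on the CPU, bitonic sort on the GPU)
timings = {
    "GeForce GTX 260 CUDA": (0.042, 0.012),
    "GeForce GTX 260 OpenCL": (0.038, 0.014),
    "GeForce GTX 260 DirectCompute": (0.039, 0.031),
    "Radeon HD 5850 DirectCompute": (0.036, 0.017),
    "Radeon HD 4850 DirectCompute": (0.031, 0.059),
}


def speedup(cpu_time, gpu_time):
    """How many times faster the GPU sort is than the CPU quickSort."""
    return cpu_time / gpu_time


for name, (cpu, gpu) in timings.items():
    print(f"{name}: {speedup(cpu, gpu):.2f}x")
```

On the same GTX 260, CUDA comes out at about 3.5x over quickSort, while the Radeon HD 4850 under DirectCompute falls below 1x, i.e. slower than the CPU.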
The OpenCL implementation does not work with AMD's current OpenCL GPU implementation.
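For readers who don't know the algorithm, here is a minimal CPU reference sketch of a bitonic sorting network. It is not the kernel code used in the benchmark. It assumes each uint2 is a (key, value) pair sorted by key, and it pads the input to a power of two with sentinel elements, since 259072 is not a power of two. Both are assumptions of this sketch, not details taken from the actual implementations.

```python
def bitonic_sort(pairs):
    """Sort (key, value) pairs by key using an iterative bitonic network.

    Keys are assumed to be 32-bit unsigned integers.
    """
    n = len(pairs)
    size = 1
    while size < n:
        size *= 2

    # Assumption: pad to a power of two with sentinels whose key is larger
    # than any 32-bit key, so they end up at the tail and can be cut off.
    sentinel = (1 << 32, 0)
    data = list(pairs) + [sentinel] * (size - n)

    k = 2
    while k <= size:              # length of the sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            # On a GPU each index i would be handled by its own thread,
            # with a barrier (or a new kernel launch) between passes.
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    a = data[i][0]
                    b = data[partner][0]
                    if (ascending and a > b) or (not ascending and a < b):
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2

    return data[:n]


if __name__ == "__main__":
    sample = [(7, 0), (3, 1), (9, 2), (1, 3), (4, 4)]
    print(bitonic_sort(sample))
```

Every pass of the inner loop touches each element exactly once and the comparisons within a pass are independent, which is why the algorithm maps well to GPUs. The number of passes grows as log²(n), so differences in per-pass overhead between the APIs add up quickly.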