We have three different GPGPU APIs nowadays. All of them do the same job on the same hardware, but the results they show are completely different.
I have implemented a simple bitonic sort algorithm with each of these APIs, and the results are terrible.
GeForce GTX 260 CUDA:
quickSort Time: 0.042 FPS: 23.8498
CUBitonic Time: 0.012 FPS: 80.6452
GeForce GTX 260 OpenCL:
quickSort Time: 0.038 FPS: 26.3442
CLBitonic Time: 0.014 FPS: 71.5461
GeForce GTX 260 DirectCompute:
quickSort Time: 0.039 FPS: 25.8766
D3D11Bitonic Time: 0.031 FPS: 31.8350
Radeon HD 5850 DirectCompute:
quickSort Time: 0.036 FPS: 27.8854
D3D11Bitonic Time: 0.017 FPS: 58.3669
Radeon HD 4850 DirectCompute:
quickSort Time: 0.031 FPS: 32.1885
D3D11Bitonic Time: 0.059 FPS: 16.9869
The number of elements is 259072. The sorted structure is uint2. Times are in seconds.
Timings of quickSort are presented only for CPU comparison.
The OpenCL implementation doesn't work on AMD's current GPU OpenCL implementation.
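For readers who have not seen the algorithm, here is a minimal CPU sketch of a bitonic sorting network in Python. It is not the code that was benchmarked above, only an illustration of the compare-and-swap pattern that such kernels follow. The name `bitonic_sort`, the choice of the first component of each pair as the sort key, and the padding to a power of two are all assumptions made for this example.

```python
from typing import List, Tuple

# Stand-in for uint2. Using the first component as the key is an assumption.
Pair = Tuple[int, int]

# Sentinel key larger than any 32-bit unsigned value, used only for padding.
PAD_KEY = 0xFFFFFFFF + 1


def bitonic_sort(data: List[Pair]) -> List[Pair]:
    """Return the pairs sorted by key using an iterative bitonic network."""
    # The network needs a power-of-two length, so pad with sentinel keys
    # that sort after every real element (assumed strategy, see lead-in).
    n = 1
    while n < len(data):
        n *= 2
    buf = list(data) + [(PAD_KEY, 0)] * (n - len(data))

    k = 2
    while k <= n:          # length of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between the two compared elements
            # On a GPU each value of i would typically map to one thread.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    a, b = buf[i][0], buf[partner][0]
                    if (ascending and a > b) or (not ascending and a < b):
                        buf[i], buf[partner] = buf[partner], buf[i]
            j //= 2
        k *= 2

    # Drop the padding, which has sunk to the end of the buffer.
    return buf[:len(data)]
```

Every pass of the inner loop touches disjoint pairs, which is what makes the network map well onto GPU threads. Note that 259072 is not a power of two; how the benchmarked kernels deal with that is not shown here, and the padding above is just one possible approach.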
Are you using shared memory in your implementation?
I must say the timings look very odd. The 4850 is the one with the worst specs for this kind of computation, and the 5850 is surely much, much faster...
This implementation uses shared memory and the same kernels for all APIs. The bitonic block size is 512 and the transposition block size is 16x16.
What are the OpenCL results on the ATI cards?
The result is: "Display driver stopped responding and has recovered".
Which API do you like working in most?
Can I please have the algorithm or code for the quicksort in CUDA? I have to compare the performance of quicksort on various GPUs in my lab.
Thanks in advance.
Could you share the source code of these projects?