Thursday, February 18, 2010

OpenCL DDS compression

DDS compression is a heavy task for the CPU, but it can be accelerated on the GPU. We use our internal DDS compressor, which produces good-quality images. With a small loss of quality, we can port this algorithm to the GPU and obtain an 18x performance gain.

Name: GeForce GTX 260
Vendor: NVIDIA Corporation
Version: OpenCL 1.0 CUDA

RGB8 1024 768
Image Time: 0.338 FPS: 2.9617 RMS: 1.976 SNR: 42.214


CLImage Time: 0.018 FPS: 54.8908 RMS: 2.111 SNR: 41.642
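The numbers above can be cross-checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the SNR formula (20 * log10(255 / RMS), the usual peak signal-to-noise ratio for 8-bit channels) is an assumption that happens to match the reported values, not something taken from our compressor's source.

```python
import math


def rms_error(original, compressed):
    """Root-mean-square difference between two equal-length
    sequences of 8-bit channel values."""
    if len(original) != len(compressed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((a - b) ** 2 for a, b in zip(original, compressed))
    return math.sqrt(total / len(original))


def snr_db(rms, peak=255.0):
    """Assumed metric: peak signal-to-noise ratio in dB for a given RMS error."""
    if rms == 0:
        return math.inf  # identical images
    return 20.0 * math.log10(peak / rms)


def speedup(reference_time, new_time):
    """How many times faster new_time is than reference_time."""
    return reference_time / new_time


if __name__ == "__main__":
    print("CPU SNR:", snr_db(1.976))
    print("GPU SNR:", snr_db(2.111))
    print("Speedup:", speedup(0.338, 0.018))
```

Under that assumption, the RMS values 1.976 and 2.111 reproduce the reported SNR figures to within rounding, and the two timings give a speedup between 18x and 19x.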

Tuesday, February 9, 2010

Windows 7 Commemorative Edition

A while ago we received this Windows 7 Commemorative Edition box (for taking part in the Win7 launch event with our Heaven DirectX 11 Benchmark):


Nice.

Strange moire on HD5850

Sometimes a strange moire pattern appears on the HD5850 under Windows 7. It disappears after switching the video mode...



Please don't die.

Thursday, February 4, 2010

CUDA vs OpenCL vs DirectCompute Part III

DirectCompute limits what the GPU can do... Even when the GPU is physically able to write data into several UAV targets, Direct3D11 says that it can't. So we can't use DX10-level cards for physics simulation via the Direct3D11 API...

This is a small test application which runs a simple particle physics simulation. There are 40625 particles on a static mesh consisting of 1392 triangles.

GTX260 CUDA (Windows7 11.5ms per frame)


GTX260 CUDA (Linux 13.3ms per frame)


GTX260 OpenCL (Windows7 13.9ms per frame)


GTX260 OpenCL (Linux 15.6ms per frame)


HD5850 Direct3D11 (Windows7 15.6ms per frame)

CUDA vs OpenCL vs DirectCompute Part II

Interop between the compute and graphics APIs is very important. It lets us render generated geometry directly on the GPU, without an additional data copy through system memory. For example, in cloth simulation we can do the physics and calculate the tangent space entirely on the GPU.

Here are some tests of interop and compute API performance:
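To make the tangent space step concrete, here is a minimal CPU-side sketch of the standard per-triangle tangent and bitangent computation from positions and texture coordinates. It is not our GPU kernel; the function name and the tuple-based vectors are assumptions made for illustration. On the GPU the same arithmetic would run once per triangle, with the results accumulated per vertex.

```python
def triangle_tangent_space(p0, p1, p2, uv0, uv1, uv2):
    """Return (tangent, bitangent) for one triangle, or None when the
    texture coordinates are degenerate (zero UV area).

    p0..p2 are (x, y, z) positions, uv0..uv2 are (u, v) coordinates.
    The vectors are not normalized.
    """
    # Triangle edges in object space.
    e1 = tuple(b - a for a, b in zip(p0, p1))
    e2 = tuple(b - a for a, b in zip(p0, p2))

    # The same edges in texture space.
    du1, dv1 = uv1[0] - uv0[0], uv1[1] - uv0[1]
    du2, dv2 = uv2[0] - uv0[0], uv2[1] - uv0[1]

    det = du1 * dv2 - du2 * dv1
    if det == 0:
        return None  # UVs collapse to a line or a point

    r = 1.0 / det
    tangent = tuple((a * dv2 - b * dv1) * r for a, b in zip(e1, e2))
    bitangent = tuple((b * du1 - a * du2) * r for a, b in zip(e1, e2))
    return tangent, bitangent


if __name__ == "__main__":
    # A triangle in the XY plane whose UVs equal its XY coordinates.
    print(triangle_tangent_space((0, 0, 0), (1, 0, 0), (0, 1, 0),
                                 (0, 0), (1, 0), (0, 1)))
```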

GTX260 OpenGL + CUDA (Windows7 128 FPS)


GTX260 OpenGL + CUDA (Linux 143 FPS)


GTX260 OpenGL + OpenCL (Windows7 151 FPS)


GTX260 OpenGL + OpenCL (Windows7 167 FPS)


GTX260 Direct3D11 (Windows7 100 FPS)


HD5850 Direct3D11 (Windows7 214 FPS)


The same GPU shows a 50% difference in performance across different APIs. That is a nightmare.

Tuesday, February 2, 2010

CUDA vs OpenCL vs DirectCompute

We have three different GPGPU APIs nowadays. All of them do the same job on the same hardware, but the results they show are completely different.

I have implemented a simple bitonic sort algorithm on each of these APIs, and the results are terrible.

GeForce GTX 260 CUDA:
quickSort Time: 0.042 FPS: 23.8498
CUBitonic Time: 0.012 FPS: 80.6452

GeForce GTX 260 OpenCL:
quickSort Time: 0.038 FPS: 26.3442
CLBitonic Time: 0.014 FPS: 71.5461

GeForce GTX 260 DirectCompute:
quickSort Time: 0.039 FPS: 25.8766
D3D11Bitonic Time: 0.031 FPS: 31.8350

Radeon HD 5850 DirectCompute:
quickSort Time: 0.036 FPS: 27.8854
D3D11Bitonic Time: 0.017 FPS: 58.3669

Radeon HD 4850 DirectCompute:
quickSort Time: 0.031 FPS: 32.1885
D3D11Bitonic Time: 0.059 FPS: 16.9869

The number of elements is 259072. The sorted structure is uint2.
The quickSort timings are given only for CPU comparison.
The OpenCL implementation doesn't work with the current AMD OpenCL implementation for GPUs.
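For readers unfamiliar with bitonic sort, here is a small CPU reference sketch of the sorting network. It is not the code used in the tests above. Treating each uint2 as a (key, value) pair sorted by key, and padding the input to a power of two with sentinels, are assumptions made for this example. The point is the structure: every pass of the inner loop consists of independent compare-and-swap operations, which is why the algorithm maps well onto GPU threads.

```python
def bitonic_sort(items, key=lambda item: item[0]):
    """Sort a list with a bitonic sorting network and return a new list.

    The network needs a power-of-two length, so the input is padded
    with None sentinels that compare greater than any real element.
    The sort is not stable.
    """
    n = len(items)
    if n == 0:
        return []

    size = 1
    while size < n:
        size *= 2
    data = list(items) + [None] * (size - n)

    def greater(a, b):
        # None acts as +infinity.
        if a is None:
            return b is not None
        if b is None:
            return False
        return key(a) > key(b)

    k = 2  # length of the bitonic sequences being merged
    while k <= size:
        j = k // 2  # distance between compared elements
        while j > 0:
            # On a GPU, each index i would be handled by its own thread.
            for i in range(size):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if ascending:
                        out_of_order = greater(data[i], data[partner])
                    else:
                        out_of_order = greater(data[partner], data[i])
                    if out_of_order:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2

    # The sentinels have moved to the end; drop them.
    return data[:n]


if __name__ == "__main__":
    pairs = [(5, 0), (1, 1), (4, 2), (1, 3), (9, 4), (0, 5), (7, 6)]
    print(bitonic_sort(pairs))
```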