Saturday, May 29, 2010

PlayStation3 Unigine status VI

The first SPU-accelerated feature successfully implemented in Unigine is render buffer generation for particle systems. We support a wide range of particle shapes: billboard, flat, point, length, random. These shapes are generated fully asynchronously on all available SPUs, with optional depth sorting of particles, which doesn't affect performance much. The next target to be accelerated is MeshSkinned.

Unigine SPU runtime

Insomniac's papers on their SPU shaders and SPU job management inspired me to implement a similar system in Unigine. We use raw SPUs with an SPU-based shader loading scheme. The whole system, which parses ELF executable files itself and manages SPU shaders, is only 1270 lines of code :) We can run 10000 SPU shaders with up to 8 parameters each while consuming only 6.7ms of PPU time. Another tasty feature of this system is the ability to execute any function from an SPU ELF file.

SPU sorting

As you might know from the public papers, each SPU has only 256 Kb of local memory, but DMA requests are very fast... quickSort isn't an appropriate algorithm for the SPU architecture because of its branching and the large number of spatially non-coherent memory requests it generates. After several hours of trying to keep SPU quickSort performance at an acceptable level, including writing a software memory cache for the SPU, the resulting performance is slightly better than the PPU version, but only on small arrays:

elements: 1012 (16Kb)
quickSort PPU Time: 0.000 FPS: 3717.4719
quickSort SPU Time: 0.000 FPS: 5376.3438

elements: 2024 (32Kb)
quickSort PPU Time: 0.001 FPS: 1923.0769
quickSort SPU Time: 0.000 FPS: 2652.5200

elements: 4048 (64Kb)
quickSort PPU Time: 0.001 FPS: 819.6721
quickSort SPU Time: 0.001 FPS: 908.2652

elements: 8096 (128Kb)
quickSort PPU Time: 0.003 FPS: 398.2477
quickSort SPU Time: 0.002 FPS: 407.3320

elements: 16192 (256Kb)
quickSort PPU Time: 0.005 FPS: 187.7229
quickSort SPU Time: 0.006 FPS: 180.3752

elements: 32384 (512Kb)
quickSort PPU Time: 0.012 FPS: 86.3185
quickSort SPU Time: 0.013 FPS: 78.7030

elements: 64768 (1Mb)
quickSort PPU Time: 0.027 FPS: 37.6322
quickSort SPU Time: 0.029 FPS: 34.2044

elements: 129536 (2Mb)
quickSort PPU Time: 0.056 FPS: 17.8352
quickSort SPU Time: 0.063 FPS: 15.8253

elements: 259072 (4Mb)
quickSort PPU Time: 0.124 FPS: 8.0358
quickSort SPU Time: 0.139 FPS: 7.2073
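
The software memory cache mentioned above can be pictured with a minimal sketch. This is not the Unigine implementation: the class name `SoftwareCache`, the 128-byte line size, the 64-line capacity and the read-only, direct-mapped design are all assumptions made for illustration, and `memcpy` stands in for the DMA transfer an SPU would issue on a miss.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Direct-mapped, read-only software cache (illustrative sketch only).
// "Main memory" is a plain byte buffer whose size is assumed to be a
// multiple of LINE_SIZE; the cache lines play the role of SPU local memory.
class SoftwareCache {
public:
    static constexpr std::size_t LINE_SIZE = 128; // bytes per cache line (assumed)
    static constexpr std::size_t NUM_LINES = 64;  // 8 KB of cached data (assumed)

    explicit SoftwareCache(const unsigned char *main_memory)
        : memory(main_memory), hits(0), misses(0) {
        for (std::size_t i = 0; i < NUM_LINES; i++) tags[i] = EMPTY;
    }

    // Read a 32-bit value located at a 4-byte aligned byte offset.
    std::uint32_t read(std::size_t offset) {
        assert(offset % sizeof(std::uint32_t) == 0);
        const std::size_t line = offset / LINE_SIZE;
        const std::size_t slot = line % NUM_LINES;
        if (tags[slot] != line) {
            // Miss: fetch the whole line. On an SPU this would be a DMA get
            // followed by a wait for its completion.
            std::memcpy(lines[slot], memory + line * LINE_SIZE, LINE_SIZE);
            tags[slot] = line;
            misses++;
        } else {
            hits++;
        }
        std::uint32_t value;
        std::memcpy(&value, lines[slot] + offset % LINE_SIZE, sizeof(value));
        return value;
    }

    std::size_t hitCount() const { return hits; }
    std::size_t missCount() const { return misses; }

private:
    static constexpr std::size_t EMPTY = ~static_cast<std::size_t>(0);
    const unsigned char *memory;
    unsigned char lines[NUM_LINES][LINE_SIZE];
    std::size_t tags[NUM_LINES];
    std::size_t hits, misses;
};
```

Sequential reads hit the cache almost every time, while the scattered accesses of quickSort on a large array keep evicting lines, which is consistent with the SPU version falling behind as the arrays grow.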

It was very difficult to sleep after these poor results... On the morning of the second day I tried implementing radixSort... The SPU instruction set fits such algorithms very well. radixSort performance on local SPU memory turned out to be very good, especially once branch instructions were eliminated. Moreover, DMA list operations on the SPU are (surprise!) fast, and the resulting version of radixSort demonstrates an awesome speedup:

elements: 1012 (16Kb)
radixSort PPU Time: 0.000 FPS: 4081.6326
radixSort SPU Time: 0.000 FPS: 9615.3848

elements: 2024 (32Kb)
radixSort PPU Time: 0.000 FPS: 2617.8010
radixSort SPU Time: 0.000 FPS: 4032.2581

elements: 4048 (64Kb)
radixSort PPU Time: 0.001 FPS: 1333.3334
radixSort SPU Time: 0.000 FPS: 2237.1365

elements: 8096 (128Kb)
radixSort PPU Time: 0.001 FPS: 673.4007
radixSort SPU Time: 0.001 FPS: 1168.2242

elements: 16192 (256Kb)
radixSort PPU Time: 0.003 FPS: 288.8504
radixSort SPU Time: 0.002 FPS: 597.0150

elements: 32384 (512Kb)
radixSort PPU Time: 0.008 FPS: 124.1311
radixSort SPU Time: 0.003 FPS: 298.3294

elements: 64768 (1Mb)
radixSort PPU Time: 0.022 FPS: 45.3700
radixSort SPU Time: 0.007 FPS: 149.8352

elements: 129536 (2Mb)
radixSort PPU Time: 0.049 FPS: 20.5351
radixSort SPU Time: 0.013 FPS: 75.0413

elements: 259072 (4Mb)
radixSort PPU Time: 0.101 FPS: 9.9149
radixSort SPU Time: 0.027 FPS: 37.5587
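
For readers unfamiliar with the algorithm, here is a minimal portable sketch of a radix sort whose inner loops have no data-dependent branches. It is not the SPU code measured above: the `Item` layout, the 8-bit digit size and the `floatToKey` helper are assumptions for illustration, and there is no DMA or SIMD here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical element layout: a 32-bit sort key plus a payload index.
struct Item {
    std::uint32_t key;
    std::uint32_t index;
};

// Hypothetical helper: map a float to an unsigned key with the same ordering,
// so that depth values could be sorted by the integer radix sort below.
// Negative floats get all bits flipped, non-negative ones only the sign bit.
std::uint32_t floatToKey(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    const std::uint32_t mask = (0u - (bits >> 31)) | 0x80000000u;
    return bits ^ mask;
}

// Stable least-significant-digit radix sort: 4 passes of 8 bits each.
void radixSort(std::vector<Item> &items) {
    std::vector<Item> temp(items.size());
    Item *src = items.data();
    Item *dst = temp.data();
    const std::size_t count = items.size();

    for (int shift = 0; shift < 32; shift += 8) {
        // Count how many keys fall into each of the 256 buckets.
        std::size_t offsets[256] = {};
        for (std::size_t i = 0; i < count; i++) {
            offsets[(src[i].key >> shift) & 0xff]++;
        }
        // An exclusive prefix sum turns the counts into start offsets.
        std::size_t sum = 0;
        for (int b = 0; b < 256; b++) {
            const std::size_t c = offsets[b];
            offsets[b] = sum;
            sum += c;
        }
        // Scatter the elements into their buckets, preserving their order.
        for (std::size_t i = 0; i < count; i++) {
            dst[offsets[(src[i].key >> shift) & 0xff]++] = src[i];
        }
        // The output of this pass is the input of the next one.
        Item *swap = src;
        src = dst;
        dst = swap;
    }
    // After an even number of passes the sorted data is back in items.
    assert(src == items.data());
}
```

Each pass touches memory in long sequential runs plus 256 write streams, which is the kind of access pattern that maps well to streaming data through a small local memory, unlike the scattered partitioning of quickSort.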

Intel SSE performance issue

Particle system render buffer generation has been heavily refactored for better performance. A strange performance issue turned up during this process... You can see two versions of the same code below. On an AMD CPU there is no performance difference between the first and the second code fragments. But on an Intel Core i5 the difference is huge: the first version generates only 10M particles per second, while the second one reaches 60M particles per second!


/* render vertex format
*/
struct Vertex {
    union {
        struct {
            float xyz[3];
            int parameters;
        };
        __m128 vec;
    };
};
Vertex *v = ...;

/* first version of rendering code
*/
v[0].vec = _mm_add_ps(...);
v[1].vec = _mm_add_ps(...);
v[2].vec = _mm_add_ps(...);
v[3].vec = _mm_add_ps(...);

v[0].parameters = value_0;
v[1].parameters = value_1;
v[2].parameters = value_2;
v[3].parameters = value_3;

/* second version of rendering code
*/
v[0].vec = _mm_add_ps(...);
v[0].parameters = value_0;

v[1].vec = _mm_add_ps(...);
v[1].parameters = value_1;

v[2].vec = _mm_add_ps(...);
v[2].parameters = value_2;

v[3].vec = _mm_add_ps(...);
v[3].parameters = value_3;
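
Both versions above write each vertex twice: a 16-byte vector store followed by a 4-byte store into the same vertex. A possible third variant, sketched here as an assumption and not measured on either CPU, merges the integer parameter into the vector register first so that each vertex is written with a single store. The `storeVertex` helper is hypothetical, and the struct is redeclared without the anonymous union to keep the example standard C++.

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Same 16-byte layout as the render vertex format above.
struct Vertex {
    float xyz[3];
    std::int32_t parameters;
};
static_assert(sizeof(Vertex) == 16, "Vertex must be exactly one SSE register");

// Hypothetical helper: write one vertex with a single 16-byte store.
// The w lane of position is discarded and replaced by parameters.
inline void storeVertex(Vertex *v, __m128 position, std::int32_t parameters) {
    // All-ones mask in lanes 0..2 (x, y, z), zero in lane 3 (w).
    const __m128 keep_xyz = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    const __m128 xyz = _mm_and_ps(position, keep_xyz);
    // The integer bits go into lane 3; the other lanes are zero.
    const __m128 w = _mm_castsi128_ps(_mm_set_epi32(parameters, 0, 0, 0));
    // Unaligned store for safety; a 16-byte aligned buffer could use _mm_store_ps.
    _mm_storeu_ps(reinterpret_cast<float *>(v), _mm_or_ps(xyz, w));
}
```

With this helper the loop body becomes `storeVertex(&v[0], _mm_add_ps(...), value_0);` and so on for the other vertices. Whether it matches or beats the 60M particles per second of the second version would have to be measured.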

Monday, May 17, 2010

PlayStation3 Unigine status V

The PlayStation3 renderer has been refactored several times :) Now we have a stable and fast render pipeline without any CPU/GPU sync points. Render present time is always positive, and we can hide update, physics and command buffer generation time while the GPU renders the previous frame. All Unigine demos run stably and without any rendering artifacts. The main bottleneck is the GPU, and the SPUs should help a lot, especially with geometry culling :)

Sanctuary forward lighting:

Sanctuary pre-pass lighting:


Tropics forward lighting:

Tropics pre-pass lighting:


Heaven forward lighting:

Heaven pre-pass lighting:

Wednesday, May 12, 2010

Thursday, May 6, 2010

Updated ObjectWater shading

A new water subsurface shading state has been added. Over/under water transition artifacts have also been removed.




Saturday, May 1, 2010

PlayStation3 Unigine status V

The direct port to PlayStation3 is complete. Render and physics work properly with the expected performance. Time for optimizations is coming...

Unigine Editor on PlayStation3

PlayStation3, 1920x1080 video mode, with Sanctuary and the Unigine editor loaded. We can tune art assets directly on the PS3 with the Unigine editor. I will add world replication from the host PC to the PS3 next week...