Saturday, May 29, 2010

Intel SSE performance issue

The particle system's render buffer generation has been heavily refactored for better performance. A strange performance issue came up during this process... You can see two versions of the same code below. On an AMD CPU there is no performance difference between the first and the second rendering code fragments. But on an Intel Core i5 the difference is huge: the first version generates only 10M particles per second, while the second one reaches 60M particles per second!


/* render vertex format
 */
struct Vertex {
    union {
        struct {
            float xyz[3];
            int parameters; /* shares its 4 bytes with the last lane of vec */
        };
        __m128 vec;
    };
};
Vertex *v = ...;

/* first version of rendering code:
 * all four vector stores first, then all four parameter stores
 */
v[0].vec = _mm_add_ps(...);
v[1].vec = _mm_add_ps(...);
v[2].vec = _mm_add_ps(...);
v[3].vec = _mm_add_ps(...);

v[0].parameters = value_0;
v[1].parameters = value_1;
v[2].parameters = value_2;
v[3].parameters = value_3;

/* second version of rendering code:
 * each vector store is immediately followed by its parameter store
 */
v[0].vec = _mm_add_ps(...);
v[0].parameters = value_0;

v[1].vec = _mm_add_ps(...);
v[1].parameters = value_1;

v[2].vec = _mm_add_ps(...);
v[2].parameters = value_2;

v[3].vec = _mm_add_ps(...);
v[3].parameters = value_3;
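
To try the comparison outside the engine, here is a minimal, self-contained sketch of both store orders. It is not the engine code: the arguments of `_mm_add_ps(...)` are elided above, so `position()` is a hypothetical stand-in, and `fill_batched`, `fill_interleaved` and `particles_per_second` are names made up for this example. It also assumes the buffer is ordinary memory and that `count` is a multiple of 4. The anonymous struct inside the union is a compiler extension, accepted by Visual C++, GCC and Clang.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_set1_ps, _mm_set_ps
#include <cassert>
#include <chrono>
#include <cstddef>

// Same layout as the render vertex format in the post.
struct Vertex {
    union {
        struct {
            float xyz[3];
            int parameters;  // shares its 4 bytes with the last lane of vec
        };
        __m128 vec;
    };
};

// Hypothetical stand-in for the elided _mm_add_ps(...) arguments.
static inline __m128 position(__m128 origin, std::size_t i) {
    return _mm_add_ps(origin, _mm_set1_ps(static_cast<float>(i)));
}

// First version: four vector stores, then four parameter stores.
void fill_batched(Vertex *v, std::size_t count, __m128 origin) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        v[i + 0].vec = position(origin, i + 0);
        v[i + 1].vec = position(origin, i + 1);
        v[i + 2].vec = position(origin, i + 2);
        v[i + 3].vec = position(origin, i + 3);

        v[i + 0].parameters = static_cast<int>(i + 0);
        v[i + 1].parameters = static_cast<int>(i + 1);
        v[i + 2].parameters = static_cast<int>(i + 2);
        v[i + 3].parameters = static_cast<int>(i + 3);
    }
}

// Second version: each vector store is followed by its parameter store.
void fill_interleaved(Vertex *v, std::size_t count, __m128 origin) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        v[i + 0].vec = position(origin, i + 0);
        v[i + 0].parameters = static_cast<int>(i + 0);

        v[i + 1].vec = position(origin, i + 1);
        v[i + 1].parameters = static_cast<int>(i + 1);

        v[i + 2].vec = position(origin, i + 2);
        v[i + 2].parameters = static_cast<int>(i + 2);

        v[i + 3].vec = position(origin, i + 3);
        v[i + 3].parameters = static_cast<int>(i + 3);
    }
}

typedef void (*FillFn)(Vertex *, std::size_t, __m128);

// Rough throughput: vertices written per second over `repeats` passes.
double particles_per_second(FillFn fill, Vertex *v, std::size_t count, int repeats) {
    const __m128 origin = _mm_set_ps(0.0f, 3.0f, 2.0f, 1.0f);  // lanes: x=1, y=2, z=3
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        fill(v, count, origin);
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return static_cast<double>(count) * repeats / elapsed.count();
}
```

Calling `particles_per_second(fill_batched, ...)` and `particles_per_second(fill_interleaved, ...)` on the same 16-byte-aligned buffer gives two numbers to compare. Whether the gap described above shows up will depend on the CPU, the compiler and its optimization settings, and on where the buffer lives, so treat this only as a starting point for measurement.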

2 comments:

  1. The compiler is "Visual C++ 2008 Express Edition". It seems like an L1 cache miss occurs... I will try to post the disassembly later.