Saturday, May 29, 2010

Intel SSE performance issue

Particle system render buffer generation has been heavily refactored for better performance. A strange performance issue came up during this process. Below are two versions of the same code. On an AMD CPU there is no performance difference between the first and second rendering code fragments, but on an Intel Core i5 the difference is huge: the first version generates only 10M particles per second, while the second reaches 60M particles per second!


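/* render vertex format: xyz and parameters share the same 16 bytes as vec,
 * so parameters overlaps the fourth float lane of vec
 */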
struct Vertex {
union {
struct {
float xyz[3];
int parameters;
};
__m128 vec;
};
};
Vertex *v = ...;

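/* first version of rendering code: all four vector stores first,
 * then all four parameter stores (10M particles per second on Core i5)
 */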
v[0].vec = _mm_add_ps(...);
v[1].vec = _mm_add_ps(...);
v[2].vec = _mm_add_ps(...);
v[3].vec = _mm_add_ps(...);

v[0].parameters = value_0;
v[1].parameters = value_1;
v[2].parameters = value_2;
v[3].parameters = value_3;

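/* second version of rendering code: each vector store is followed
 * immediately by its parameter store (60M particles per second on Core i5)
 */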
v[0].vec = _mm_add_ps(...);
v[0].parameters = value_0;

v[1].vec = _mm_add_ps(...);
v[1].parameters = value_1;

v[2].vec = _mm_add_ps(...);
v[2].parameters = value_2;

v[3].vec = _mm_add_ps(...);
v[3].parameters = value_3;
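
For anyone who wants to measure the two orderings on their own CPU, here is a minimal sketch of a test harness. It is not the engine code: position() is a hypothetical stand-in for the elided _mm_add_ps(...) arguments, the function names are made up for illustration, and the vertex is written through a plain aligned struct rather than the union above. An optimizing compiler may reorder or merge the stores, so the disassembly should be checked before trusting any timing.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_add_ps, _mm_set1_ps, _mm_store_ps
#include <cassert>
#include <chrono>
#include <cstddef>

// Same 16-byte layout as the Vertex in the post, without the union.
struct alignas(16) Vertex {
    float xyz[3];
    int parameters;
};
static_assert(sizeof(Vertex) == 16, "Vertex must fill exactly one SSE register");

// Hypothetical stand-in for the elided _mm_add_ps(...) arguments.
inline __m128 position(std::size_t i) {
    return _mm_add_ps(_mm_set1_ps(1.0f), _mm_set1_ps(static_cast<float>(i)));
}

// Aligned 16-byte store covering the whole vertex, like `v[i].vec = ...`.
inline void store_vec(Vertex* v, __m128 value) {
    _mm_store_ps(reinterpret_cast<float*>(v), value);
}

// First ordering: four vector stores, then four parameter stores.
void fill_batched(Vertex* v, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        store_vec(&v[i + 0], position(i + 0));
        store_vec(&v[i + 1], position(i + 1));
        store_vec(&v[i + 2], position(i + 2));
        store_vec(&v[i + 3], position(i + 3));

        v[i + 0].parameters = static_cast<int>(i + 0);
        v[i + 1].parameters = static_cast<int>(i + 1);
        v[i + 2].parameters = static_cast<int>(i + 2);
        v[i + 3].parameters = static_cast<int>(i + 3);
    }
}

// Second ordering: each vector store is followed by its parameter store.
void fill_interleaved(Vertex* v, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        for (std::size_t k = i; k < i + 4; ++k) {
            store_vec(&v[k], position(k));
            v[k].parameters = static_cast<int>(k);
        }
    }
}

// Runs `fill` over the buffer `repeats` times and returns vertices per second.
template <typename Fill>
double particles_per_second(Fill fill, Vertex* v, std::size_t count, int repeats) {
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        fill(v, count);
    }
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return static_cast<double>(count) * repeats / elapsed.count();
}
```

Calling particles_per_second(fill_batched, buffer, count, repeats) and then the same with fill_interleaved on a 16-byte aligned buffer gives two numbers to compare. The buffer here is ordinary heap or static memory, which may behave differently from a mapped render buffer.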

3 comments:

  1. What compiler was used?
    Maybe this has something to do with the L1 cache? It would be interesting to see the instruction disassembly of the two versions, to check how the compiler handles the interleaved accesses to v[n], and to test other cases on the i5.

  2. The compiler is "Visual C++ 2008 Express Edition". It looks like an L1 cache miss is occurring... I will try to post the disassembly later.

  3. Weird, traditionally Intel has more L1 cache than AMD. What was the performance on AMD? 60M particles/s too?

    It would be worth trying GCC to see how it compiles both versions. Have you tried valgrind with cachegrind? I always use these tools to keep my code friendly to the cache and to branch prediction.
