Sometimes perfectly correct code can produce huge stalls because of cache misses.
For example, suppose we have two 3D arrays, source ([128][128][16]) and destination ([16][128][128]), and we need to copy data from source to destination.
A straightforward copy takes 5.0 ms on an old Athlon 64 X2 3800+ CPU in this case.
But if we create a temporary array, copy ([16][129][129]), and copy the data twice (from source to copy, then from copy to destination), the copying time drops to only 1.9 ms.
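
To make the two variants concrete, here is a minimal sketch in C. The post gives only the array shapes and the timings, so several things here are assumptions: the element type (float), the index mapping (dst[k][i][j] = src[i][j][k]), the loop order, and the function names copy_direct and copy_via_padded, which are hypothetical. One plausible reading of the speedup is that in the direct copy the 16 writes for consecutive k are 128 * 128 * 4 = 64 KB apart, so they can compete for the same cache sets, while the padded 129 * 129 layout makes that stride a non-power-of-two and spreads the writes across sets.

```c
#include <stddef.h>
#include <string.h>

enum { NI = 128, NJ = 128, NK = 16 };  /* source is [NI][NJ][NK] */
enum { TI = NI + 1, TJ = NJ + 1 };     /* padded temporary is [NK][TI][TJ] */

/* One-pass copy (assumed mapping): dst[k][i][j] = src[i][j][k].
   Consecutive k values are written NI * NJ floats apart, a power-of-two stride. */
void copy_direct(float *dst, const float *src)
{
    for (size_t i = 0; i < NI; ++i)
        for (size_t j = 0; j < NJ; ++j)
            for (size_t k = 0; k < NK; ++k)
                dst[(k * NI + i) * NJ + j] = src[(i * NJ + j) * NK + k];
}

/* Two-pass copy through a padded temporary holding NK * TI * TJ floats. */
void copy_via_padded(float *dst, const float *src, float *tmp)
{
    /* Pass 1: same transposition, but the padded strides are not powers of two. */
    for (size_t i = 0; i < NI; ++i)
        for (size_t j = 0; j < NJ; ++j)
            for (size_t k = 0; k < NK; ++k)
                tmp[(k * TI + i) * TJ + j] = src[(i * NJ + j) * NK + k];

    /* Pass 2: copy each contiguous row of NJ floats, skipping the padding. */
    for (size_t k = 0; k < NK; ++k)
        for (size_t i = 0; i < NI; ++i)
            memcpy(&dst[(k * NI + i) * NJ], &tmp[(k * TI + i) * TJ],
                   NJ * sizeof(float));
}
```

Both functions should produce the same destination contents; only the memory access pattern differs. Whether the second one is actually faster depends on the CPU's cache organization, so it is worth timing both on the target machine.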
Could you please explain in more detail how you achieve this performance boost? Did you change the memory access pattern when copying to the temporary buffer? Or is there some relation between the cache size and the temporary buffer's non-power-of-2 size?
Thank you