View Full Version : odd speed optimization

January 10th, 2011, 09:05

I was optimizing a small memset for speed - I need to optimize the 128-512 byte case. This is my code:

_asm {
pxor mm0, mm0
mov edi, TempEndPtr
mov ecx, len
xor eax, eax
_128_loop:
movntq [edi], mm0
movntq [edi+8], mm0
movntq [edi+16], mm0
movntq [edi+32], mm0
movntq [edi+40], mm0
movntq [edi+48], mm0
movntq [edi+56], mm0
movntq [edi+64], mm0
movntq [edi+72], mm0
movntq [edi+80], mm0
movntq [edi+88], mm0
movntq [edi+96], mm0
movntq [edi+104], mm0
movntq [edi+112], mm0
movntq [edi+120], mm0
sub ecx, 128
add edi, 128
cmp ecx, 128
jg _128_loop
je _the_end
_4_loop:
mov [edi], eax
add edi, 4
sub ecx, 4
cmp ecx, 0
jg _4_loop
_the_end:
}

To my surprise, it runs SLOWER than filling the buffer manually with a mov [edi],eax / mov [edi+4],eax loop!!
Even sadder, it runs SLOWER than:

mov ecx, len
mov edi, TempEndPtr
shr ecx, 2
xor eax, eax
inc ecx
rep stosd

any suggestion/comment?

January 10th, 2011, 10:02

I'm pretty sure you've seen it already, but nevertheless:


There was a discussion going on about a similar problem. You might want to have a look.

I have one question about your code. Are the numbers following edi in brackets decimal or hex? If they are hex numbers you have gaps in between, because the spacing between the memory locations is quite inconsistent then. If they are decimal you have a gap too: you're jumping from 16 to 32, leaving out 24. But that's probably totally unrelated.


January 10th, 2011, 10:53
The Avisynth filter SDK - the first hit on Google...

This is an interesting optimization and opcode, but as stated it will slow down the function right after it - I'm guessing a timing routine...

January 10th, 2011, 15:06
Yep, I missed the +24, indeed.

However, the results are always the same: moving 8 bytes at a time (with fast GP regs, not MMX/SSE ones) is the fastest solution; every other move width (4, 16, 32, 64, 128) is slower.
The oddity lies in the fact that I'd expect write combining to take place - especially with the ntq variant. Also, I'm on an overclocked i5, so I'd expect SSE to be fast and free of the old Athlon penalties.


January 11th, 2011, 10:50
Moving less than ~200 KB is better with STOSD/MOVSD.

PS: ahm, that's for 32-bit; you are trying on 64-bit.

PS2: how about writing 64 bytes of memory per loop iteration + an SFENCE after each?

January 12th, 2011, 13:58
Is the destination address 16-byte aligned?

January 12th, 2011, 15:03
hi all,

@evaluator: SFENCE is a serializing instruction - it would slow down the loop and force waiting on the buffered memory writes. If anything, one might use it at the end of a transfer sequence to ensure that weak memory ordering doesn't cause trouble.

@gamingmaster: no, and that's why I'm not using XMM registers; the transfer can be any size, and happen at any alignment.

The issue with REP MOVS is that its special circuitry only 'pops up' after a number of transfers (unless that has changed on more recent processors), and I do not transfer enough data at a time to cover the startup cost - so a simple mov loop wins over it for a low number of transferred bytes.

(By the way, hand-coding the memory transfers in asm boosted the algorithm by 20%, and quickly recoding another routine from C to asm added almost as much... still, I remember those forums where idiots were saying that C compilers can make equal or even better code than manual coding... bah bah!)

What I find odd, however, is the fact that an unrolled MMX loop is slower than a two-mov loop, especially when the number of transferred bytes is between 200 and 300.

(PS: hehe, the IDIOT M$ compiler BY DEFAULT interpreted an INC [stuff] as a byte operation instead of a DWORD one - damn them! ...and it shows a 'smart warning' saying 'hey, you forgot emms!' puah!)

January 14th, 2011, 11:15
There is a nice tool in development on the MASM forums called testbed; this, with a little feedback to the authors, could help them greatly, and might make your life easier, at least for comparative testing.

Regards BanMe

January 14th, 2011, 15:35
Here's what MSVC does:
mov edi, [ebp+arg_0]
mov ecx, [ebp+arg_4]
shr ecx, 7
pxor xmm0, xmm0
jmp short $L
align 10h
$L:
movdqa xmmword ptr [edi], xmm0
movdqa xmmword ptr [edi+10h], xmm0
movdqa xmmword ptr [edi+20h], xmm0
movdqa xmmword ptr [edi+30h], xmm0
movdqa xmmword ptr [edi+40h], xmm0
movdqa xmmword ptr [edi+50h], xmm0
movdqa xmmword ptr [edi+60h], xmm0
movdqa xmmword ptr [edi+70h], xmm0
lea edi, [edi+80h]
dec ecx
jnz short $L

January 23rd, 2011, 18:41
...that code requires your data to be paragraph (16-byte) aligned, which is rarely the case for smaller buffers.