Thread: odd speed optimization

  1. #1

    Hi,

    I was optimizing a small memset for speed, where I need to optimize the 128-512 byte case. This is my code:
    Code:
    _asm {
            pxor    mm0, mm0
            mov     edi, TempEndPtr
            mov     ecx, len
            xor     eax, eax
        _128_loop:
            movntq  [edi], mm0
            movntq  [edi+8], mm0
            movntq  [edi+16], mm0
            movntq  [edi+32], mm0
            movntq  [edi+40], mm0
            movntq  [edi+48], mm0
            movntq  [edi+56], mm0
            //
            movntq  [edi+64], mm0
            movntq  [edi+72], mm0
            movntq  [edi+80], mm0
            movntq  [edi+88], mm0
            movntq  [edi+96], mm0
            movntq  [edi+104], mm0
            movntq  [edi+112], mm0
            movntq  [edi+120], mm0
            sub     ecx, 128
            add     edi, 128
            cmp     ecx, 128
            jg      _128_loop
            je      _the_end
        _4_loop:
            mov     [edi], eax
            add     edi, 4
            sub     ecx, 4
            cmp     ecx, 0
            jg      _4_loop
        _the_end:
    To my surprise, it runs SLOWER than filling it manually with a mov [edi],eax / mov [edi+4],eax loop!!
    Even sadder, it runs SLOWER than:
    Code:
    mov     ecx, len
    mov     edi, TempEndPtr
    shr     ecx, 2
    xor     eax, eax
    inc     ecx
    rep stosd
    Any suggestions/comments?
    I want to know God's thoughts ...the rest are details.
    (A. Einstein)
    --------
    ..."a shellcode is a command you do at the linux shell"...

  2. #2
    Hi,

    I'm pretty sure you've seen it already, but nevertheless:

    http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2004-06/0004.html

    There was a discussion there about a similar problem. You might want to have a look.

    I have one question about your code. Are the numbers following edi in brackets decimal or hex? If they are hex, you have gaps between the stores, because the spacing between the memory locations is inconsistent. If they are decimal, you have a gap too: you jump from 16 to 32, leaving out 24. But that's probably totally unrelated.
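
    The gap can be confirmed mechanically. A minimal C sketch, assuming the offsets in the first post are decimal and that each movntq stores 8 bytes; the array and function names are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Offsets used by one pass of _128_loop as posted (assumed decimal). */
static const int posted_offsets[] = {
    0, 8, 16, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120
};

/* Returns the first byte offset in [0,128) that no 8-byte store covers,
   or -1 if the whole 128-byte block is covered. */
static int first_uncovered_offset(const int *offsets, size_t count)
{
    unsigned char covered[128];
    memset(covered, 0, sizeof covered);
    for (size_t i = 0; i < count; ++i) {
        assert(offsets[i] >= 0 && offsets[i] + 8 <= 128);
        memset(covered + offsets[i], 1, 8);   /* one movntq = 8 bytes */
    }
    for (int b = 0; b < 128; ++b)
        if (!covered[b])
            return b;
    return -1;
}
```

    Called with the posted offsets it returns 24; with 24 added to the list it returns -1.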

    Regards
    darkelf
    Last edited by Darkelf; January 10th, 2011 at 10:58.

  3. #3
    Avisynth filter SDK, the first result on Google...

    This is an interesting optimization and opcode, but as stated there it will slow down the function right after it (I'm guessing a timing routine)...
    No hate for the lost children;
    more love for the paths we walk,
    'words' shatter the truth we seek.
    from the heart and mind of Me
    me, to you.. down and across

    No more words from me, to you...
    Hate and love shatter the heart and Mind of Me.
    For the Lost Children;For the paths we walk; the real truth we seek!

  4. #4
    Yep, I missed the +24 indeed.

    However, the results are always the same: moving 8 bytes at a time (with fast GP regs, not MMX/SSE ones) is the fastest solution; every other move size (4, 16, 32, 64, 128) is slower.
    The oddity is that I'd expect write combining to take place, especially with the movntq variant. Also, I'm on an overclocked i5, so I'd expect SSE to be fast and free of the old Athlon penalties.

    mah...

  5. #5
    For moves of less than ~200 KB, STOSD/MOVSD is better.

    PS: ahm, that is for 32-bit. You are trying on 64-bit.

    PS2: how about writing 64 bytes of memory per loop iteration + SFENCE after each?

  6. #6
    Is the destination address 16-byte aligned?

  7. #7
    hi all,

    @evaluator: sfence is a store-serializing instruction: it would slow down the loop, since it forces the buffered memory writes out and waits for them. If anything, one might use it at the end of a transfer sequence to ensure that weak memory ordering doesn't cause trouble.
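
    A minimal C sketch of that placement: non-temporal stores in the loop, one fence after the whole run. It swaps in the SSE2 intrinsic _mm_stream_si128 (MOVNTDQ, 16 bytes) for the MMX movntq used above, and it assumes a 16-byte-aligned destination and a length that is a multiple of 16, which is exactly what the real buffers here do not guarantee. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#endif

/* Hypothetical helper. Assumes dst is 16-byte aligned and len is a
   multiple of 16; both are asserted rather than handled. */
static void zero_nt_fence_once(void *dst, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if defined(__SSE2__)
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < len / 16; ++i)
        _mm_stream_si128(p + i, zero);   /* non-temporal store, no fence here */
    _mm_sfence();                        /* single fence after the whole run */
#else
    memset(dst, 0, len);                 /* fallback when SSE2 is not available */
#endif
}
```

    Whether this beats plain stores for 128-512 bytes is exactly what has to be measured; the sketch only shows where the fence goes.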

    @gamingmaster: no, that's why I'm not using XMM registers: the transfer can be any size and happen at any alignment.

    The issue with REP MOVS is that its special circuitry only 'pops up' after a number of transfers (unless this has changed on more recent processors), and I do not transfer enough data at a time to cover the startup cost, so a simple mov wins over it for a small number of transferred bytes.
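
    That trade-off amounts to a size-based dispatch. A minimal C sketch, where SMALL_LIMIT is a made-up cut-over (the thread gives no number, so it has to be measured) and memset merely stands in for the bulk path:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cut-over point; the thread gives no number, so measure it. */
#define SMALL_LIMIT 512

static void zero_dispatch(void *dst, size_t len)
{
    unsigned char *p = (unsigned char *)dst;
    assert(dst != NULL || len == 0);

    if (len > SMALL_LIMIT) {
        memset(p, 0, len);      /* stand-in for the bulk (rep stos style) path */
        return;
    }
    const uint64_t zero = 0;
    while (len >= 8) {          /* 8 bytes per store, any alignment */
        memcpy(p, &zero, 8);    /* compilers typically lower this to one mov */
        p += 8;
        len -= 8;
    }
    while (len > 0) {           /* byte tail */
        *p++ = 0;
        --len;
    }
}
```

    An optimizing compiler may rewrite either loop (for example into a memset call), so check the generated code before timing it.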

    (By the way, coding the memory transfers in asm boosted the algorithm by 20%, and quickly recoding another routine from C to asm added almost the same... still, I remember those forums where idiots were saying that C compilers can make code equal to or even better than hand-written code... bah bah!)

    What I find odd, however, is that an unrolled MMX loop is slower than a mov x2 loop, especially when the number of transferred bytes is between 200 and 300.

    (PS: hehe, the IDIOT M$ compiler interpreted an INC [stuff] BY DEFAULT as a byte operation instead of a DWORD one - damn them! ...and it shows a 'smart warning' saying 'hey, you forgot emms!' puah!)
    Last edited by Maximus; January 12th, 2011 at 16:32.

  8. #8
    There is a nice tool in development on the MASM forums called testbed. Using it and giving a little feedback to the authors could help them greatly, and it might make your life easier, at least for comparative testing.
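
    This is not the testbed tool itself (its interface isn't shown in this thread), just a minimal C sketch of what comparative timing looks like, using the standard clock() function. The two fill routines and all names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef void (*fill_fn)(void *dst, size_t len);

/* Two candidate fills to compare; both are illustrative. */
static void fill_memset(void *dst, size_t len) { memset(dst, 0, len); }

static void fill_bytes(void *dst, size_t len)
{
    /* volatile keeps the byte loop as written instead of letting the
       compiler replace it with a library call */
    volatile unsigned char *p = (volatile unsigned char *)dst;
    for (size_t i = 0; i < len; ++i)
        p[i] = 0;
}

/* Runs fn `reps` times over the same buffer and returns CPU seconds. */
static double time_fill(fill_fn fn, void *buf, size_t len, long reps)
{
    assert(fn != NULL && reps > 0);
    clock_t start = clock();
    for (long r = 0; r < reps; ++r)
        fn(buf, len);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

    For buffers of a few hundred bytes the repetition count has to be large for clock() to resolve anything, and the buffer stays hot in cache across repetitions, so the numbers only compare routines against each other under that condition.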

    Regards BanMe

  9. #9
    Here's what MSVC does:
    Code:
              mov     edi, [ebp+arg_0]
              mov     ecx, [ebp+arg_4]
              shr     ecx, 7
              pxor    xmm0, xmm0
              jmp     short $L
              align 10h
    
    $L:
              movdqa  xmmword ptr [edi], xmm0
              movdqa  xmmword ptr [edi+10h], xmm0
              movdqa  xmmword ptr [edi+20h], xmm0
              movdqa  xmmword ptr [edi+30h], xmm0
              movdqa  xmmword ptr [edi+40h], xmm0
              movdqa  xmmword ptr [edi+50h], xmm0
              movdqa  xmmword ptr [edi+60h], xmm0
              movdqa  xmmword ptr [edi+70h], xmm0
              lea     edi, [edi+80h]
              dec     ecx
              jnz     short $L

  10. #10
    ...that code requires your data to be paragraph (16-byte) aligned, which is rarely the case for smaller buffers.
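
    To make the constraint concrete, here is a minimal C sketch of how an arbitrary buffer splits around 16-byte boundaries. Only the "body" part could be written with movdqa; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* How a buffer splits around 16-byte ("paragraph") boundaries. */
typedef struct {
    size_t head;   /* bytes before the first 16-byte boundary */
    size_t body;   /* bytes coverable by aligned 16-byte stores */
    size_t tail;   /* leftover bytes after the last full 16-byte chunk */
} para_split;

static para_split split_for_movdqa(uintptr_t addr, size_t len)
{
    para_split s;
    s.head = (size_t)((16 - (addr & 15)) & 15);
    if (s.head > len)
        s.head = len;              /* buffer ends before the boundary */
    s.body = (len - s.head) & ~(size_t)15;
    s.tail = len - s.head - s.body;
    assert(s.head + s.body + s.tail == len);
    return s;
}
```

    For a 130-byte buffer starting at an address ending in 3, this gives a 13-byte head, a 112-byte body and a 5-byte tail, so for small buffers a noticeable share of the work falls outside the aligned loop.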