
Thread: Good binary code profilers?

  1. #16
    (PART 2)

    I had pretty much stopped working on the first attempt, not sure if it would ever see the light of day.
    I'd kick myself (or slap myself in the head), thinking "there has to be a better way!".

    Then, I ran into Pedram's blog (and sort of hijacked it) one day:

    Wow! A way to do this in hardware you say?

    I used the "single step on branch" and LBR ("last branch recording") method.
    Now this tool is really taking shape!
    Gone are the problems of the "break point on every function" approach.
    No need to attempt a preprocess to get function entry points, no code modifications, etc.
    The hardware does many things for you. Looking at the branch recorders you can see the "to" and "from" of every call.
    It's in hardware, no need for a megadollar hardware ICE, etc., it's all there in the CPU. With the right tool and setup anyone with a modern PC can use this.

    First, here is a screenshot of my current alpha tool.
    As a working title I call it "CFSearch" after "TSearch" ("CSearch" is already taken):

    What you see here is a list of call hits. On the tool bar, to the left is the "Save List", then "refresh", some filter, a pause/play button, etc.
    You run the GUI front end and attach it to whatever process you want (the tool can only run on one at a time right now).
    The right panel is the "keeper list".
    You can select which thread, or all threads, to attach to.
    Although it's "real time", a target process does slow down considerably while tracing (somewhere around 60 to 90% slower). If the process is multi-threaded (as my main test targets were), selecting only the main thread can really help speed wise.

    The single step branch thing works on pretty much every Intel CPU from P2'ish to present, and on the AMD64 generation or better.

    The setup steps (for user mode version):
    1) Again using a DLL injected into the target for maximum speed and versatility, I install my own exception handler to handle the "single step on branch" exception.
    2) Set up the CPU MSR registers (for each logical core).
    3) Turn on the trap flag for threads via "SetThreadContext()" to start tracing.
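For reference, the bits behind steps 2 and 3 can be written out as plain constants. This is only a sketch of the bit arithmetic in Python, using the layout documented in the Intel manuals (debug control MSR at index 0x1D9, trap flag at EFLAGS bit 8). The helper names are made up for illustration, and actually writing the MSR needs ring 0 code (WRMSR), which is not shown.

```python
# Bit layout as documented in the Intel manuals; helper names are hypothetical.
IA32_DEBUGCTL = 0x1D9      # MSR index of the debug control register
DEBUGCTL_LBR = 1 << 0      # record the last branches in the LBR stack
DEBUGCTL_BTF = 1 << 1      # make EFLAGS.TF trap on branches, not on every instruction
EFLAGS_TF = 1 << 8         # trap flag in EFLAGS


def debugctl_for_branch_stepping(current: int = 0) -> int:
    """Value to write to the debug control MSR for single step on branch with LBR."""
    return current | DEBUGCTL_LBR | DEBUGCTL_BTF


def eflags_with_trap(eflags: int) -> int:
    """EFLAGS value to place in a thread CONTEXT before SetThreadContext()."""
    return eflags | EFLAGS_TF


if __name__ == "__main__":
    print(hex(debugctl_for_branch_stepping()))
    print(hex(eflags_with_trap(0x202)))
```

The trap flag has to be set again on each exception, since the CPU clears it when the single-step trap is delivered.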

    The heart of the action is in the exception handler.
    For maximum speed I create shared linear buffers for each code section to act as 32-bit hit counters. This is a big part of the design; maybe a perfect hash function, etc., would work to save memory.
    The per-call overhead has to be minimal for a real time tool.
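A minimal sketch of the linear hit-counter idea, assuming one counter slot per byte of the code section so that a lookup is just a subtraction and an index. The class name and the section addresses are hypothetical, and the real tool keeps these buffers in shared memory rather than in a Python object.

```python
from array import array


class SectionHits:
    """Linear hit counters covering one code section, one slot per byte."""

    def __init__(self, base: int, size: int):
        self.base = base
        # 'I' is an unsigned int, 32 bits wide on common platforms.
        self.counts = array("I", [0]) * size

    def hit(self, target: int) -> bool:
        """Count a call landing on `target`; False if it is outside this section."""
        offset = target - self.base
        if 0 <= offset < len(self.counts):
            if self.counts[offset] < 0xFFFFFFFF:   # saturate instead of wrapping
                self.counts[offset] += 1
            return True
        return False

    def nonzero(self) -> dict:
        """Snapshot of {address: hits} for every address that was hit."""
        return {self.base + i: c for i, c in enumerate(self.counts) if c}


if __name__ == "__main__":
    text = SectionHits(base=0x401000, size=0x1000)
    for addr in (0x401010, 0x401010, 0x401200):
        text.hit(addr)
    print({hex(a): n for a, n in text.nonzero().items()})
```

The trade is memory for speed: four bytes of counter per byte of code, but no hashing or searching on the hot path.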

    Another component of the exception handler (the worker) is a mini-code analyzer to reject all branch exceptions except for calls.
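A rough sketch of what such a call filter could look like, assuming 32-bit x86 code and that the handler can read the bytes at the branch "from" address. It only looks at the opcode (and the ModRM reg field for the FF group); the function name is hypothetical and this is not the tool's actual analyzer.

```python
# Legacy prefixes that may precede the opcode in 32-bit code.
LEGACY_PREFIXES = {0x26, 0x2E, 0x36, 0x3E, 0x64, 0x65, 0x66, 0x67, 0xF0, 0xF2, 0xF3}


def is_call(code: bytes) -> bool:
    """Return True if the 32-bit x86 instruction starting at code[0] is a CALL."""
    i = 0
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        i += 1
    if i >= len(code):
        return False
    opcode = code[i]
    if opcode in (0xE8, 0x9A):            # call rel32, call far ptr16:32
        return True
    if opcode == 0xFF and i + 1 < len(code):
        reg = (code[i + 1] >> 3) & 7      # ModRM.reg selects the member of the FF group
        return reg in (2, 3)              # /2 = call r/m32, /3 = call far m16:32
    return False


if __name__ == "__main__":
    samples = {
        "call rel32": bytes.fromhex("e800000000"),
        "call eax": bytes.fromhex("ffd0"),
        "jmp eax": bytes.fromhex("ffe0"),
        "jz short": bytes.fromhex("7405"),
        "ret": bytes.fromhex("c3"),
    }
    for name, raw in samples.items():
        print(name, is_call(raw))
```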
    There is a little IPC between the front end and the target DLL to synchronize some events, but overall the DLL operates independently, again to reduce overhead. The front end grabs a copy of the hit lists with some extra flags to apply delta/filter operations.
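The delta/filter step on the front end side might look something like the following, assuming each snapshot is a plain {address: hit count} mapping. Both function names are hypothetical.

```python
def delta_hits(previous: dict, current: dict) -> dict:
    """Addresses whose hit count grew since the previous snapshot, with the increase."""
    return {
        addr: count - previous.get(addr, 0)
        for addr, count in current.items()
        if count > previous.get(addr, 0)
    }


def drop_known(delta: dict, ignored: set) -> dict:
    """Filter out addresses already dismissed as uninteresting."""
    return {addr: n for addr, n in delta.items() if addr not in ignored}


if __name__ == "__main__":
    before = {0x401000: 5, 0x401200: 1}
    after = {0x401000: 9, 0x401200: 1, 0x401300: 2}
    fresh = drop_known(delta_hits(before, after), ignored={0x401000})
    print({hex(a): n for a, n in fresh.items()})
```

Taking one snapshot before an action in the target and one after, then looking only at the delta, is what narrows thousands of call hits down to the handful tied to that action.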

    Having the exception handler inside the process space makes it faster than an external one, but it could still be much faster.
    There is all that overhead in the OS, from the hardware breakpoint to the kernel, then the kernel dispatching the exception via LPC to the process's space, etc.
    Also it crashed a lot from exception frame conflicts in "ntdll.dll".
    Probably the bigger (and continuing) issue is that Windows is not aware of the branch trace mechanism. It doesn't know you are using single step on branch, it doesn't know you set those MSRs, it's pretty much oblivious to the whole condition (understandably).

    As the next step, I removed the DLL exception handler component and replaced it with a simple KMD. It hooks int1 to handle the single step on branch exception directly in kernel space.
    This gave the whole process a big speed boost. Gone is the majority of the user mode overhead, and crashes are now almost nonexistent.
    Plus it's nearly transparent to the target process, with the exception of having to set the trap flag(s) in it.

    Some downsides: for one, it presently won't work on Vista without disabling PatchGuard, since it's not "legal" in Vista to change interrupt vectors.
    Another is that branch single-stepping is set on a global level. So you can't, say, debug in Olly and run CFSearch at the same time, because Olly expects the default behavior for single steps.
    But there are workarounds for these (such as an Olly setting to use HWBP stepping).
    It really needs a kernel hook on context switching to turn on and off depending on what process the context is in, etc.

    CURRENT STATE of research and the tool:
    My alpha tool works pretty well, but it could be better.
    When I run it on a big game, although workable, the game slows to a near crawl. Even using a KMD it's still relatively slow.
    Each processor exception takes so many cycles regardless of how well I optimize the KMD code.

    If you look further in the Intel manuals you will find the "debug store" (DS) mechanism. Basically, the LBR data can be set to be recorded to a special memory buffer without the need for an exception on every branch!
    The buffer can either be polled, or set up to raise an interrupt when it's nearly full.
    This could potentially be a big speedup.
    Although maybe not as good as it sounds, because apparently (from the one or two posts I can find on the subject) the CPU operates in a less optimal mode when DS store is turned on.
    Nonetheless, it has to be tried.
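To make the buffer idea concrete, here is a sketch that parses a simulated branch trace buffer, assuming the 32-bit record format described in the Intel manual (three dwords per record: from, to, flags) and a "BTS index" that points just past the last record written. The control bit positions shown are the ones Intel documents for the architectural debug control MSR; older families such as the Pentium 4 place them elsewhere, so check the manual for the CPU at hand. The buffer address and function names are made up.

```python
import struct

IA32_DS_AREA = 0x600        # MSR holding the linear address of the DS save area
DEBUGCTL_TR = 1 << 6        # enable branch trace messages
DEBUGCTL_BTS = 1 << 7       # route those messages into the memory buffer
DEBUGCTL_BTINT = 1 << 8     # interrupt at the threshold instead of wrapping around

BTS_RECORD = struct.Struct("<III")   # from, to, flags (32-bit record format)


def parse_bts(buffer: bytes, base: int, index: int) -> list:
    """Return (from, to) pairs for the records between the buffer base and the index."""
    used = index - base
    return [
        BTS_RECORD.unpack_from(buffer, offset)[:2]
        for offset in range(0, used, BTS_RECORD.size)
    ]


def needs_drain(index: int, threshold: int) -> bool:
    """Polling check: has the write position reached the chosen threshold?"""
    return index >= threshold


if __name__ == "__main__":
    base = 0x80001000   # hypothetical buffer address
    raw = BTS_RECORD.pack(0x401000, 0x402000, 0) + BTS_RECORD.pack(0x402010, 0x401005, 0)
    raw += bytes(BTS_RECORD.size)   # one unused slot
    index = base + 2 * BTS_RECORD.size
    for src, dst in parse_bts(raw, base, index):
        print(hex(src), "->", hex(dst))
    print(needs_drain(index, base + 2 * BTS_RECORD.size))
```

After draining, the consumer resets the index back to the buffer base so recording can start over from the first slot.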

    Some downsides: it's more processor version specific, and it's not supported on the AMD64 (at least not publicly documented).

    This is where I am at now, and a bit stuck at the moment.
    There is only a tiny amount of information available on the DS store mechanism, most of which is the raw description in the Intel manual.
    I can find no prior research on anyone attempting to do this on Windows.
    There is a tiny amount of documentation and source for Linux "perfmon".

    So far in my DS store attempts, even in the "polling" setup, it appears to record only the first branch and then stop recording.
    However, while I have DS store set up (with buffers, all the debug MSR flags, etc.) the PC does slow down by about 20%.
    That tells me it's at least partially turned on, but I must be missing something.

    Only targeting WindowsXP 32bit at the moment. Once working perhaps it could be extended to work on everything from XP to Vista, both 32 and 64bit.

    Note that to have this working well, it will probably take some Windows kernel hacks.
    At the very least I will need a kernel context switch hook to have it ON only while in the desired code spaces.
    Out of desperation I RE'ed VTune's driver a bit (not even sure yet whether it uses DS store), and it indeed uses several kernel hooks.

    Any information would be appreciated, in particular anyone from Intel, AMD, etc.

    (Continued in part 3)

  2. #17
    (Part 3)

    More thoughts on this.

    #1 I mentioned game hacking. It's certainly not restricted to that, just something I do (and a lot of others) for fun.

    My main idea for such a tool is a real time reverse engineering tool that could be used in several situations.

    Some more uses:
    1) Code profilers.
    2) Extended real time debuggers.
    3) EXE unpacking tools.
    4) Virtualisation, sandboxing.
    And more..

    On the concept in general: so far I don't find the call hits as useful as I thought they would be. Looking at actual call chains (recording all the "from" and "to" addresses) would probably be more useful.
    This would be a whole different design, and I'm not so sure it could be done in real time. It would probably require dumping it all (potentially thousands of calls per second) and running some sort of post process on the data to make sense of it.
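One possible shape for that post process, sketched under the assumption that the dump is simply a sequence of (from, to) address pairs. It collapses the records into per-edge counts and a callee-to-callers map; the function name and the sample addresses are invented.

```python
from collections import Counter, defaultdict


def build_call_graph(records):
    """Aggregate (from_addr, to_addr) records into edge counts and callee -> callers."""
    edges = Counter(records)
    callers = defaultdict(set)
    for src, dst in edges:
        callers[dst].add(src)
    return edges, callers


if __name__ == "__main__":
    dump = [
        (0x401020, 0x402000),
        (0x401080, 0x402000),
        (0x401020, 0x402000),
        (0x402040, 0x403000),
    ]
    edges, callers = build_call_graph(dump)
    for (src, dst), n in edges.most_common():
        print(f"{src:#x} -> {dst:#x}: {n}")
    print(sorted(hex(a) for a in callers[0x402000]))
```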

    A particular new product of interest is HBGary "Inspector":

    More of it here:

    Not only looking at code in this way, but also at data, etc.
    Unfortunately, I have no government contracts to pay for this research, I have to do it on my/our own.
    Anyone know the price of "Inspector"?
    (Hoglund, come to THAT IRC channel to talk)

    Some data tracking methods here:

    Also mentioned above is the "AMD CodeAnalyst" tool. Anyone know what mechanisms it's using?
    Perhaps the undocumented AMD MSR registers here, and based on the names of some of the labels, perhaps AMD does indeed have its own DS store mechanism or similar:

    (End of Part 3, big post)
    Last edited by Sirmabus; February 15th, 2008 at 15:54.

  3. #18
    dELTA (Administrator, joined Oct 2000)
    Very interesting Sirmabus, I'm looking forward to part 2! ([EDIT] see my next post below)

    What you say about only breaking on each function is true: It increases speed a lot and thus can allow for close-to-realtime tracing, but it is also highly likely to miss some (important) functions, and also likely to crash because of misidentified function entrypoints, and finally we also have the more general problem of code patching/modification having to be done.

    So, this provoked an elaborate idea and possible solution to all this in my mind, although it is quite crazy and also has its disadvantages (mostly speed, but also non-100% accuracy in the case of exceptions and code that is misidentified as data in the static pre-analysis of the executable), but other than that, it's pretty cool... So here it is:

    First of all, the accuracy and stability of the whole thing could be increased by doing it all on basic block level instead of on the function level. This would of course be at the cost of execution time, but as soon as you have some good basic filters (i.e. non-logged basic blocks that you have already discarded as uninteresting) this might actually still be acceptable/useful under some conditions.

    Then comes the really cool part:
    Not counting the possibility of exceptions, a code basic block only has two possible exit-points at most, i.e. taking a conditional jump or not, and in many cases it only has one (an unconditional jump or a static call, while the target of ret instructions and dynamic calls/jumps may have to be resolved dynamically). Adding exceptions to this, it has yet one additional possible exit-target (dynamically speaking).

    So, that leaves us with a maximum of three possible static exit targets from each basic block. Also, some instructions like ret or dynamic jumps/calls (jmp eax/call eax etc) have dynamic exit points which need to be taken care of, along with the quite dynamic exception handler destination (defined by FS:[0] etc) at any given point in the code.

    Well now, how many hardware breakpoints do we have to play with? Yes, that's right, four! This means that if we pre-analyse the code and all its basic blocks in IDA, and dump all this information about the possible exit-points of all basic blocks to disk, we can make a hardware-breakpoint-only based debugger-parasite tool for the target application, which will act according to the following pseudo-code for each basic block (when we enter the code, a hardware breakpoint has just hit, and the first such breakpoint will of course be placed on the global entrypoint of the executable):

    1. Was the just hit breakpoint at the beginning of a basic block? In that case, goto 2, else:
      1. If we get here, the hardware breakpoint that was just hit was not at the beginning of a basic block, and thus it was rather intended to resolve a dynamic block exit target (e.g. a modified FS[0] target or the current ret target, or the current call/jmp eax target etc), so do that.
      2. Place a new hardware breakpoint at the resolved dynamic block exit target.
      3. Return control to the debugged application.
    2. If we get here, the hardware breakpoint that was just hit was at the beginning of a basic block, so start out with dynamically resolving the current exception handler target, and put a hardware breakpoint on that.
    3. For each possible exit target of the current basic block (with a theoretical maximum of two, in addition to the already resolved exception handler exit target), place a hardware breakpoint on it.
    4. If the basic block ends with a "dynamic exit point", or contains a manipulation of FS[0] or on-stack EXCEPTION_REGISTRATION record somewhere in it (which would have to be detected statically before execution, which is the biggest flaw of this entire method of attack I think), put a hardware breakpoint on that too/instead, to be able to resolve it dynamically before it is executed.
    5. Return control to the debugged application.
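The planning step of the pseudo-code above (steps 2 to 4) can be sketched as follows, assuming the static pre-analysis has produced one record per basic block. The `BasicBlock` fields and the function name are hypothetical; resolving the dynamic targets and actually writing the debug registers are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BasicBlock:
    start: int
    static_exits: List[int] = field(default_factory=list)  # at most two targets
    dynamic_exit_at: Optional[int] = None  # address of a ret / call eax / FS:[0] write


def plan_breakpoints(block: BasicBlock, seh_handler: int) -> List[int]:
    """Addresses to load into the debug registers while `block` executes."""
    if len(block.static_exits) > 2:
        raise ValueError("a basic block has at most two static exit targets")
    plan = [seh_handler]                      # step 2: current exception handler
    plan.extend(block.static_exits)           # step 3: static exit targets
    if block.dynamic_exit_at is not None:     # step 4: stop on the dynamic exit itself
        plan.append(block.dynamic_exit_at)
    return plan                               # never more than four entries


if __name__ == "__main__":
    cond = BasicBlock(start=0x401000, static_exits=[0x401020, 0x401040])
    ret = BasicBlock(start=0x401040, dynamic_exit_at=0x40104F)
    print([hex(a) for a in plan_breakpoints(cond, seh_handler=0x7C839AA8)])
    print([hex(a) for a in plan_breakpoints(ret, seh_handler=0x7C839AA8)])
```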

    With the exception of sneakily modified exception handler targets, we can now trace the entire application by only using those four cheaply allotted hardware breakpoints!
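For the four breakpoints themselves, the enable bits go into DR7. A small sketch of that encoding for execution breakpoints, using the documented x86 layout (local enable for slot n at bit 2n, with the type and length fields for slot n starting at bit 16 + 4n and left at zero for "break on execute"). The function name is made up, and writing DR0 to DR3 and DR7 would go through SetThreadContext() or kernel code, not shown here.

```python
def dr7_for_execute_breakpoints(slots) -> int:
    """DR7 value enabling local execution breakpoints in the given slots (0-3)."""
    dr7 = 0
    for n in slots:
        if not 0 <= n <= 3:
            raise ValueError("x86 has only four debug address registers")
        dr7 |= 1 << (2 * n)              # Ln: local enable for slot n
        dr7 &= ~(0xF << (16 + 4 * n))    # RWn = 00 (execute), LENn = 00 (required for execute)
    return dr7


if __name__ == "__main__":
    print(hex(dr7_for_execute_breakpoints([0, 1, 2, 3])))
    print(hex(dr7_for_execute_breakpoints([2])))
```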

    So you say, why is this better than single-stepping?

    • First of all, no single stepping flag that can be detected by the target application will ever be set.
    • It will be much faster than single stepping in most cases.
    • Contrary to single stepping, filters can be used to include or exclude any exact code areas that you want, and also in a fully dynamic fashion.
    • Yes, hardware breakpoints can be detected too, but they are much less likely to be detected than software breakpoints since they don't modify any code or data in the memory, AND if they are just removed by the target application when they are presumably detected, you will know in which exact code basic block this was done, and will most likely be able to easily and quickly "fix" this little problem and continue.

    So, doesn't that sound at least a little cool, or what?
    "Give a man a quote from the FAQ, and he'll ignore it. Print the FAQ, shove it up his ass, kick him in the balls, DDoS his ass and kick/ban him, and the point usually gets through eventually."

  4. #19
    Cool idea, but still I think the fundamental problem is accurate preprocess analysis.
    It will be only as accurate as you can break apart the code before hand.
    Imagine if the code was obfuscated, etc.: you would need a very good analyzer and/or some sort of emulator.
    And why have this extra step if you don't need it?

    (sorry we're a little out of sync, I was writing part 2 or 3 when you posted).

    If we could get the "DS store" working in Windows (and any other OS wanted) then it would be totally outside the process.
    Note too this is a different dynamic. Since it would be buffered, you are getting the "call" (branches too if you want) after the fact.
    With the exception way you could catch the call and do something else. Even doing a different type of hook for example, although not very practical.
    Last edited by Sirmabus; February 15th, 2008 at 15:54.

  5. #20
    dELTA (Administrator, joined Oct 2000)
    Aw crap, while I was posting my reply to Sirmabus' "part 1", he posted parts 2 and 3 in between, and to make things even worse, he announced a dream tool in those posts that made my design above look pretty lame (but then again, you gotta give me some points for creativity, and also for doing it without any fancy schmancy extra custom processor features).

    But anyway, back to Sirmabus' tool... OMFG that is so unbelievably cool!!!!!1111!!

    Will you release this tool soon Sirmabus? Will you release its source code too? That would really be an extremely welcome contribution to the reversing community I think, and it would hopefully also result in more help for you to finish this project in the best way possible too!

    I have already created a CRCETL entry for it in anticipation of its arrival:

    Please update it with any news.

    And not to be too over-enthusiastic or anything, but it would really seem like YOU ARE DA FUXORING MAN!

    Also, the following thread might be of interest, where L. Spiro (author of Memory Hacking Software (MHS)) mentions that he has had one similar feature almost ready in his excellent tool:

    Hey, L. Spiro, is that feature fully implemented and included in current versions of MHS? You mention that you had a problem before because your tool did not have any kernel components, but if I'm not mistaken, it does now, right?

    Oh, and for reference the following thread is also quite related (looking back at it now, Sirmabus even mentions this exact tool in it, but in more secretive words!):

    That thread in turn references the following thread, which is also related to the same topic/issue:

  6. #21
    Thanks for your encouragement.
    I'll hit the Intel and AMD dev boards in hopes they'll give some answers.

  7. #22
    As dELTA says, having such a tool would be a very good thing for RCE purposes.


  8. #23
    dELTA (Administrator, joined Oct 2000)
    Ok, sounds great Sirmabus, please keep us posted on the progress of this tool, and also please feel very free to return here with any questions that we can be of assistance with!

    I'm actually pretty sure there are a bunch of people on this board who would be able to assist you with this problem if we can just find them and make them read this thread...

    Come on people, anyone? I know you're there, so why don't you just be a good chap and lend a hand now!?

  9. #24
    Kayaker (Teach, Not Flame; joined Oct 2000)
    Originally posted by Sirmabus:
    "It really needs a kernel hook on context switching to turn on and off depending on what process the context is in, etc."
    Intriguing work Sirmabus. Re the context switch, there's probably a good reason why what I'm about to say wouldn't work, else KAV might have done it already instead of using a crappy SwapContext hook. Is there any way to safely set a hardware rw breakpoint on _KPCR+124 (or FS:[124] or _KPRCB.CurrentThread if you wish)? This being the field which is constantly updated on context switches.

    I tried that in Softice actually, bpm FFDFF124 rw. It sort of worked, it would break within some ntice.sys function when the field was accessed, but after a few times I was greeted with the inevitable BSOD. I was just wondering if it might work without Softice in the picture.

  10. #25
    Humm, interesting. Don't see why a code HWBP in the kernel wouldn't work also.
    But, maybe not a good idea to put an exception in an area that is probably very performance intensive.

    I was just thinking of a typical binary code patch.
    You can see some examples of a ntoskrnl.exe "SwapContext()" hook in the "Tron", and some ARTeam source code, etc.
    Last edited by Sirmabus; February 17th, 2008 at 01:25.

  11. #26
    Kayaker (Teach, Not Flame; joined Oct 2000)
    If you can freely set and remove kernel breakpoints then you could also use the trick I mentioned in this thread. Some kind of SYM support would be needed to get the proper address unless you use a (possibly OS version dependent) pattern search.

    This being straight from the Softice manual itself (in the example 0xFF8B4020 is the ETHREAD you want to break on)

    Watch a thread being activated:
    bpx ntoskrnl!SwapContext IF (edi==0xFF8B4020)

    Watch a thread being deactivated:
    bpx ntoskrnl!SwapContext IF (esi==0xFF8B4020)

    This works because of the calling function @KiSwapContext where you can see how EDI and ESI are changed.

    :00404DB2 @KiSwapContext@4 proc near
    :00404DC4   mov     ebx, ds:0FFDFF01Ch ; PKPCR SelfPcr
    :00404DCA   mov     esi, ecx
    :00404DCC   mov     edi, [ebx+124h] ; Processor Control Region (KPCR) + 124h
    :00404DCC                           ; aka FS:[124]
    :00404DCC                           ; new (current) ETHREAD pointer
    :00404DD2   mov     [ebx+124h], esi ; old ETHREAD pointer
    :00404DD8   mov     cl, [edi+58h]
    :00404DDB   call    SwapContext

    The breaks would be your cue to turn your tracer on/off or whatever. It will of course break a lot.
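The on/off logic driven by those breaks could be as simple as the following sketch, which models the context switch as a callback receiving the outgoing and incoming thread objects. Which register holds which thread should be confirmed against the disassembly of the kernel build in question; the class name and the thread addresses here are placeholders.

```python
class TraceGate:
    """Keeps branch tracing enabled only while a target thread is scheduled."""

    def __init__(self, target_threads):
        self.targets = set(target_threads)
        self.enabled = False
        self.switches = 0

    def on_swap(self, outgoing: int, incoming: int) -> bool:
        """Called on every context switch; returns the new tracing state."""
        self.switches += 1
        # In a driver this is where the debug control MSR would be rewritten.
        self.enabled = incoming in self.targets
        return self.enabled


if __name__ == "__main__":
    gate = TraceGate({0xFF8B4020})
    for old, new in [(0x81000000, 0xFF8B4020), (0xFF8B4020, 0x82000000)]:
        print(hex(new), gate.on_swap(old, new))
```

Because this runs on every thread switch in the system, the check itself has to stay as cheap as a single compare.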

    A regular inline code hook would seem to require less overhead (in some ways) and probably wouldn't slow down the system as much, but it too has drawbacks (OS version dependence if a byte pattern search is required, possible incompatibility with the KAV SwapContext hook, can't unload your driver unless you can guarantee no thread switches occur during unhooking, etc.)

  12. #27
    dELTA (Administrator, joined Oct 2000)
    Very interesting ideas, I hope they will lead to an optimal solution for this problem!

    Btw Sirmabus, will you post the links to your Intel/AMD developer board threads about this issue here, for reference, and so that possibly even more people can read them in full and help?

    Also, RolfRolles just posted a blog entry very related to this topic, introducing a cool tool. Still instrumentation based, so not playing in the same league as the Sirmabus tool currently being discussed here, but still cool, and a nice example of yet another tool in this area, DynamoRIO, and its plugin architecture!

  13. #28
    Super Moderator (joined Dec 2004)
    well if you are a nix geek and can find the kernel components, compile kernel modules and do the insmod / sudo su install stuff, you could take a look at Hitachi's btrace implementation

    /* The development of this program is partly supported by IPA */
    /* (Information-Technology Promotion Agency, Japan). */

    /* bt_main.h - branch trace module header */
    /* Copyright: Copyright (c) Hitachi, Ltd. 2005-2007 */
    /* Authors: Yumiko Sugita, */
    /* Satoshi Fujiwara */
    this was talked about in linux symposium

    the bunzip can be downloaded from sourceforge

    also the vger kernel mailing lists have a discussion on the btrace implementation under the linux ptrace apis (look for the discussion between Ingo Molnar and markus from intel gmbh)

    and if you are not averse to download beta nightly builds i think you can glean a few ideas from ptrace.c regarding btrace ds:save area setups etc

    btw congrats on your posts on the intel dev board: if you query DS SAVE area, google returns you on the first page, first hit, first link (or maybe, for lack of information on the subject matter, google has to rank your forum question #1)

  14. #29
    Thanks for the info. Good to talk with you again about the subject.

    I currently don't have a nix dev setup but contemplate building one.
    Having the source for the kernel, being able to modify and build it could be a big help.

    Hopefully, someone very knowledgeable in the area will show up.

    Probably just have to get back in and play around until I find the hardware flag or setting I'm missing.
    Reminds me, I used to work on console games back in the early 90's. What we had to do then was read the bare tech manuals and play around with the settings until we got our hardware blitters, etc., working (anyone remember those days?). I'm just getting too lazy :-P

  15. #30
    Super Moderator (joined Dec 2004)
    if you are contemplating setting up one and would prefer an almost clone of windows (i mean clickety click with 100s of preinstalled toys) i can suggest you ubuntu but getting ubuntu kernel sources (they are not vanilla kernel available at is kinda tedious ) (it doesnt come with even gcc preinstalled (real doze style you need to apt get install gcc headers) to compile even a simple Hello World

    and thats one dvd full of installable os (get the alternate install cd or iso or dvd so that if the installer borks you can attempt to manually install it from console root using some virtual environment; contrary to the claims of a minimum requirement of 256 mb spare ram capacity, with the alternate install method you can allot and successfully install a working vm image with as low as 32 mb allocation)
    and at the lowest extreme you can look at damn small linux at just ~ 50 mb os (fully functional and expandable )

    every one of these distros works fine in vmware or yes even microsofts virtual pc or on the open source alternatives like VirtualBox etc

    thanks and im also glad that we are talking again on the subject as well
