Control Flow Deobfuscation Part 1

Rate this Entry
Control Flow Obfuscation is one of the biggest protection mechanisms used in commercial protections for intermediate languages like IL (.NET) and Java bytecode. It is also used a bit in some native protectors, mostly to hide their own code, but I won't go into that. I'll use pseudocode for the examples.

The basic idea behind control flow obfuscation is that spaghetti code is hard to read. So the protectors transform normal code into spaghetti code, with lots of spurious branches. You can repair this by hand, but as functions get bigger this becomes a huge PITA. So we must write a program to do it for us.

I'll assume we have access to the bytecode in the program. Also we'll make use of some primitive operations on this bytecode. These problems are straightforward to tackle, all you need is the documentation and some (much ) time.

To reorganize the function we will use a two-step approach:
  1. Make a Control Flow Graph from the bytecode.
  2. Make bytecode from the CFG.

You might think "That doesn't do anything!", but the trick is that we read in bytecode that is obfuscated and write out bytecode that is simple to read. Both form the same control flow graph, so the meaning is not changed.

English example:
  • Obfuscated: "The bike, that is owned by John, who has blue eyes, needs to be repaired."
  • Graph:
  • Simplified: "John has blue eyes. His bike needs to be repaired."

Of course this isn't a perfect analogy but I hope you see what I mean.

Building the CFG
Let's see what we want from the CFG:
  • Have vertices with some amount of bytecode
  • Have edges that show the connection between vertices.
    • E.g. if vertex 1 branches to vertex 2 we want an edge from 1 to 2. Note that this edge has a direction: it doesn't go from 2 to 1.
    • We need a label on the edge that shows the type of connection. For instance if vertex 1 branches to vertex 2 if some condition is true, and to vertex 3 if it is false, we would have this graph:
  • Every vertex has one entry point and one exit point.

(if this is all gibberish for you, please read up on graphs before continuing)

The last point ensures our CFG is a proper Directed Graph.
In practice you'll want to keep the edge information with the vertex it comes from, but this changes nothing to the basic ideas.

With this info, we can make a simple recursive function to construct the CFG:
already_done_vertices = empty
cfg_of_function = makecfg(entry_point_of_function)

function makecfg(current_instruction):
    vertex myvertex

    if current_instruction is in already_done_vertices
        myvertex = the appropriate vertex in already_done_vertices

        myvertex.bytecode   = bytecode of current_instruction
        myvertex.branchtype = branch type of current_instruction
        myvertex.edges    = empty
        add myvertex to already_done_vertices
        for each target of current_instruction
            add makecfg(target) to myvertex.edges

    return myvertex
Note that the bytecode of a branch is nothing. Its only significance is in the branchtype and the targets. Most instructions have a branch type of "always" with a target that is the next instruction. The branch type relates to the targets, f.i. "top_of_stack != 0" may go to the first target if true, or the second target if false.

Of course in a real implementation you want to keep the number of vertices and edges low, so you lump together as many instructions as you can without breaking the rules, splitting them when necessary.

Well, that's it for now, in part 2 (and maybe 3) I'll show how to make bytecode from the CFG.

Submit "Control Flow Deobfuscation Part 1" to Digg Submit "Control Flow Deobfuscation Part 1" to Submit "Control Flow Deobfuscation Part 1" to StumbleUpon Submit "Control Flow Deobfuscation Part 1" to Google