fr33ke

Control Flow Deobfuscation Part 3

Rating: 2 votes, 1.50 average.
Now we have:

So to deobfuscate we only need one more thing:
  • A way to order the vertices


Interestingly, we could also make a obfuscator in this way. All we have to do is put the vertices in a random order.

So to deobfuscate we can't just use any ordering. We have to use one that is 'intuitive' and 'natural'. Unfortunately these words mean nothing to a computer; we'll have to specify them more carefully.

Let's start simple. In which order would you place these vertices if you wanted to deobfuscate?

I think we can agree that 1-2-3 is he right order.
3-1-2 is bad because we have to start with 1: this tells everyone that the function starts there.
1-3-2 is bad too, but why? For one it creates more branches than necessary, but I think this is not the reason it is hard to read. I think the reason is that we go backwards from 2 to 3. Most of us like to read in one direction: from top to bottom or from left to right.

Another one:

The right order here is 1-2-4-3 or 1-4-2-3.
1-2-3-4 is wrong because 4 then goes backwards to 3.

We can say: the order is right iff for all edges, the start of the edge comes before the end of the edge. This ordering is known as a topological ordering. There are various ways to compute it; it is important to remember that one is the reverse postorder of a depth-first traversal.

This might all sound like gibberish to you, so feel free to take a pause and look up some of the underlined words. Here's a little pseudocode example for the reverse postorder:
Code:
reverse_postorder(vertex):
    order = []
    for each child in vertex.edges
        order = reverse_postorder(child) + order

    order = vertex + order
    return order

Well, so far so good. But what to think of this one:

I'd say that the best ordering is 1-2-3-5-4. Some compilers order this as 1-5-2-3-4. Even 1-3-5-2-4 isn't unreasonable. But according to our ordering rules, they are all wrong.

If your graph has a cycle, it isn't possible to get a perfect ordering. We can't have 2 before 3, 3 before 5 and 5 before 2 at the same time, so we'll have to go backwards at least one time.

Something like 1-4-2-3-5 is even more wrong though. Now we have two backward edges, 3 to 4 and 5 to 2.

One way to solve the problem is like this:

We contract the vertices of the strongly connected component (a generalization of cycles) into one "supervertex". Now we order that and get 1-I-4. Then we order I: first we pick one vertex to start, I like to get the one that has most edges to it from outside I, but it doesn't matter much. In this example we pick 2. Then we order the rest (and if we find sub-SCC's in I we recurse). So here we get 3-5. Then the final order is 1-2-3-5-4.

Now thinking that out is one thing, implementing it is another. The most obvious problem is how to find the SCC's. Luckily some very smart people already figured it out for us. We'll use Tarjan's strongly connected components algorithm. It is a bit hard to understand, but http://www.ics.uci.edu/~eppstein/161/960220.html explains it very clearly.


So, are we done now? Is the following algorithm enough:
  1. Find all SCC's
  2. Collapse them into supervertices
  3. Order the resulting graph
  4. Apply this algorithm to all supervertices in the result

Yes, in a certain sense. It works fine, but it's quite a lot to code (especially in C) and it's pretty inefficient.


/ this is the clever part
Now both Tarjan's algorithm and the reverse postorder are based on depth first search. So maybe it's possible to combine them? It turns out we can indeed do this and save ourself a lot of work. The only modification we need to do is:
  • Whenever we find an SCC, we check if it is trivial (one vertex). If so, we add it to the order. Else we order the SCC and put the result in the order.


So the final algorithm is:
Code:
get_order(first_vertex, vertices_to_consider):
    /* Define some variables */
    order = []
    stack = empty_stack
    cur_dfsnum = 0
    index = []
    low = []

    /* Nested function, has access to above variables (closure) */
    visit(cur_vertex):
        index[cur_vertex] = cur_dfsnum
        low[cur_vertex] = cur_dfsnum
        cur_dfsnum++
        push cur_vertex on stack

        for each child in cur_vertex.edges where child in vertices_to_consider
            if index[child] == "To be done"
                visit(child)
                low[cur_vertex] = min(low[cur_vertex], low[child])
            else if index[child] == "Done"
                /* Do nothing */
            else
                low[cur_vertex] = min(low[cur_vertex], index[child])

        if low[cur_vertex] == index[cur_vertex]
            /* we found an SCC */
            scc = []
            do
                popped = pop from stack
                scc += popped
                index[popped] = "Done"
            until popped == current_vertex

            if scc.length == 1
                order = cur_vertex + order
            else
                order = get_order(choose_first(scc), scc) + order
    /* visit() ends */



    for vertex in vertices_to_consider:
        index[vertex] = "To be done"

    /* Special-case the start vertex to prevent infinite recursion */
    index[first_vertex] = "Done"
    for each child in first_vertex.edges where child in vertices_to_consider
        if index[child] == "To be done"
            visit(child)
    order = first_vertex + order

    return order
I omitted the choose_first function because it isn't important.


Now we have every part, putting it together is simple:
Code:
de_flow_obfuscate(bytecode):
    already_done_vertices = empty
    cfg = makecfg(bytecode)
    order = get_order(cfg, already_done_vertices)
    return rebuild_from_order(cfg, order)

Well that's it mostly. I wanted to show how to do this so maybe someone doesn't have to spend weeks figuring this stuff out and because I think
Code:
reverse postorder + Tarjan's strongly connected components algorithm = approximate topological ordering
is a very nice insight.

P.S. Sorry, my tool to do this for .NET is currently private because I'm not interested in an arms race. The ideas behind it are more important anyway

Submit "Control Flow Deobfuscation Part 3" to Digg Submit "Control Flow Deobfuscation Part 3" to del.icio.us Submit "Control Flow Deobfuscation Part 3" to StumbleUpon Submit "Control Flow Deobfuscation Part 3" to Google

Categories
Uncategorized

Comments