# fr33ke

#### Control Flow Deobfuscation Part 3

by

, June 5th, 2008 at 06:09 (10535 Views)
Now we have:

- A way to decompose the bytecode of a function into its CFG
- A way to make bytecode from the CFG if we know the order of the vertices

So to deobfuscate we only need one more thing:

- A way to order the vertices

Interestingly, we could also make a obfuscator in this way. All we have to do is put the vertices in a random order.

So to deobfuscate we can't just use any ordering. We have to use one that is 'intuitive' and 'natural'. Unfortunately these words mean nothing to a computer; we'll have to specify them more carefully.

Let's start simple. In which order would you place these vertices if you wanted to deobfuscate?

I think we can agree that 1-2-3 is he right order.

3-1-2 is bad because wehaveto start with 1: this tells everyone that the function starts there.

1-3-2 is bad too, but why? For one it creates more branches than necessary, but I think this is not the reason it is hard to read. I think the reason is that we gobackwardsfrom 2 to 3. Most of us like to read in one direction: from top to bottom or from left to right.

Another one:

The right order here is 1-2-4-3 or 1-4-2-3.

1-2-3-4 is wrong because 4 then goes backwards to 3.

We can say: the order is right iff for all edges, the start of the edge comesbeforethe end of the edge. This ordering is known as atopological ordering. There are various ways to compute it; it is important to remember that one is thereverse postorderof adepth-first traversal.

This might all sound like gibberish to you, so feel free to take a pause and look up some of the underlined words. Here's a little pseudocode example for the reverse postorder:

Code:reverse_postorder(vertex): order = [] for each child in vertex.edges order = reverse_postorder(child) + order order = vertex + order return order

Well, so far so good. But what to think of this one:

I'd say that the best ordering is 1-2-3-5-4. Some compilers order this as 1-5-2-3-4. Even 1-3-5-2-4 isn't unreasonable. But according to our ordering rules,they are all wrong.

If your graph has a cycle, it isn't possible to get a perfect ordering. We can't have 2 before 3, 3 before 5 and 5 before 2 at the same time, so we'll have to go backwards at least one time.

Something like 1-4-2-3-5 is even more wrong though. Now we have two backward edges, 3 to 4 and 5 to 2.

One way to solve the problem is like this:

Wecontractthe vertices of thestrongly connected component(a generalization of cycles) into one "supervertex". Now we order that and get 1-I-4. Then we order I: first we pick one vertex to start, I like to get the one that has most edges to it from outside I, but it doesn't matter much. In this example we pick 2. Then we order the rest (and if we find sub-SCC's in I we recurse). So here we get 3-5. Then the final order is 1-2-3-5-4.

Now thinking that out is one thing, implementing it is another. The most obvious problem is how to find the SCC's. Luckily some very smart people already figured it out for us. We'll useTarjan's strongly connected components algorithm. It is a bit hard to understand, but http://www.ics.uci.edu/~eppstein/161/960220.html explains it very clearly.

So, are we done now? Is the following algorithm enough:

- Find all SCC's
- Collapse them into supervertices
- Order the resulting graph
- Apply this algorithm to all supervertices in the result

Yes, in a certain sense. It works fine, but it's quite a lot to code (especially in C) and it's pretty inefficient.

/ this is the clever part

Now both Tarjan's algorithm and the reverse postorder are based on depth first search. So maybe it's possible to combine them? It turns out we can indeed do this and save ourself a lot of work. The only modification we need to do is:

- Whenever we find an SCC, we check if it is trivial (one vertex). If so, we add it to the order. Else we order the SCC and put the result in the order.

So the final algorithm is:

I omitted the choose_first function because it isn't important.Code:get_order(first_vertex, vertices_to_consider): /* Define some variables */ order = [] stack = empty_stack cur_dfsnum = 0 index = [] low = [] /* Nested function, has access to above variables (closure) */ visit(cur_vertex): index[cur_vertex] = cur_dfsnum low[cur_vertex] = cur_dfsnum cur_dfsnum++ push cur_vertex on stack for each child in cur_vertex.edges where child in vertices_to_consider if index[child] == "To be done" visit(child) low[cur_vertex] = min(low[cur_vertex], low[child]) else if index[child] == "Done" /* Do nothing */ else low[cur_vertex] = min(low[cur_vertex], index[child]) if low[cur_vertex] == index[cur_vertex] /* we found an SCC */ scc = [] do popped = pop from stack scc += popped index[popped] = "Done" until popped == current_vertex if scc.length == 1 order = cur_vertex + order else order = get_order(choose_first(scc), scc) + order /* visit() ends */ for vertex in vertices_to_consider: index[vertex] = "To be done" /* Special-case the start vertex to prevent infinite recursion */ index[first_vertex] = "Done" for each child in first_vertex.edges where child in vertices_to_consider if index[child] == "To be done" visit(child) order = first_vertex + order return order

Now we have every part, putting it together is simple:

Code:de_flow_obfuscate(bytecode): already_done_vertices = empty cfg = makecfg(bytecode) order = get_order(cfg, already_done_vertices) return rebuild_from_order(cfg, order)

Well that's it mostly. I wanted to show how to do this so maybe someone doesn't have to spend weeks figuring this stuff out and because I thinkis a very nice insight.Code:reverse postorder+Tarjan's strongly connected components algorithm= approximatetopological ordering

P.S. Sorry, my tool to do this for .NET is currently private because I'm not interested in an arms race. The ideas behind it are more important anyway