View Full Version : Any books on breaking simple codes found in malware?

May 13th, 2011, 19:45
I'm interested in simple cryptanalysis techniques that can be used to decode rot, vigenere, xor or whatever else malware might use. I know Perl, but math is not my area at all. I'm just looking for simple material on breaking codes used in malware.

The below books look interesting, but I'm not sure how relevant they are to current methods malware uses for encoding.

Codebreaker: The History of Codes and Ciphers

Between Silk and Cyanide: A Codemaker's War, 1941-1945

May 16th, 2011, 07:24
I've not read the books you have link above,I do know a book that talks about cryptovirlogy(it assumes you know cryptography)

There is another book you might wanna read "The Cryptoclub Using Mathematics to Make and Break Secret Codes" is a book an introduction to codes and ciphers for beginners or for middle schools

May 16th, 2011, 14:45
Wow, those books seem to be just what I was looking for. Thanks esther!

May 17th, 2011, 20:54

Here's a small example of a BlowFish algo as found in a malware. Not too sure how useful it might be, but..



May 31st, 2011, 08:13
Thanks, I also found National.Geographic.Code.Breakers online which was pretty interesting if anyone wants to check it out.

June 2nd, 2011, 19:10
Hi Sunk,

Just to share some quick wins handling encoded or encrypted malicious data that I have seen thus far.

Single Byte XOR - use brute force or WinHex histogram binning feature. Normally, most of the bytes in a MZ file are zero. If it is XOR encoded, the zeros will become that XOR key. i.e. key XOR zero equals key. This is especially useful in identifying encoded MZ file to be dumped by a dropper PDF or doc.

XOR Chaining - need to RE to identify the start of the encoded string. Sometimes it does forward or reverse chaining. E.g. For i = 2 to length, Data [I] ^= Data [i -1]

B64 - to recognize it, it looks like a non-language long string made of alphabets and numbers (sometimes with = or == appended). It is normally used when transmitting data to CnC. Do take note, the encoding scheme may be customised. You can look out for special strings in MZ used as encoding schemes such as "ABC ... YZabc ...y0123...9+/" or "abc ...yABC ... YZ0123...9+/." see wiki for standard string. http://en.wikipedia.org/wiki/Base64

RC4 - some RE is required. It is very easy to implement, and sbox is not required. Take note of loops that fill an array with 0,1,2,3, ... ,254,255. And, in another loop, it swaps and XOR at modulus 256 ring. See http://en.wikipedia.org/wiki/RC4 for implementation.

MD5 - it usually appear in the length of 32 hexascii (16 bytes). If you RE, you will see 3 functions supporting this - init, update and final. See java example. http://people.csail.mit.edu/cwo/projects/netcom_javadoc/netcom/util/MD5.html

Feistal Cipher and AES - examples of Feistal cipher are 3DES, DES, BlowFish... look out for length of input and output. You need to RE for the key. The tell tail signs are the SBOXes unless they customize, which is rare but can be easily done.

Hope this helps.

June 2nd, 2011, 19:53
@eiploader NICE! Your post helps a lot, thanks!

Just out of curiosity, how did you learn how to deal with encoded/encrypted malware? Is there a single source you can recommend, or was it knowledge you just picked up here and there over a period of time?

June 2nd, 2011, 21:25
Hi Sunk,

Glad to hear that the post is helpful. I used to do quite a fair bit of cyptography and these accumulated experience has probably help in my current work.

"Malware Analyst's Cookbook and DVD" talks a little bit about obfuscation and encoding found in malware. Other chapters are quite useful too.

"Applied Cryptography" is what I used during school days and previous job. This book is quite in-depth into algorithm implementation, which is helpful if you are doing RE.

June 3rd, 2011, 12:55
Awesome, I'll check out those books, thanks again!

July 3rd, 2011, 04:43
Sorry to interrupt another thread so late. Personally i would recommend reading some of Eric Filiol's 2010 conference papers. It's probably related to this thread.

@epiloader: But just a quick question. How do you determine that the S-Box is indeed for Blowfish and not for other Block Cipher provided they are customised?
As for Block Cipher, it will usually pad it to a block as the name suggested whereas for stream cipher. You give it 12bytes, you get back 12bytes for output.

Correct me if i am wrong anywhere.
I love learning.
This is an indeed interesting topic.

[ Gunther ]

July 3rd, 2011, 19:43
Hi Gunther,

Thanks for posting such a great question.
1. In the BlowFish example, it is actually using Feistel Cipher, which of course can be easily customized through modifying the F function. In cryptography, encryption primarily involves permutation (rearranging) and substitution (mapping), and one can modify the P or S of known function.

To help in recognizing the type of encryption used, I targeted in examining the SBOX, the mapping table used. In BlowFish, there are 4 SBOXes with 256 entries (32 bits per entry) in each box. And with that known tables, I can use it to juxtapose against the hardcoded values found in Malware. As mentioned, this is only a tell tail sign. One can customize other part of Round Function as long as function's properties are fulfiled. However, we could be confirmed by 2 following ways:
** *A) reverse the entire function and check the implementation of permutation and substitution. But not all reversers are crypto-trained to interpret the crypto functions;
** *B) reverse to get the key and a cipher. Try that out using readily available scripts.

Just an observation, there are 4G (2^32) choices to fit into 4x256 entries. Hence, the likelihood to use the same SBOX is low, unless the block cipher is customized using BlowFish (in this example).

2. The input and output length that I refers to is the I/O size of function of each block. E.g. AES takes in 2 inputs, key and data block. The key can be 128,192,256,... While the data block is 128 bits. The output is also 128bits. From here, there could be a chance that AES is used.
DES takes in 8 bytes key (but only 56 bits used) and data block size is 64 bits.
MD5, SHA1 and SHA2 has output length of 128, 160 and 256 bits respectively.
With that we can quickly guess the type of cipher algorithm used, but with no 100% guarantee. This is only one way to probe a black box.

The reason why SBOX values and length of I/O block are used is because it is a quick win to find out the type of algo used in malware.

Hope this clarifies my rationale.

Appreciate your question.

July 4th, 2011, 07:06
Sorry to reply so late as i just came back minutes ago. Correct me if i am wrong as i'm not a crypto-trained reverser.
But Blowfish is not the only one with 8bits input and 32bits output.
Snefru, Blowfish, CAST, and SQUARE all use 8bits input with 32bits output S-boxes.

For point 2, maybe it's easier for others to know that An S-box with p input bits
and q output bits is denoted p → q
S-boxes are designed for software implementation on 8-bit processors. The block ciphers with 8 → 8 S-boxes are SAFER, SHARK, and AES.
DES uses eight 6 → 4 S_boxes.

However, the way a crypto function is implemented is certainly different for every cipher.
I usually extract out the decryption algorithm and convert it into pseudo code and check against my crypto book to see which matches the closet.

But this tool is great for some instances though. Not sure whether you have used it before.

My pleasure to exchange pointers.
There are better crypto people out there and they are probably laughing at my knowledge. :P

[ Gunther ]

July 4th, 2011, 08:26
Hi Guther,

Thanks for sharing your thoughts and insights.

Regarding the statement that you have made earlier, "Snefru, BlowFish, CAST, and SQUARE all use 8bits input with 32bits output S-boxes."

I agree that it is not uncommon to see different ciphers using 8 bits input with 32 bits output S-Box, however, it is uncommon to see different cipher sharing the same SBox. Therefore, the point that I am driving is that the hard-coded values in the mapping table (SBox values) are unlikely to be the same in value and position. Hence, we could use it for clues.

Why do I say the chance is low to have exactly the same SBox?
For example in a case where the a SBox has 256 entries with 32 bits values each, the permutation is 2^32 x (2^32-1) x ... x (2^32 - 255) ways. To have the same permutation, the chance is approximated to be 1 in 2^(32x256).

For example, the first SBox value in BlowFish is D1310BA6, as compared to the first SBox value in CAST-128 is 30fb40d4.

Simply "googling" these hard coded values, you can easily find implementation that uses these values.
See example:

Guess what is the algorithm used if you see 0x63, 0x7C, 0x77, 0x7B in a SBOX?
Answer: http://www.google.com.sg/search?q=0x63%2C+0x7C%2C+0x77%2C+0x7B

And that's the reason why I say that they are tell tail sign to identifying. At least, it gets you started somewhere.


Probably, the best way to confirm it to try it against a suspected implementation. Checking against textbook is viable, but not always easy. Some implementations are easy to reverse into pseudo code, but very difficult to map it to crypto book due to the optimization of the implementation.

My pleasure to exchange thoughts too.

July 4th, 2011, 08:39
Yes, this is the answer i'm waiting for.

[ Gunther ]

July 4th, 2011, 09:13
Thanks for sharing your thoughts Gunther!
Appreciate that.