
Thread: BGL (babylon glossary) to GLS (babylon glossary source).


  1. #1
    acidmelt
    Guest

    BGL (babylon glossary) to GLS (babylon glossary source).

    hello reversers, a while back i tried to reverse the babylon BGL format so i would be able to read data out of it and use it in my own personal project. this task, however, is beyond my very noobish debugging skills and, obviously, i failed.

    i wasn't sure where this post fits best: it was either advanced reversing (since this requires pretty advanced practices) or the mini project area. feel free to move it.

    anyways, this is the data i have gathered so far:
    the decryption algo can be found in the "babylon" program itself (http://www.babylon.com)
    the encryption algo can be found in the "babylon builder" program, which is used to write dictionaries and is publicly available (http://www.babylon.com/builder)
    -----------
    there is zero documentation about this format available on the net.
    i've found this page (http://fjolliton.free.fr/babytrans/) which asserts that the new babylon bgl format is encrypted using the "Cipher Square" algorithm (http://www.esat.kuleuven.ac.be/~rijmen/square/).
    -----------
    after examining a few *.bgl files it is clear that the first 8 bytes of the file are the signature.
    i've checked wotsit.org for documentation and found nothing.

    in a recent thread (http://www.woodmann.com/forum/showthread.php?t=6934) bilbo suggested this as a project, so i thought i'd start this thread and see what happens.

    what do you say?
    I promise that I have read the FAQ and tried to use the Search to answer my question.

  2. #2
    son of Bungo & Belladonna bilbo's Avatar
    Join Date
    Mar 2004
    Location
    Rivendell
    Posts
    310
    In my opinion, that would be a nice, genuine RE activity, and not related to software stealing...
    You have my support, as long as I have time...!

    For the moment, I'll tell you how I would start...

    (1) Install Babylon (I have 5.0.1 r7; I don't know if it is the latest) and focus on one BGL you have installed.
    The program is not compressed or protected against debuggers in any way.
    (2) Menu->Glossaries->Glossary Options, and remove your BGL
    (3) Attach to Babylon.exe with your preferred debugger and set a breakpoint on CreateFileA / ReadFile
    (4) Menu->Glossaries->Install glossary from disk, and reinstall your target BGL
    (5) The debugger will break at the start of the API: on the stack you will find the return address and the BGL file name.
    (6) ... no time for now to go on...

    Best regards, bilbo
    Non quia difficilia sunt, non audemus, sed quia non audemus, difficilia sunt.[Seneca, Epistulae Morales 104, 26]

  3. #3
    acidmelt
    Guest
    hey bilbo, i tried your suggestion for debugging the app (using olly) and ran into some rather strange behaviour: at startup babylon seems to iterate through all the files inside %windir%\fonts and open each one of them. i don't see any reason for that.
    anyways, i stepped over the code searching for the right CreateFile() and wasn't able to find any call that opens a *.bgl.

    another problem was that as soon as i go into glossaries->add glossaries olly reports a memory access violation. i'd guess that babylon has some sort of anti-reversing protection.

    as i said, my debugging skills are very limited and i would be glad if you (bilbo) or any other experienced reverser would take a look at that.

    oh, and one last thing: judging by the ram usage and the speed of searching i assume that the glossaries are loaded into memory at startup (duh), so taking a memory dump should give us a valid copy of the decrypted glossary, right?

  4. #4
    son of Bungo & Belladonna bilbo's Avatar
    Hello, acidmelt!

    I had some time to make other nice steps in "our" project. Let's see...
    Quote Originally Posted by acidmelt
    it seems that at startup babylon is iterating thru all the files inside %windir%\fonts and opening each one of them... i dont see any reason for that.
    No, that's not what I see (I've checked with FILEMON). It could be that you have a fresh installation of Babylon, and it is still auto-learning the fonts installed on your system for OCR. If that is the case, you should also see a high CPU load on your system for the next few hours.
    Quote Originally Posted by acidmelt
    anyways, i have stepped-over the code searching for the right CreateFile() and i wasnt able to find any reference that is opening a *.bgl.
    That is why I suggested setting the breakpoint only after the initial phase, and loading a new BGL once the program is already running.
    Quote Originally Posted by acidmelt
    another problem was that as soon as i go into glossaries->add glossaries olly reports a memory access violation.. id guess that babylon does holds some sort of anti-reversal protection
    You're right, I don't use Olly and I did not notice it. It is neither a memory access violation nor an anti-debugging trick. It is a flood of C++ exceptions (code E06D7363). I don't know the exact reason. Anyway: Options->Debugging Options->Exceptions->select Ignore Custom Exceptions and press the button "Add last exception". This solves the Olly problem!

    Quote Originally Posted by acidmelt
    oh, and one last thing.. judging by the ram usage and the speed of seeking i assume that the glossaries are being loaded into memory at startup (duh.)
    Correct!
    Quote Originally Posted by acidmelt
    so taking a memory dump should provide us with a valid copy of the decrypted gloss right?
    You still have to locate the data and interpret it, though!

    Quote Originally Posted by acidmelt
    Ive found this page (http://fjolliton.free.fr/babytrans/) which asserts that the new babylon bgl format is encrypted using the "Cipher Square" algorithm (http://www.esat.kuleuven.ac.be/~rijmen/square/).
    That information is wrong, as far as I've seen!

    And now the good news.
    What you already found, the 4 (8?)-byte signature, can be of three types:
    12340003 .BDC extension - to be studied
    12340002 .BGL generated by the builder in some cases - to be studied
    12340001 .BGL distributed on Babylon site - I've started from these...
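
    The three signatures above can be told apart with a few byte comparisons. This is only a minimal sketch: it assumes the signature is stored in the file in the byte order written above (12 34 00 01 and so on), which has not been confirmed here, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; the values mirror the three signatures listed above. */
enum sig_kind {
    SIG_UNKNOWN = 0,
    SIG_BGL_SITE = 1,     /* 12340001: .BGL distributed on the Babylon site */
    SIG_BGL_BUILDER = 2,  /* 12340002: .BGL generated by the builder */
    SIG_BDC = 3           /* 12340003: .BDC */
};

/* Classify a file by its first four bytes.
   Assumption: the bytes appear in the file as 12 34 00 0x. */
enum sig_kind classify_signature(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 4 || buf[0] != 0x12 || buf[1] != 0x34 || buf[2] != 0x00)
        return SIG_UNKNOWN;
    switch (buf[3]) {
    case 0x01: return SIG_BGL_SITE;
    case 0x02: return SIG_BGL_BUILDER;
    case 0x03: return SIG_BDC;
    default:   return SIG_UNKNOWN;
    }
}
```

    Read the first four bytes of the file into a buffer and pass them in; anything that does not match comes back as SIG_UNKNOWN.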

    I've managed to identify their decompression (not decryption) algorithm, using the 5 steps I suggested you. It is simply ZLIB, release 1.1.3 (rather old...). The routines are inside BabyServices.DLL, but they are called from BContentServer.DLL. I will tell you more details in the following messages if you are interested.

    Since the library is completely free and not GPL-ed, they cannot be blamed for a GPL violation, I suppose.

    Now, take one BGL of the last type, remove the first 0x47 bytes, and save it with a .GZ extension. The new file must start with 0x1F. Then you can extract it with WinZip and browse its uncompressed contents.
    Not so bad, is it? There are many initial fields we have yet to work out, though.
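
    The manual step above (drop the first 0x47 bytes, save the rest as .GZ) can be sketched in C. This is a rough sketch, not a tested tool: the function name is hypothetical, and the check for a second magic byte 0x8B comes from the gzip format itself rather than from anything observed in Babylon.

```c
#include <assert.h>
#include <stdio.h>

/* Copy everything from byte offset `skip` of `in` to `out`.
   Returns the number of bytes copied, or -1 on an I/O error or if the
   data at `skip` does not start with the gzip magic bytes 0x1F 0x8B. */
long copy_gzip_payload(FILE *in, FILE *out, long skip)
{
    unsigned char buf[4096];
    size_t n;
    long total = 0;

    assert(in != NULL && out != NULL);
    if (fseek(in, skip, SEEK_SET) != 0)
        return -1;
    n = fread(buf, 1, sizeof buf, in);
    if (n < 2 || buf[0] != 0x1F || buf[1] != 0x8B)
        return -1;
    do {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
        total += (long)n;
    } while ((n = fread(buf, 1, sizeof buf, in)) > 0);
    return total;
}
```

    Open the BGL with "rb" and the output with "wb", then call it with skip set to 0x47 for a 12340001 file.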

    If you want to play at reversing some more, put a breakpoint at 0x9B29AF, run Baby and "Install glossary from disk" as I told you at step (4).
    You should land at this code:
    Code:
    009B29AF   lea         ecx,[ebp-1030h]  ; uncompressed buffer to be filled
    009B29B5   push        1  ; number of bytes to uncompress
    009B29B7   push        ecx
    009B29B8   mov         ecx,dword ptr [ebp-1Ch]
    009B29BB   push        ecx  ; compression structure 64h bytes
    009B29BC   mov         ecx,eax  ; ZLIB object (Baby source is in C++)
    009B29BE   call        dword ptr [edx+18h]  ; inflate
    Execute the whole subroutine and you will find in the buffer the first uncompressed byte, 60 in my case. Try to discover the meaning of that value...
    I stop here at the moment... no more time.

    Best regards, bilbo

    P.S. JMI, I don't know if I can go on. Maybe the subforum is not correct, the matter is against rules, nobody else is interested, etc. etc.
    Please let me know...

  5. #5
    Seems OK so far. Go for it.

    Regards,
    JMI

  6. #6
    acidmelt
    Guest
    bilbo that is some awesome information!

    here are my findings:
    to my surprise, after decompression the resulting files don't require any further decryption. after scrolling a bit (offset 0xC47 in the eng_eng dictionary) you can see simple html tags with the definitions in between them.
    try changing the extension of the uncompressed file to html

    i have created a simple glossary with only 3 words to figure out how the definitions are laid out:
    TERM 0x000C DEFINITION 0x101809 TERM 0x000C and so on..
    however, this is different in 12340001 bgls. i'll analyse them further.

    the byte at offset 0x5 points to the beginning of the gzip header, which is convenient:
    the gzip header of 12340001 files starts at 0x47 (as you said).
    the gzip header of 12340002 files starts at 0x69.
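
    The offset byte described above can be read and sanity-checked in a few lines. This is a minimal sketch working on a buffer that already holds the start of the file. The function name is hypothetical, and checking for the gzip magic bytes 0x1F 0x8B at the target is an extra safety step taken from the gzip format, not something Babylon is known to do.

```c
#include <assert.h>
#include <stddef.h>

/* Return the gzip offset stored in the byte at offset 0x5, or -1 if the
   buffer is too short or the offset does not point at the gzip magic. */
int gzip_offset_from_header(const unsigned char *buf, size_t len)
{
    size_t off;

    assert(buf != NULL);
    if (len < 6)
        return -1;
    off = buf[5];               /* single byte, so the offset is at most 0xFF */
    if (off + 1 >= len)
        return -1;
    if (buf[off] != 0x1F || buf[off + 1] != 0x8B)
        return -1;
    return (int)off;
}
```

    With the observations above, a 12340001 file should give 0x47 and a 12340002 file 0x69.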

    on 12340003 (*.bdc) files, however, this is not the case: these files seem to be uncompressed, and their format seems similar to the old *.dic.

    p.s. i'm stupid, i totally forgot about babylon's ocr capabilities.

    thank you bilbo
    Last edited by acidmelt; April 21st, 2005 at 03:58.

  7. #7
    acidmelt
    Guest
    hey bilbo!

    thanks for the corrections.
    my previous code ignored some important details, which crippled the parsing. anyways, here is a fixed version incorporating zlib, so there is no need to unpack bgls manually:
    Code:
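    #include <stdio.h>
    #include <string.h>   //strlen, strcpy, strcat, memset
    #include <ctype.h>    //isalpha, tolower
    #include <windows.h>  //GetTempPath, GetTempFileName, DeleteFile
    #include "zlib.h"     //gzopen, gzread, gzclose
    
    #pragma comment(lib,"zlib.lib")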
    
    int isvalidchar(char ch);
    void stripjunk(char *buffer,char type);
    int focc(char *cstr,char ch);
    int uncomp_bgl(char *bglname,char *datname);
    int writegls(char *datname);
    
    char glsheader[1024];
    char glsheadertemplate[]=
    "### Glossary title:%s\r\n"
    "### Author:%s\r\n"
    "### Description:%s\r\n"
    "### Source language:English\r\n"
    "### Source alphabet:Default\r\n"
    "### Target language:English\r\n"
    "### Target alphabet:Default\r\n"
    "### Browsing enabled?No\r\n"
    "### Type of glossary:00000000\r\n"
    "### Case sensitive words?0\r\n"
    ";gls generated by bglgls\r\n\r\n"
    "### Glossary section:\r\n\r\n";
    
    int main(int argc,char **argv) {
    int ix;
    char szAuth[32];
    char szTitle[32];
    char szDescription[128];
    char datfname[128];
    
    if(argc!=2) { 
    	printf("usage: bglgls.exe filename.bgl\n"); 
    	return 0; 
    }
    //>get input (an empty string is used if fgets fails)
    printf("gls Author:");
    if(!fgets(szAuth,32,stdin)) szAuth[0]=0;
    printf("gls Title:");
    if(!fgets(szTitle,32,stdin)) szTitle[0]=0;
    printf("gls Description:");
    if(!fgets(szDescription,128,stdin)) szDescription[0]=0;
    
    //strip the trailing newline that fgets keeps, if there is one
    szAuth[strcspn(szAuth,"\r\n")]=0;
    szTitle[strcspn(szTitle,"\r\n")]=0;
    szDescription[strcspn(szDescription,"\r\n")]=0;
    //the template takes title, author, description, in that order
    sprintf(glsheader,glsheadertemplate,szTitle,szAuth,szDescription);
    //>set output filename (leave room for the ".dat" suffix)
    if(strlen(argv[1])>=sizeof(datfname)-4) { printf("filename too long\n"); return 0; }
    strcpy(datfname,argv[1]);
    ix=focc(datfname,'.'); //first '.' in the name, so dots in the path will confuse it
    if(ix<0) { printf("invalid filename\n"); return 0; }
    datfname[ix]=0;
    strcat(datfname,".dat");
    //>>
    if(!uncomp_bgl(argv[1],datfname)) { printf("error uncompressing BGL.\n"); return 0; }
    if(!writegls(datfname)) { printf("error writing GLS.\n"); return 0; }
    return 0;
    }
    //>>uncompression routine
    int uncomp_bgl(char *bglname,char *datname) {
    FILE *ztmp;
    FILE *zfile;
    gzFile gz; //gzopen returns a gzFile, not a FILE*
    char iobuff[128];
    char tmppath[256];
    char tmpfname[256];
    unsigned char zptrbyte;
    int tread;
    
    //get temp filename
    GetTempPath(256,tmppath);
    GetTempFileName(tmppath,"bgl",0,tmpfname);
    zfile=fopen(bglname,"rb");
    if(!zfile) { DeleteFile(tmpfname); return 0; }
    ztmp=fopen(tmpfname,"wb");
    if(!ztmp) { fclose(zfile); DeleteFile(tmpfname); return 0; }
    //>the byte at offset 0x5 holds the offset of the gzip header
    fseek(zfile,0x5,SEEK_SET);
    if(fread(&zptrbyte,sizeof(char),1,zfile)!=1) {
    	fclose(zfile); fclose(ztmp); DeleteFile(tmpfname);
    	return 0;
    }
    printf("zlib header@0x%X\n",zptrbyte);
    fseek(zfile,zptrbyte,SEEK_SET);
    //copy the gzip stream into the temp file
    while((tread=(int)fread(iobuff,sizeof(char),128,zfile))>0)
    	fwrite(iobuff,sizeof(char),tread,ztmp);
    fclose(zfile);
    fclose(ztmp);
    //>>uncompressing >
    zfile=fopen(datname,"wb");
    gz=gzopen(tmpfname,"rb");
    if(!zfile||!gz) {
    	if(zfile) fclose(zfile);
    	if(gz) gzclose(gz);
    	DeleteFile(tmpfname);
    	return 0;
    }
    //gzread returns 0 at end of stream and -1 on error
    while((tread=gzread(gz,iobuff,128))>0)
    	fwrite(iobuff,sizeof(char),tread,zfile);
    gzclose(gz);
    fclose(zfile);
    DeleteFile(tmpfname); //get rid of temporary file
    return tread==0; //success only if the stream ended cleanly
    }
    //write gls
    int writegls(char *datname) {
    FILE *fdic,*fgls;
    int ix,rec_length;
    short int lenword;
    unsigned char hdr,high_nibble,lenbyte;
    unsigned char lenmul,lenadd;
    unsigned long datapos;
    char tmpbuff[1024];
    char glsf[256];
    int tt=0,lt=0;
    
    //gls filename
    strcpy(glsf,datname);
    ix=focc(glsf,'.');
    glsf[ix]=0;
    strcat(glsf,".gls");
    printf("gls filename:%s\n",glsf);
    fgls=fopen(glsf,"wb");
    if(!fgls) return 0;
    //>write header
    printf("writing GLS");
    fwrite(glsheader,sizeof(char),strlen(glsheader),fgls);
    //>>parsing
    fdic=fopen(datname,"rb");
    if(!fdic) return 0;
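    while(1) {
    	if(fread(&hdr,sizeof(char),1,fdic)!=1) break;
    
    	//get record size: a high nibble of 4 or more encodes (length+4) directly,
    	//otherwise the next high_nibble+1 bytes hold the length, big-endian
    	high_nibble=hdr >> 4;
    	if(high_nibble>=4) rec_length=high_nibble-4;
    	else {
    		for(ix=rec_length=0;ix<high_nibble+1;ix++) {
    			rec_length*=256;
    			if(fread(&lenbyte,sizeof(char),1,fdic)!=1) break;
    			rec_length+=lenbyte;
    		}
    	}
    	datapos=ftell(fdic);
    
    	//the low nibble is the record type; type 1 is a term plus its definition
    	switch(hdr & 0xF) {
    		case 1: {
    			//term: one length byte, then the text
    			fread(&lenbyte,sizeof(char),1,fdic);
    			memset(tmpbuff,0,1024);
    			fread(tmpbuff,sizeof(char),lenbyte,fdic);
    			//cast first: passing a negative char to isalpha is undefined
    			if(!isalpha((unsigned char)tmpbuff[0])) break;
    			stripjunk(tmpbuff,0);
    			strcat(tmpbuff,"\r\n");
    			fwrite(tmpbuff,sizeof(char),strlen(tmpbuff),fgls);
    			//definition: two length bytes (big-endian), then the text
    			fread(&lenmul,sizeof(char),1,fdic);
    			fread(&lenadd,sizeof(char),1,fdic);
    			memset(tmpbuff,0,1024);
    			//clamp in an int first: values above 32767 would overflow the short
    			ix=lenmul*256+lenadd;
    			if(ix>1019) ix=1019; //leave room for "\r\n\r\n" and the terminator
    			lenword=(short)ix;
    			fread(tmpbuff,sizeof(char),lenword,fdic);
    			stripjunk(tmpbuff,1);
    			strcat(tmpbuff,"\r\n\r\n");
    			fwrite(tmpbuff,sizeof(char),strlen(tmpbuff),fgls);
    			if(tt-100==lt) { lt=tt; printf("."); } //one dot per 100 terms
    			tt++;
    		} break;
    		default: break;
    	}
    	//skip to the next record, whatever was read from this one
    	fseek(fdic,datapos+rec_length,SEEK_SET);
    }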
    fclose(fdic);
    fclose(fgls);	
    DeleteFile(datname); //we dont need the *.dat anymore..
    printf("%d terms written to file!\n",tt);
    return 1;
    }
    //find occurrence
    int focc(char *cstr,char ch) { 
    int ix;
    for(ix=0;(unsigned)ix<strlen(cstr);ix++)
    	if(cstr[ix]==ch) return ix;
    return -1;
    }
    //>
    void stripjunk(char *buffer,char type) {
    int ix,slen;
    slen=strlen(buffer);
    
    if(!type) {
    	for(ix=1;ix<slen;ix++)
    		if(buffer[ix]=='$') { buffer[ix]=0; break; }
    	slen=ix;
    }	
    for(ix=0;ix<slen;ix++) 
    	if(!isvalidchar(buffer[ix])) { buffer[ix]=0; break; }
    }
    //valid term/definition char
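    int isvalidchar(char ch) {
    	int ix;
    	//'*' replaces a stray '8' (shift-8 typo) and the doubled "%%" is reduced to '%'
    	char valtab[]="abcdefghijklmnopqrstuvwxyz 0123456789!@#$%&*()_-+=|{}[]<>\"',.%/\\:;!?";
    	//cast first: passing a negative char to tolower is undefined
    	ch=(char)tolower((unsigned char)ch);
    	for(ix=0;(unsigned)ix<strlen(valtab);ix++)
    		if(ch==valtab[ix]) return 1;
    	return 0;
    }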
    i have tested it with the code_analysis bgl that you suggested and it now works perfectly. i have also tested it with babylon's english_english dictionary (since it is the largest, 18mb unpacked) and it works really well.
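
    The record-size logic in writegls above can be pulled out into a small helper that works on a memory buffer. This is a sketch only: the layout is inferred from the code above, not from any documentation, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Decode one record header starting at buf[0].
   The low nibble of the first byte is the record type. A high nibble of 4
   or more encodes (length + 4) directly; otherwise the next high_nibble + 1
   bytes hold the length, most significant byte first.
   Returns the number of header bytes consumed, or 0 if buf is too short. */
size_t decode_record_header(const unsigned char *buf, size_t len,
                            unsigned *type, unsigned long *rec_length)
{
    unsigned high_nibble, i;
    size_t used = 1;

    assert(buf != NULL && type != NULL && rec_length != NULL);
    if (len < 1)
        return 0;
    *type = buf[0] & 0x0F;
    high_nibble = buf[0] >> 4;
    if (high_nibble >= 4) {
        *rec_length = high_nibble - 4;
        return used;
    }
    if (len < (size_t)high_nibble + 2)
        return 0;
    *rec_length = 0;
    for (i = 0; i <= high_nibble; i++)
        *rec_length = *rec_length * 256 + buf[used++];
    return used;
}
```

    The payload of the record then starts at buf + (return value) and runs for rec_length bytes, which matches the fseek to datapos+rec_length in the loop above.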

    something is still missing, though, and i couldn't figure it out: babylon shows the part of speech for each word, and i'd guess that this information is stored in a table somewhere inside the bgl. that's the last missing piece, i believe.

    here's a binary compiled and linked with zlib.

  8. #8
    hrmprog
    Guest
    hi
    i tried to use the above code with a unicode BGL, but the output file was incomplete.
    which part of the code should be corrected?

  9. #9
    Administrator dELTA's Avatar
    Join Date
    Oct 2000
    Location
    Ring -1
    Posts
    4,206
    Blog Entries
    5
    The one that fails. And why don't you debug it and tell us which one that is?
    "Give a man a quote from the FAQ, and he'll ignore it. Print the FAQ, shove it up his ass, kick him in the balls, DDoS his ass and kick/ban him, and the point usually gets through eventually."

  10. #10
    hrmprog
    Guest
    when i use the above code with a unicode BGL, the output file contains no unicode letters, only english ones. i am trying to convert an english to farsi BGL, but only the english words appear in the output file.

  11. #11
    Administrator dELTA's Avatar
    Windows uses special APIs to handle unicode strings, you must integrate these into the existing source code.

  12. #12
    afree
    Guest

    HI

    Hi,
    Has anyone compiled it with these new changes (to work with Arabic)? If yes, could you post it? I just can't compile it.

  13. #13
    son of Bungo & Belladonna bilbo's Avatar
    Quote Originally Posted by afree
    I just can't Compile it
    Please don't be so categorical! Get a free compiler, get ZLIB.LIB (first hit in Google) and you too will be able to compile it.

    Best regards, bilbo

  14. #14
    afree
    Guest

    hi

    I worked a little bit in C, but I have forgotten almost all of it.
    Anyway, I did manage to compile it, but something doesn't work.
    The program starts and reads data (I think), but it doesn't write anything to the file except the header. I will take a look at it later.

  15. #15
    hi. I have the same problem as hrmprog and have been waiting a long time for an answer in this thread, but it does not seem to have continued. In fact, the main guys haven't been here since 2005!
    Many of the Babylon BGLs are in unicode, so it is very important to be able to handle unicode BGLs as well. I know little about C coding and have had no success in adapting acidmelt's code for unicode. Would someone please help me modify his code for unicode BGLs?
    dELTA is probably right, but only in theory. Thanks to acidmelt, the code is presented above. It would be appreciated if someone posted the unicode-corrected code here. thx
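
    One likely reason the converter above loses non-English text is visible in its own filters: isvalidchar accepts only a fixed ASCII table, so stripjunk cuts a string at the first byte outside it, and the isalpha test on the first character would normally skip terms that do not start with a Latin letter. Below is a minimal sketch of a more permissive filter. It assumes (this has not been verified against any BGL) that such glossaries store non-English text as bytes with the high bit set; is_text_byte and strip_controls are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Keep printable ASCII and every byte with the high bit set;
   reject control characters (0x00-0x1F) and DEL (0x7F). */
int is_text_byte(unsigned char b)
{
    return b >= 0x20 && b != 0x7F;
}

/* Cut the string at the first rejected byte, in the same spirit as
   stripjunk above. Returns the resulting length. */
size_t strip_controls(char *s)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; s[i] != '\0'; i++) {
        if (!is_text_byte((unsigned char)s[i])) {
            s[i] = '\0';
            break;
        }
    }
    return i;
}
```

    The output is still in whatever encoding the BGL uses. Converting it for the GLS file is a separate step, which is where the Windows Unicode APIs dELTA mentions could come in.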
