
Thread: BGL (babylon glossary) to GLS (babylon glossary source).

  1. #1
    acidmelt
    Guest

    BGL (babylon glossary) to GLS (babylon glossary source).

    hello reversers, a while back i tried to reverse the babylon BGL format so i would be able to read data out of it and use it in my own personal project. this task however is beyond my very noobish debugging skills and obviously i failed.

    i wasnt sure about where this post best fits in, it was either advanced reversing (since this requires pretty advanced practices) or the mini project area, feel free to move it.

    anyways this is the data i have gathered so far:
    the decryption algo can be found at the "babylon" program itself (http://www.babylon.com)
    the encryption algo can be found at the "babylon builder" program which is used to write dictionaries and is publicly available (http://www.babylon.com/builder)
    -----------
    there is zero documentation about this format available on the net.
    ive found this page (http://fjolliton.free.fr/babytrans/) which asserts that the new babylon bgl format is encrypted using the "Cipher Square" algorithm (http://www.esat.kuleuven.ac.be/~rijmen/square/).
    -----------
    after examining a few *.bgl's it is visible that the first 8 bytes of the file are the signature.
    ive checked wotsit.org for documentation and found nothing.

    in a recent thread (http://www.woodmann.com/forum/showthread.php?t=6934) bilbo suggested this as a project.. so i thought that id start this thread and see what happens.

    what do you say?
    I promise that I have read the FAQ and tried to use the Search to answer my question.

  2. #2
    son of Bungo & Belladonna bilbo's Avatar
    Join Date
    Mar 2004
    Location
    Rivendell
    Posts
    310
    In my opinion, that would be a nice true RE activity, and not related to software stealing...
    You have my support, as long as I have time...!

    For the moment, I tell you how I would start...

    (1) Install Babylon - I have 5.0.1 r7 - dunno if it's the latest - and focus on one BGL you have installed...
    The program is not compressed/protected in any way from debuggers.
    (2) Menu->Glossaries->Glossary Options, and remove your BGL
    (3) Attach to Babylon.exe with your preferred debugger and set a breakpoint on CreateFileA / ReadFile
    (4) Menu->Glossaries->Install glossary from disk, and reinstall your target BGL
    (5) Debugger will break at start of API: on stack you will find the return address and the BGL file name.
    (6) ... no time for now to go on...

    Best regards, bilbo
    Non quia difficilia sunt, non audemus, sed quia non audemus, difficilia sunt.[Seneca, Epistulae Morales 104, 26]

  3. #3
    acidmelt
    Guest
    hey bilbo, i have tried your suggestion for debugging the app (using olly) and i have encountered a rather strange behaviour.. it seems that at startup babylon is iterating thru all the files inside %windir%\fonts and opening each one of them... i dont see any reason for that.
    anyways, i have stepped-over the code searching for the right CreateFile() and i wasnt able to find any reference that is opening a *.bgl.

    another problem was that as soon as i go into glossaries->add glossaries olly reports a memory access violation.. id guess that babylon has some sort of anti-reversing protection

    as i said my debugging skills are very limited and i would be glad if you (bilbo) or any other experienced reversers would take a look at that

    oh, and one last thing.. judging by the ram usage and the speed of seeking i assume that the glossaries are being loaded into memory at startup (duh.) so taking a memory dump should provide us with a valid copy of the decrypted gloss right?

  4. #4
    son of Bungo & Belladonna bilbo's Avatar
    Join Date
    Mar 2004
    Location
    Rivendell
    Posts
    310
    Hello, acidmelt!

    I had some time to make other nice steps in "our" project. Let's see...
    Quote Originally Posted by acidmelt
    it seems that at startup babylon is iterating thru all the files inside %windir%\fonts and opening each one of them... i dont see any reason for that.
    No, that's not my case (I've checked with FILEMON). It could be that you have a fresh installation of Babylon, and it is still auto-learning the fonts installed on your system for OCR. If that is the case you should also see a high CPU load on your system for the next few hours.
    Quote Originally Posted by acidmelt
    anyways, i have stepped-over the code searching for the right CreateFile() and i wasnt able to find any reference that is opening a *.bgl.
    That was the reason I suggested that you set the breakpoint only after the initial phase and load a new BGL when the program is already running.
    Quote Originally Posted by acidmelt
    another problem was that as soon as i go into glossaries->add glossaries olly reports a memory access violation.. id guess that babylon does holds some sort of anti-reversal protection
    You're right, I don't use Olly and I did not notice it. It is neither a Memory Access Violation nor an anti-debugging trick. It is a flood of C++ exceptions (code E06D7363). I dunno the exact reason. Anyway: Options->Debugging Options->Exceptions->select Ignore Custom Exceptions and press the "Add last exception" button. This solves the Olly problem!

    Quote Originally Posted by acidmelt
    oh, and one last thing.. judging by the ram usage and the speed of seeking i assume that the glossaries are being loaded into memory at startup (duh.)
    Correct!
    Quote Originally Posted by acidmelt
    so taking a memory dump should provide us with a valid copy of the decrypted gloss right?
    You still have to locate the data and interpret it, though!

    Quote Originally Posted by acidmelt
    Ive found this page (http://fjolliton.free.fr/babytrans/) which asserts that the new babylon bgl format is encrypted using the "Cipher Square" algorithm (http://www.esat.kuleuven.ac.be/~rijmen/square/).
    That info is wrong, as far as I've seen!

    And now the good news.
    What you already found, the 4(8?)-bytes signature, can be of three types:
    12340003 .BDC extension - to be studied
    12340002 .BGL generated by the builder in some cases - to be studied
    12340001 .BGL distributed on Babylon site - I've started from these...

    I've managed to identify their decompression (not decryption) algorithm, using the 5 steps I suggested you. It is simply ZLIB, release 1.1.3 (rather old...). The routines are inside BabyServices.DLL, but they are called from BContentServer.DLL. I will tell you more details in the following messages if you are interested.

    Since the Library is completely free, and not GPL-ed, they cannot be blamed for performing a GPL violation, I suppose.

    Now, take one BGL of the last type, remove the first 0x47 bytes, and save it with a .GZ extension. The new file must start with 0x1F. Then you can extract it with WinZip, and you can browse its uncompressed contents.
    Not so bad, is it? There are still many initial fields we must figure out, though.

    If you want to play reversing some more, put a breakpoint at 0x9B29AF, run Baby and "Install glossary from disk" as I told you at step (4).
    You must land at this code
    Code:
    009B29AF   lea         ecx,[ebp-1030h]  ; uncompressed buffer to be filled
    009B29B5   push        1  ; number of bytes to uncompress
    009B29B7   push        ecx
    009B29B8   mov         ecx,dword ptr [ebp-1Ch]
    009B29BB   push        ecx  ; compression structure 64h bytes
    009B29BC   mov         ecx,eax  ; ZLIB object (Baby source is in C++)
    009B29BE   call        dword ptr [edx+18h]  ; inflate
    Execute the whole subroutine and you will find in the buffer the first uncompressed byte, 60 in my case. Try to discover the meaning of that value...
    I stop here at the moment... no more time.

    Best regards, bilbo

    P.S. JMI, I don't know if I can go on. Maybe the subforum is not correct, the matter is against rules, nobody else is interested, etc. etc.
    Please let me know...

  5. #5
    Seems OK so far. Go for it.

    Regards,
    JMI

  6. #6
    acidmelt
    Guest
    bilbo that is some awesome information!

    here are my findings:
    to my surprise, after decompression the resulting files dont require any further decryption.. after scrolling a bit (offset 0xC47 at the eng_eng dictionary) you can see simple html tags and inbetween them are the definitions
    try changing the extension of the uncompressed file to html

    i have created a simple glossary with only 3 words to figure out the way that the definitions are aligned:
    TERM 0x000C DEFINITION 0x101809 TERM 0x000C and so on..
    however this is different in 12340001 bgls.. ill further analyse them.

    the byte at offset 0x5 points to the begining of the gzip header, convenient
    the gzip header of 12340001 files starts at 0x47 (as you said).
    the gzip header of 12340002 files starts at 0x69.

    on 12340003 (*.bdc) files however this is not the case.. these files seem to be uncompressed and it seems that their format is similar to the old *.dic.

    p.s im stupid, i totally forgot about babylons ocr capabilities.

    thank you bilbo
    Last edited by acidmelt; April 21st, 2005 at 03:58.

  7. #7
    son of Bungo & Belladonna bilbo's Avatar
    Join Date
    Mar 2004
    Location
    Rivendell
    Posts
    310
    Hello, acidmelt, and everyone interested (nobody seems to be...),

    try changing the extension of the uncompressed file to html
    ok, but that is just a resource... the whole dictionary is not HTML format
    the byte at offset 0x5 points to the begining of the gzip header, convenient
    great... I would say at offset 0x4, though, because all the Baby entities are in big-endian form (the high byte first, read on)
    the gzip header of 12340002 files starts at 0x69
    great! one point to you!
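    The header observations so far can be put into a few lines of C. This is only a sketch under the assumptions made above: the signature is the byte sequence 12 34 00 0x, and the gzip offset is a 16-bit big-endian value at offsets 0x4..0x5. The function name bgl_gzip_offset is made up for illustration and does not come from Babylon.

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset of the gzip stream inside a BGL,
 * or -1 if the buffer does not look like a compressed BGL header.
 * Assumption: the offset is stored big-endian in bytes 4..5. */
long bgl_gzip_offset(const unsigned char *hdr, size_t n)
{
	if (n < 6)
		return -1;                      /* header too short */
	if (hdr[0] != 0x12 || hdr[1] != 0x34 || hdr[2] != 0x00)
		return -1;                      /* not a Babylon signature */
	if (hdr[3] != 0x01 && hdr[3] != 0x02)
		return -1;                      /* 03 = .BDC, reported above as apparently uncompressed */
	return ((long)hdr[4] << 8) | hdr[5];    /* high byte first */
}
```

    With the two offsets reported above, a header ending in 00 47 gives 0x47 and one ending in 00 69 gives 0x69. The byte at the returned offset should then be 0x1F, the first byte of a gzip header.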

    And now the step for today...

    I started from the address I told you yesterday and I have reversed some stuff here and there (sub_9B1DC0 and related ones). These are my findings.

    The uncompressed file is a collection of records.
    Every record has a one-byte header.
    The low nibble is the record type.
    The high nibble holds indication of the record length, with the following rule:

    high nibble>=4: subtract 4; that is the length
    high nibble <4: add 1: that is the number of bytes for the following length (in big-endian format)

    As for the record types:
    0 - one-byte specifier will follow, and the data next
    1 - this is an entry: the entry name will follow as a string preceded by one byte for length, and the definition next
    2 - this is a named resource: the resource name will follow as above (e.g xxx.bmp, xxx.html) (and the data next)
    3 - two byte specifier will follow, and the data next
    4/6 - no specifier, 0 bytes of data - type 6 is at end
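
    As a worked example of the nibble rule, here is a sketch that decodes one record header from a memory buffer instead of a file. The name decode_record_header and its signature are hypothetical; only the nibble arithmetic comes from the description above.

```c
#include <stddef.h>

/* Hypothetical helper: decodes the header of one record.
 * Stores the record type (low nibble) and the record length, and returns
 * the number of header bytes consumed (1..5), or 0 if the buffer is too short. */
size_t decode_record_header(const unsigned char *p, size_t n,
                            int *type, unsigned long *length)
{
	size_t i, nlen;
	unsigned high;

	if (n < 1)
		return 0;
	*type = p[0] & 0x0F;                    /* low nibble: record type */
	high = p[0] >> 4;                       /* high nibble: length rule */
	if (high >= 4) {
		*length = high - 4;             /* length is encoded in the nibble itself */
		return 1;
	}
	nlen = high + 1;                        /* number of length bytes that follow */
	if (n < 1 + nlen)
		return 0;
	*length = 0;
	for (i = 0; i < nlen; i++)              /* big-endian: high byte first */
		*length = (*length << 8) | p[1 + i];
	return 1 + nlen;
}
```

    For instance, the single byte 0x61 is a type 1 record of length 2 (6 - 4), while the bytes 11 01 2C are a type 1 record whose length is held in the next two bytes, 0x012C = 300.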

    But I hate the theory, so here is a little program which will scan the whole uncompressed file.
    I have tried it successfully on a little BGL: Code Analysis, at http://info.babylon.com/gl_index/gl_template.php?id=46760

    Code:
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int
    main(int argc, char **argv)
    {
    	char resname[256];
    	unsigned char hdr, high_nibble, lenbyte;
    	unsigned char specifier[2];
    	int i, record_length;
    	FILE *fpin;
    	long curpos, datapos;
    
    	if (argc != 2) {
    		printf("usage: %s uncompressed_filename\n", argv[0]);
    		return 1;
    		}
    
    	fpin = fopen(argv[1], "rb");
    	if (!fpin) goto ko;
    
    		// a record per loop
    	while (1) {
    		curpos = ftell(fpin);
    			// a short read here means end of file
    		if (fread(&hdr, 1, sizeof(hdr), fpin) != sizeof(hdr)) break;
    
    			// get the record size
    		high_nibble = hdr >> 4;
    		if (high_nibble >= 4) record_length = high_nibble - 4;
    		else for (i=record_length=0; i<high_nibble+1; i++) {
    			record_length *= 256;
    			fread(&lenbyte, 1, sizeof(lenbyte), fpin);
    			record_length += lenbyte;
    			}
    		datapos = ftell(fpin);  // record_length is counted from here
    
    		switch (hdr & 0xF) {  // low nibble
    
    		case 0:  // one-byte specifier follows
    			fread(specifier, 1, 1, fpin);
    			printf("@%lx: <id %x> %x bytes\n",
    				curpos, specifier[0], record_length);
    			break;
    		case 3:  // two-bytes specifier follows (big-endian)
    			fread(specifier, 1, 2, fpin);
    			printf("@%lx: <id %x> %x bytes\n",
    				curpos, specifier[0]*256+specifier[1], record_length);
    			break;
    		case 4:  // no specifier
    		case 6:  // no specifier
    			printf("@%lx: <no id(%d)> %x bytes\n",
    				curpos, hdr&0xF, record_length);
    			break;
    
    		case 2:  // named resource
    			fread(&lenbyte, 1, sizeof(lenbyte), fpin);
    			fread(resname, 1, lenbyte, fpin);
    			printf("@%lx: <res %.*s> %x bytes\n",
    				curpos, lenbyte, resname, record_length);
    			break;
    
    		case 1:  // entry
    			fread(&lenbyte, 1, sizeof(lenbyte), fpin);
    			fread(resname, 1, lenbyte, fpin);
    			printf("@%lx: <entry> \"%.*s\"> %x bytes\n",
    				curpos, lenbyte, resname, record_length);
    			break;
    
    		default:
    			printf("unexpected low_nibble %x\n", hdr & 0xF);
    			fclose(fpin);
    			return 1;
    		}
    			// skip to the next record, whatever was read above
    		fseek(fpin, datapos+record_length, SEEK_SET);
    		}
    
    	fclose(fpin);
    	return 0;
    ko:
    	printf("exit due to error %d: %s\n", errno, strerror(errno));
    	return 1;
    }
    We need only to understand the meaning of the specifiers...
    Best regards, bilbo

  8. #8
    Administrator dELTA's Avatar
    Join Date
    Oct 2000
    Location
    Ring -1
    Posts
    4,204
    Blog Entries
    5
    Nice work as always bilbo.

    and everyone interested (nobody seems to be...)
    Sure we are, just lurking. Keep up the good work.

  9. #9
    acidmelt
    Guest
    hey bilbo!

    again thats plenty of great information.. thank you

    i wrote a little program to explore uncompressed bgls based on your code

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>
    #include <windows.h>
    #include <conio.h>
    
    int isvalidchar(char ch);
    void stripjunk(char *buffer);
    
    struct bdc {
    	char szTerm[256];
    	char szDefinition[256];
    } **babyterm[27]; 
    int ptrcnt[27]; //sorted
    
    void main(int argc,char **argv) {
    FILE *fdic;
    int ix,iy,rec_length;
    unsigned char hdr,high_nibble,lenbyte,tmpch;
    unsigned long datapos;
    char tmpbuff[256];
    char uterm[256];
    int bg,eg;
    
    if(argc!=2) { printf("usage: %s uncompressed_filename\n", argv[0]); return; }
    //initial allocation of pointers
    for(ix=0;ix<27;ix++)  {
    	babyterm[ix]=(struct bdc**)malloc(sizeof(struct bdc*));
    	ptrcnt[ix]=0;
    }
    //>>parsing
    fdic=fopen(argv[1],"rb");
    if(!fdic) { printf("error opening file [%s].\n",argv[1]); return; }
    bg=GetTickCount();
    while(1) {
    	fread(&hdr,sizeof(char),1,fdic);
    	if(feof(fdic)) break;
    
    	//get record size
    	high_nibble=hdr >> 4;
    	if(high_nibble>=4) rec_length=high_nibble-4;
    	else {
    		for(ix=rec_length=0;ix<high_nibble+1;ix++) {
    			rec_length*=256;
    			fread(&lenbyte,sizeof(char),1,fdic);
    			rec_length+=lenbyte;
    		}
    	}
    	datapos=ftell(fdic);
    
    	switch(hdr & 0xF) {
    			case 1: {
    			fread(&lenbyte,sizeof(char),1,fdic);
    			memset(tmpbuff,0,lenbyte+1);
    			fread(tmpbuff,sizeof(char),lenbyte,fdic);
    			if(!isalpha(tmpbuff[0])) break;
    			stripjunk(tmpbuff);
    			//printf("TERM [%s] -> \n",tmpbuff);
    			//>>allocating space for term struct
    			tmpch=tolower(tmpbuff[0])-'a';
    			babyterm[tmpch][ptrcnt[tmpch]]=(struct bdc*)malloc(sizeof(struct bdc));
    			if(babyterm[tmpch][ptrcnt[tmpch]]==NULL) {
    				printf(":O ran out of space.\n");
    				return;
    			}
    			strcpy(babyterm[tmpch][ptrcnt[tmpch]]->szTerm,tmpbuff);
    			//>>
    			fseek(fdic,1,SEEK_CUR); //definiton lenbyte is next
    			fread(&lenbyte,sizeof(char),1,fdic);
    			memset(tmpbuff,0,lenbyte+1);
    			fread(tmpbuff,sizeof(char),lenbyte,fdic);
    			stripjunk(tmpbuff);
    			strcpy(babyterm[tmpch][ptrcnt[tmpch]]->szDefinition,tmpbuff);
    			//printf("DEF [%s]\n",tmpbuff);
    			ptrcnt[tmpch]++;
    			} break;
    			default: break;
    	}
    	fseek(fdic,datapos+rec_length,SEEK_SET);
    }
    eg=GetTickCount();
    fclose(fdic);	
    
    printf("total parsing time: %dms\n",eg-bg);
    
    printf("--------------------------\n");
    for(ix=0;ix<27;ix++) {
    	if(ptrcnt[ix]>0) {
    		for(iy=0;iy<ptrcnt[ix];iy++) {
    		printf("--\n[%s][%s]\n",babyterm[ix][iy]->szTerm,babyterm[ix][iy]->szDefinition);
    		if(getch()==27) goto takeinp;
    		}
    	}
    }
    printf("--------------------------\n\n");
    takeinp:
    for(;;) { //input loop
    	memset(uterm,0,256);
    	printf("Term:");
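    	if(scanf("%255s",uterm)!=1) break; //255 leaves room for the terminator; stop at end of input
    	if(isalpha((unsigned char)uterm[0])) { //only a-z may index into babyterm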
    		tmpch=tolower(uterm[0])-'a';
    		for(ix=0;ix<ptrcnt[tmpch];ix++) 
    			if(!strcmpi(babyterm[tmpch][ix]->szTerm,uterm))
    			printf("%s = \n%s\n",uterm,babyterm[tmpch][ix]->szDefinition);
    	}
    }
    }
    
    void stripjunk(char *buffer) {
    int ix,slen;
    slen=strlen(buffer);
    
    for(ix=1;ix<slen;ix++)
    	if(buffer[ix]=='$') { buffer[ix]=0; break; }
    slen=ix;
    for(ix=0;ix<slen;ix++) 
    	if(!isvalidchar(buffer[ix])) { buffer[ix]=0; break; }
    	
    }
    
    int isvalidchar(char ch) {
    	int ix;
    	char valtab[]="abcdefghijklmnopqrstuvwxyz 0123456789!@#$%&8()_-+=|{}[]<>\"',.%%/\\:;!?";
    	ch=tolower(ch);
    
    	for(ix=0;(unsigned)ix<strlen(valtab);ix++)
    		if(ch==valtab[ix]) return 1;
    return 0;
    }
    however whats about the rest of the data?
    it seems as if the uncompressed files have some sort of header?

    [edit]
    i just took a look at some *.gls and i believe our goal is completed (well you did most of the work so kudos to you)

    the format is really simple:
    ### Glossary title:testTitle
    ### Author:testAuthor
    ### Description:testGlossDescription
    ### Source language:English
    ### Source alphabet:Default
    ### Browsing enabled?No
    ### Type of glossary:00000000
    ### Case sensitive words?0
    ### Glossary section:

    test1
    meaning1

    test2
    meaning2

    test3
    meaning3
    --------
    using the code above it is really easy to produce gls's..
    [/edit]
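
    To make the entry layout above concrete, here is a minimal sketch that formats one term/definition pair the way the sample shows it: term line, definition line, blank line. The helper name format_gls_entry is hypothetical, and the "\n" line ending is an assumption (a Windows tool may well expect "\r\n").

```c
#include <stdio.h>

/* Hypothetical helper: writes "term\ndefinition\n\n" into out.
 * Returns the number of characters written, or -1 if out is too small. */
int format_gls_entry(char *out, size_t cap, const char *term, const char *def)
{
	int n = snprintf(out, cap, "%s\n%s\n\n", term, def);

	/* snprintf reports the length it needed, so n >= cap means truncation */
	return (n >= 0 && (size_t)n < cap) ? n : -1;
}
```

    Called with "test1" and "meaning1" it produces the first entry of the sample glossary. The "###" header lines would be written once, before the first entry.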
    Last edited by acidmelt; April 22nd, 2005 at 04:05.

  10. #10
    son of Bungo & Belladonna bilbo's Avatar
    Join Date
    Mar 2004
    Location
    Rivendell
    Posts
    310
    Good, acidmelt, you added some indexing feature (dynamic array 'babyterm'), but... there is a bug...

    You initialized babyterm[ix] just once, at init time, with room for only one pointer! As soon as a letter gets a second entry, its pointer is written past the end of that allocation.
    You can remove the whole initialization loop, but you must add, before every new entry allocation, a resizing of the **babyterm array:
    Code:
    babyterm[tmpch] = (struct bdc**)realloc(babyterm[tmpch],
                               (ptrcnt[tmpch]+1)*sizeof(struct bdc*));
    babyterm[tmpch][ptrcnt[tmpch]] = (struct bdc*)malloc(sizeof(struct bdc));
    instead of the simple
    Code:
    babyterm[tmpch][ptrcnt[tmpch]] = (struct bdc*)malloc(sizeof(struct bdc));
    By the way, realloc will work also the first time, when the area to reallocate has address 0.
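
    A slightly more defensive version of the same idea, as a sketch: the result of realloc is checked before it replaces the old pointer, so a failed resize does not lose the array. The struct and function names (entry, bucket, bucket_push) are made up for this example and are not taken from the program above.

```c
#include <stdlib.h>
#include <string.h>

struct entry { char term[256]; };

/* One growable array of entry pointers; start it as {NULL, 0}. */
struct bucket { struct entry **items; size_t count; };

/* Appends a copy of term to the bucket. Returns 1 on success, 0 on failure. */
int bucket_push(struct bucket *b, const char *term)
{
	struct entry **grown;
	struct entry *e = malloc(sizeof *e);

	if (!e)
		return 0;
	strncpy(e->term, term, sizeof e->term - 1);
	e->term[sizeof e->term - 1] = '\0';     /* strncpy may not terminate */

	/* realloc(NULL, n) behaves like malloc(n), so the first push works too */
	grown = realloc(b->items, (b->count + 1) * sizeof *grown);
	if (!grown) {
		free(e);                        /* b->items is still valid here */
		return 0;
	}
	b->items = grown;
	b->items[b->count++] = e;
	return 1;
}
```

    Growing by one slot per entry keeps the example short; for a dictionary with many thousands of entries, doubling the capacity each time would call realloc far less often.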

    Ok.
    And you removed a lot of things: not just spaces in the source, I see, you don't like spaces ); but also non ASCII characters which are used as quotes or underscores, etc. If you try your program on the BGL I suggested, many definitions are cut.

    That's all for this weekend, I have other things to do...
    A simple addition would be to integrate ZLIB in the program in order to uncompress the file automatically...

    Best regards, bilbo

    P.S. thx dELTA (and acidmelt) for appreciation...
    P.P.S. I suggest having a look at the dictionary I linked in my previous message; there is also something in it for Fravia and for our friend Zero.

  11. #11
    acidmelt
    Guest
    hey bilbo!

    thanks for the corrections
    in my previous code i have ignored some important details which made the parsing crippled.. anyways here is a fixed code incorporating zlib, so there is no need to manually unpack bgls
    Code:
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>
    #include <windows.h>
    #include "zlib.h"
    
    #pragma comment(lib,"zlib.lib")
    
    int isvalidchar(char ch);
    void stripjunk(char *buffer,char type);
    int focc(char *cstr,char ch);
    int uncomp_bgl(char *bglname,char *datname);
    int writegls(char *datname);
    
    char glsheader[1024];
    char glsheadertemplate[]=
    "### Glossary title:%s\r\n"
    "### Author:%s\r\n"
    "### Description:%s\r\n"
    "### Source language:English\r\n"
    "### Source alphabet:Default\r\n"
    "### Target language:English\r\n"
    "### Target alphabet:Default\r\n"
    "### Browsing enabled?No\r\n"
    "### Type of glossary:00000000\r\n"
    "### Case sensitive words?0\r\n"
    ";gls generated by bglgls\r\n\r\n"
    "### Glossary section:\r\n\r\n";
    
    int main(int argc,char **argv) {
    int ix;
    char szAuth[32];
    char szTitle[32];
    char szDescription[128];
    char datfname[128];
    
    if(argc!=2) { 
    	printf("usage: bglgls.exe filename.bgl\n"); 
    	return 0; 
    }
    //>get input
    printf("gls Author:");
    fgets(szAuth,32,stdin);
    printf("gls Title:");
    fgets(szTitle,32,stdin);
    printf("gls Description:");
    fgets(szDescription,128,stdin);
    
    szAuth[strlen(szAuth)-1]=0;
    szTitle[strlen(szTitle)-1]=0;
    szDescription[strlen(szDescription)-1]=0;
    sprintf(glsheader,glsheadertemplate,szTitle,szAuth,szDescription); //template order: title, author, description
    //>set output filename
    strncpy(datfname,argv[1],123); datfname[123]=0; //always terminated, with room left for ".dat"
    ix=focc(datfname,'.');
    if(ix<0) { printf("invalid filename\n"); return 0; }
    datfname[ix]=0;
    strcat(datfname,".dat");
    //>>
    if(!uncomp_bgl(argv[1],datfname)) { printf("error uncompressing BGL.\n"); return 0; }
    if(!writegls(datfname)) { printf("error writing GLS.\n"); return 0; }
    return 0;
    }
    //>>uncompression routine
    int uncomp_bgl(char *bglname,char *datname) {
    FILE *ztmp;
    FILE *zfile;
    char iobuff[128];
    char tmppath[256];
    char tmpfname[256];
    unsigned char zptrbyte;
    int tread;
    
    //get temp filename
    GetTempPath(256,tmppath);
    GetTempFileName(tmppath,"bgl",0,tmpfname);
    ztmp=fopen(tmpfname,"wb");
    if(!ztmp) return 0;
    //>
    zfile=fopen(bglname,"rb");
    if(!zfile) return 0;
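    fseek(zfile,0x4,SEEK_SET); //the gzip offset is a big-endian word at 0x4..0x5
    fread(&zptrbyte,sizeof(char),1,zfile);
    tread=zptrbyte*256;
    fread(&zptrbyte,sizeof(char),1,zfile);
    tread+=zptrbyte;
    printf("zlib header@0x%X\n",tread);
    fseek(zfile,tread,SEEK_SET);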
    while(!feof(zfile)) {
    	tread=fread(iobuff,sizeof(char),128,zfile);
    	fwrite(iobuff,sizeof(char),tread,ztmp);
    }
    fclose(zfile);
    fclose(ztmp);
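    //>>uncompressing >
    zfile=fopen(datname,"wb");
    if(!zfile) return 0;
    { //gzopen returns a gzFile, not a FILE*, so it gets its own handle
    gzFile gz=gzopen(tmpfname,"rb");
    if(!gz) { fclose(zfile); return 0; }
    while((tread=gzread(gz,iobuff,128))>0) //0 = end of stream, <0 = error
    	fwrite(iobuff,sizeof(char),tread,zfile);
    gzclose(gz);
    }
    fclose(zfile);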
    DeleteFile(tmpfname); //get rid of temporary file
    return 1;
    }
    //write gls
    int writegls(char *datname) {
    FILE *fdic,*fgls;
    int ix,rec_length;
    short int lenword;
    unsigned char hdr,high_nibble,lenbyte;
    unsigned char lenmul,lenadd;
    unsigned long datapos;
    char tmpbuff[1024];
    char glsf[256];
    int tt=0,lt=0;
    
    //gls filename
    strcpy(glsf,datname);
    ix=focc(glsf,'.');
    glsf[ix]=0;
    strcat(glsf,".gls");
    printf("gls filename:%s\n",glsf);
    fgls=fopen(glsf,"wb");
    if(!fgls) return 0;
    //>write header
    printf("writing GLS");
    fwrite(glsheader,sizeof(char),strlen(glsheader),fgls);
    //>>parsing
    fdic=fopen(datname,"rb");
    if(!fdic) return 0;
    while(1) {
    	fread(&hdr,sizeof(char),1,fdic);
    	if(feof(fdic)) break;
    
    	//get record size
    	high_nibble=hdr >> 4;
    	if(high_nibble>=4) rec_length=high_nibble-4;
    	else {
    		for(ix=rec_length=0;ix<high_nibble+1;ix++) {
    			rec_length*=256;
    			fread(&lenbyte,sizeof(char),1,fdic);
    			rec_length+=lenbyte;
    		}
    	}
    	datapos=ftell(fdic);
    
    	switch(hdr & 0xF) {
    			case 1: {
    			fread(&lenbyte,sizeof(char),1,fdic);
    			memset(tmpbuff,0,1024);
    			fread(tmpbuff,sizeof(char),lenbyte,fdic);
    			if(!isalpha(tmpbuff[0])) break;
    			stripjunk(tmpbuff,0);
    			strcat(tmpbuff,"\r\n");
    			fwrite(tmpbuff,sizeof(char),strlen(tmpbuff),fgls);
    			fread(&lenmul,sizeof(char),1,fdic);
    				fread(&lenadd,sizeof(char),1,fdic);
    			memset(tmpbuff,0,1024);
    			lenword=lenmul*256+lenadd;
    			if(lenword>1019) lenword=1019;
    			fread(tmpbuff,sizeof(char),lenword,fdic);
    			stripjunk(tmpbuff,1);
    			strcat(tmpbuff,"\r\n\r\n");
    			fwrite(tmpbuff,sizeof(char),strlen(tmpbuff),fgls);
    			if(tt-100==lt) { lt=tt; printf("."); }
    			tt++;
    			} break;
    			default: break;
    	}
    	fseek(fdic,datapos+rec_length,SEEK_SET);
    }
    fclose(fdic);
    fclose(fgls);	
    DeleteFile(datname); //we dont need the *.dat anymore..
    printf("%d terms written to file!\n",tt);
    return 1;
    }
    //find occurrence
    int focc(char *cstr,char ch) { 
    int ix;
    for(ix=0;(unsigned)ix<strlen(cstr);ix++)
    	if(cstr[ix]==ch) return ix;
    return -1;
    }
    //>
    void stripjunk(char *buffer,char type) {
    int ix,slen;
    slen=strlen(buffer);
    
    if(!type) {
    	for(ix=1;ix<slen;ix++)
    		if(buffer[ix]=='$') { buffer[ix]=0; break; }
    	slen=ix;
    }	
    for(ix=0;ix<slen;ix++) 
    	if(!isvalidchar(buffer[ix])) { buffer[ix]=0; break; }
    }
    //valid term/definition char
    int isvalidchar(char ch) {
    	int ix;
    	char valtab[]="abcdefghijklmnopqrstuvwxyz 0123456789!@#$%&8()_-+=|{}[]<>\"',.%%/\\:;!?";
    	ch=tolower(ch);
    	for(ix=0;(unsigned)ix<strlen(valtab);ix++)
    		if(ch==valtab[ix]) return 1;
    return 0;
    }
    i have tested it with the code_analysis bgl that you suggested and it now works perfectly. i have also tested it with babylons english_english dictionary (since it is the largest (18mb unpacked)) and it works really well.

    though something is missing and i couldnt figure it out.. babylon shows part-of-speech for each word and id guess that this information is stored in a table somewhere inside the bgl.. thats the last piece missing i believe.

    heres a binary compiled and linked with zlib.

  12. #12
    hrmprog
    Guest
    hi
    i tried to use the above code with a unicode BGL but the output file was incomplete.
    which part of the code should be corrected?

  13. #13
    Administrator dELTA's Avatar
    Join Date
    Oct 2000
    Location
    Ring -1
    Posts
    4,204
    Blog Entries
    5
    The one that fails. And why don't you debug it and tell us which one that is?
    "Give a man a quote from the FAQ, and he'll ignore it. Print the FAQ, shove it up his ass, kick him in the balls, DDoS his ass and kick/ban him, and the point usually gets through eventually."

  14. #14
    hrmprog
    Guest
    when i use the above code with a unicode BGL, the output file contains no unicode letters, only english ones. i tried to convert an english to farsi BGL, but only the english words appear in the output file.

  15. #15
    Administrator dELTA's Avatar
    Join Date
    Oct 2000
    Location
    Ring -1
    Posts
    4,204
    Blog Entries
    5
    Windows uses special APIs to handle unicode strings, you must integrate these into the existing source code.
    "Give a man a quote from the FAQ, and he'll ignore it. Print the FAQ, shove it up his ass, kick him in the balls, DDoS his ass and kick/ban him, and the point usually gets through eventually."
