(Courtesy of fravia's advanced searching lores)

Adding engines to WebFerret
The guts of a search engines parser
by Laurent

(very slightly edited by fravia+)
published at searchlores in February 2001

An incredible deed: WebFerret's own updating protocol reversed. I have had the pleasure of meeting Laurent (a mighty PHP wizard) InRealLife, and indeed he 'has the reversing force', that's for sure. Read this text, which is but a part of an even more important project that he is developing on his own with remarkable speed and competence: an automated bot for searching (à-la-inference) various homepage providers. I note incidentally that the people at WebFerret should thank Laurent for the ready-made Google script :-)
And when you have finished reading this essay, re-read it. And then work on your own: you'll love the possibilities that this approach will open to you, and to your own searches...

Your comments and suggestions (and further reversing and disassembling) would be welcome.



Introduction

While gathering information for an ongoing project, I started to study and slightly reverse WebFerret, which seemed an interesting source of information and ideas for that project. Although the project wasn't aimed at improving WebFerret, I thought that the discoveries I made could be worth an essay of their own.
What actually interests me is how WebFerret manages the query building and the results parsing for the different search engines it supports. Indeed, given that WebFerret runs locally on your machine and that search engines often do (or at least may) change their page layout, there must be a way for WebFerret to keep up to date with the latest layouts. I can't imagine having to download a new version each time a slight change affects just one single search engine.
This made me think that the results page parsing algorithm cannot just be 'hardcoded' in WebFerret.

Investigations

A quick check of WebFerret's options shows that it has built-in support for proxies. That's a very interesting feature. Let's launch our favorite local proxy software (I had proxyplus at hand), tell WebFerret to connect via localhost:4480, run a simple WebFerret search session and ... oh oh, what's that? Here is the proxyplus log file:

01/28/2001:18:05:57 127.0.0.1 - HTTP "GET http://www.euroseek.net:80/query?iflang=uk&query=fravia&domain=world&lang=world&style=ferret HTTP/1.0" 200 254 254/256 MISS 212.209.54.40 D
01/28/2001:18:05:57 127.0.0.1 - HTTP "GET http://www.search.com:80/search?ferret=1&q=fravia HTTP/1.0" 200 279 279/177 MISS 216.200.247.146 D
01/28/2001:18:05:58 127.0.0.1 - HTTP "GET http://www.altavista.com:80/cgi-bin/query?pg=aq&stype=stext&Translate=on&q=fravia&r=fravia&stq=10 HTTP/1.0" 200 299 299/225 MISS 209.73.180.3 D
01/28/2001:18:05:58 127.0.0.1 - HTTP "GET http://findwhat.com:80/bin/findwhat.dll?getresults&mt=fravia&dc=40&aff_id=7114 HTTP/1.0" 200 128 128/206 MISS 216.216.246.30 D
01/28/2001:18:05:58 127.0.0.1 - HTTP "GET http://www.hotbot.com:80/?MT=fravia&SM=B&DV=0&LG=any&DC=50&DE=2&_v=2&OPs=MDRTP HTTP/1.0" 200 282 282/206 MISS 209.185.151.128 D
01/28/2001:18:05:58 127.0.0.1 - HTTP "GET http://search.excite.com:80/search.gw?s=fravia&c=web&start=0&showSummary=true HTTP/1.0" 200 297 297/205 MISS 199.172.148.11 D
01/28/2001:18:05:58 127.0.0.1 - HTTP "GET http://northernlight.com:80/nlquery.fcg?cb=0&qr=fravia&orl=2:1 HTTP/1.0" 200 544 544/231 MISS 216.34.102.230 D
01/28/2001:18:05:59 127.0.0.1 - HTTP "GET http://search.msn.com:80/results.asp?q=fravia HTTP/1.0" 200 184 184/173 MISS 207.46.185.99 D
01/28/2001:18:05:59 127.0.0.1 - HTTP "GET http://search.aol.com:80/dirsearch.adp?query=fravia&start=web HTTP/1.0" 200 208 208/189 MISS 205.188.180.25 D
01/28/2001:18:05:59 127.0.0.1 - HTTP "GET http://val.looksmart.com:80/r_search?comefrom=izu-val&look=x&isp=zu&key=fravia&search=0 HTTP/1.0" 200 244 244/215 MISS 207.138.42.25 D
01/28/2001:18:05:59 127.0.0.1 - HTTP "GET http://wwwp.goto.com:80/d/search/p/cnet/xml/?Keywords=fravia&maxCount=40 HTTP/1.0" 200 138 138/200 MISS 206.132.152.249 D
01/28/2001:18:06:00 127.0.0.1 - HTTP "GET http://search.icq.com:80/dirsearch.adp?query=fravia&wh=web&bm=0 HTTP/1.0" 200 208 208/191 MISS 205.188.180.249 D
01/28/2001:18:06:01 127.0.0.1 - HTTP "POST http://vorlon.ferretsoft.com:80/update HTTP/1.0" 200 291 127/291 MISS 206.103.246.239 D
01/28/2001:18:06:01 127.0.0.1 - HTTP "POST http://vorlon.ferretsoft.com:80/update HTTP/1.0" 200 291 2798/291 MISS 206.103.246.239 D
01/28/2001:18:06:02 127.0.0.1 - HTTP "GET http://findwhat.com:80/bin/findwhat.dll?getresults&mt=fravia&dc=40&aff_id=7114 HTTP/1.0" 200 281 281/206 MISS 216.216.246.30 D
01/28/2001:18:06:02 127.0.0.1 - HTTP "GET http://www.hotbot.com:80/?MT=fravia&SM=B&DV=0&LG=any&DC=50&DE=2&_v=2&OPs=MDRTP HTTP/1.0" 200 480 480/206 MISS 209.185.151.128 D
01/28/2001:18:06:03 127.0.0.1 - HTTP "GET http://northernlight.com:80/nlquery.fcg?cb=0&qr=fravia&orl=2:1 HTTP/1.0" 200 1026 1026/231 MISS 216.34.102.230 D
01/28/2001:18:06:03 127.0.0.1 - HTTP "GET http://www.search.com:80/search?ferret=1&q=fravia HTTP/1.0" 200 958 958/177 MISS 216.200.247.146 D
01/28/2001:18:06:03 127.0.0.1 - HTTP "GET http://val.looksmart.com:80/r_search?comefrom=izu-val&look=x&isp=zu&key=fravia&search=0 HTTP/1.0" 200 321 321/215 MISS 207.138.42.25 D
01/28/2001:18:06:04 127.0.0.1 - HTTP "GET http://wwwp.goto.com:80/d/search/p/cnet/xml/?Keywords=fravia&maxCount=40 HTTP/1.0" 200 293 293/200 MISS 206.132.152.249 D
01/28/2001:18:06:04 127.0.0.1 - HTTP "GET http://www.euroseek.net:80/query?iflang=uk&query=fravia&domain=world&lang=world&style=ferret HTTP/1.0" 200 3967 3967/256 MISS 212.209.54.40 D
01/28/2001:18:06:05 127.0.0.1 - HTTP "GET http://bcs.zdnet.com:80/ads/ferret-ad?RGROUP=504/BRAND=637/QT=%3Afravia HTTP/1.0" 200 388 388/331 MISS 205.181.112.84 D
01/28/2001:18:06:06 127.0.0.1 - HTTP "GET http://search.excite.com:80/search.gw?s=fravia&c=web&start=0&showSummary=true HTTP/1.0" 200 6729 6729/205 MISS 199.172.148.11 D
01/28/2001:18:06:06 127.0.0.1 - HTTP "GET http://www.webcrawler.com:80/cgi-bin/WebQuery?search=fravia&showSummary=true&src=wc_results HTTP/1.0" 200 351 351/327 MISS 198.3.99.101 D
01/28/2001:18:06:06 127.0.0.1 - HTTP "GET http://search.aol.com:80/dirsearch.adp?query=fravia&start=web HTTP/1.0" 200 4400 4400/189 MISS 205.188.180.25 D
01/28/2001:18:06:07 127.0.0.1 - HTTP "GET http://www.altavista.com:80/cgi-bin/query?pg=aq&kl=XX&r=fravia&search=Search&q=fravia&d0=&d1= HTTP/1.0" 200 5165 5165/288 MISS 209.73.180.3 D
01/28/2001:18:06:08 127.0.0.1 - HTTP "GET http://search.icq.com:80/dirsearch.adp?query=fravia&wh=web&bm=0 HTTP/1.0" 200 2888 2888/191 MISS 205.188.180.249 D
01/28/2001:18:06:08 127.0.0.1 - HTTP "GET http://search.msn.com:80/results.asp?q=fravia HTTP/1.0" 200 4733 4733/173 MISS 207.46.185.99 D
01/28/2001:18:06:09 127.0.0.1 - HTTP "GET http://www.euroseek.net:80/query?iflang=uk&query=fravia&domain=world&lang=world&style=ferret&of=10 HTTP/1.0" 200 254 254/262 MISS 212.209.54.40 D
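
With a dozen engines queried in parallel, a log like this grows quickly. What follows is only a minimal sketch, in Python, for summarizing it. It assumes the Proxy+ line layout shown above (a quoted "METHOD url HTTP/x" part); the names `summarize` and `non_get` are illustrative and not part of any tool mentioned in this essay.

```python
import re
from collections import Counter

# Matches the quoted request part of a log line in the layout shown above.
REQUEST = re.compile(
    r'"(?P<method>[A-Z]+) http://(?P<host>[^:/"]+)(?::\d+)?(?P<path>/[^ "]*) HTTP/'
)

def summarize(log_lines):
    """Count requests per (method, host, path without query string)."""
    seen = Counter()
    for line in log_lines:
        m = REQUEST.search(line)
        if m:
            path = m.group("path").split("?", 1)[0]
            seen[(m.group("method"), m.group("host"), path)] += 1
    return seen

def non_get(log_lines):
    """List the (method, host, path) triples that are not plain GETs."""
    return sorted(key for key in summarize(log_lines) if key[0] != "GET")

sample = [
    '01/28/2001:18:05:59 127.0.0.1 - HTTP "GET http://search.msn.com:80/results.asp?q=fravia HTTP/1.0" 200 184 184/173 MISS 207.46.185.99 D',
    '01/28/2001:18:06:01 127.0.0.1 - HTTP "POST http://vorlon.ferretsoft.com:80/update HTTP/1.0" 200 291 127/291 MISS 206.103.246.239 D',
]
print(non_get(sample))  # [('POST', 'vorlon.ferretsoft.com', '/update')]
```

Anything that is not a search query stands out at once in such a summary.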

See those two POST lines timestamped 18:06:01? An HTTP POST request to http://vorlon.ferretsoft.com/update. Could it be so easy? Let's point our browser to that page. Ahi! A "404 Not found" :-( It would have been too nice. Anyway, the proxyplus log tells us that the server reply was a 200 OK, so there must be something there. The trick is that the '/update' script returns a 404 to hide itself when it doesn't receive a valid request (which is btw a good 'protection' idea, imho).
So what? Give up? Certainly not! We won't be stopped by that, will we? I have somewhere in my little toolbox some http client/server code that should help. Ok, let's shape that server code to today's purpose. Don't forget to map vorlon.ferretsoft.com to 127.0.0.1 through our beloved HOSTS file, then run WebFerret again. BINGO!! Here is the actual POST request sent by WebFerret:
POST /update HTTP/1.0
Content-type: application/x-www-form-urlencoded
Content-length: 96
Pragma: no-cache
Accept: */*
Host: vorlon.ferretsoft.com
X-Forwarded-For: 127.0.0.1
Via: 1.0 Proxy+ (v2.30 http://www.proxyplus.cz)

SASF
FerretSoft YourName YourCountry YourCompany
Here comes the first discovery: WebFerret implements a malicious 'phone home' feature (cf. the "malwares" lab). It sends your name, country and company back home. I say malicious because this isn't needed at all!!
Ok, you have been warned. But the interesting things are elsewhere.
Between the 'SASF' and 'FerretSoft', some binary data is also being sent. Well, let's remember that and keep it for later. The complete request sent is available here
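
The capture step described above can be sketched in a few lines. This is not the author's own tool (the essay does not show it), only a minimal Python stand-in built on the standard `http.server` module. It assumes the HOSTS mapping is already in place, and the names `CaptureHandler`, `captured` and `make_server` are made up for the illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # one (path, headers, body) tuple per POST received

class CaptureHandler(BaseHTTPRequestHandler):
    """Records every POST and answers with an empty 200 OK."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)  # raw bytes, binary part included
        captured.append((self.path, list(self.headers.items()), body))
        print(self.requestline)
        print(body.hex())  # hex dump, since part of the body is not text
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default per-request log line

def make_server(port=80):
    """Build a server bound to localhost only; call serve_forever() on it."""
    return HTTPServer(("127.0.0.1", port), CaptureHandler)

# Usage (binding port 80 may need administrator rights):
#   make_server().serve_forever()
```

Keeping the body as bytes matters here, because decoding it as text would mangle the binary data sitting between 'SASF' and 'FerretSoft'.
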
Let's now shape our little client source code so it sends the exact same request to the actual vorlon server. Let's grab its answer and ... BINGO!! Look at this vorlon reply:
HTTP/1.0 200 OK                <-- 200 OK, hehe we could fake it!
Content-Length: 2672           <-- quite a lot of info here
Expires: Thu, 01 Dec 1994 16:00:00 GMT
Content-Type: image/gif        <-- uh? a gif? 
Pragma: no-cache

SASF     REGPATCH1.0000        <-- this + what's below clearly shows 
                                       this is a registry patch file
[Web]

"RegistryVersion"=number:120

"InstalledEngines"=strings:\
  "AltaVista",\
  "AOLNetFind",\
  "Anzwers",\
  "CNET",\
  "EuroSeek",\
  "Excite",\
  "FindWhat",\
  "GOTO",\
  "HotBot",\
  "ICQ",\
  "LookSmart",\
  "LycosUSA",\
  "MSN",\
  "SearchUK",\
  "WebCrawler"

"ActiveEngines"=numbers:\
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

"NorthName"=
"NorthHome"=
"NorthURL"=
"NorthMethod"=
"NorthQueryType"=
"NorthQueryOps"=
"NorthQueryCloseness"=
"NorthQueryCommand"=
"NorthGrammar"=
"SearchDelay"=number:3000
"ExciteQueryCommand"=string:\
  "#0; >xx; <urlcloseness; sx~[<null~;>urlname~|3~]; $WebFerret; >httpUser-Agent; $search=; <+urlquerytext; $+&c=web&start=0&showSummary=true&perPage=50; >urlquery"
"ExciteGrammar"=strings:\
  "R:<li>*.<a href=*.('http://[eh; tb; >url|*.]')*.\">[eh; tb; >title|*.]</a>*.size8>[eh; tb; >abstract|*.]</span>"
"FindWhatQueryCommand"=string:\
  "<urlcloseness; sx~[<null~;>urlname~|3~]; $WebFerret; >httpUser-Agent; <urlquerytext; ?,:%2C; >urlquerytext; $getresults&mt=; <+urlquerytext; $+&dc=40&aff_id=7114; >urlquery"
"GOTOQueryCommand"=string:\
  "<urlcloseness; sx~[<null~;>urlname~|3~]; $WebFerret; >httpUser-Agent; <urlquerytext; ?,:%2C; >urlquerytext; $Keywords=; <+urlquerytext; $+&maxCount=40; >urlquery"
"SearchUKName"=string:"SearchUK"
"SearchUKHome"=string:"http://www.searchuk.com/"
"SearchUKURL"=string:"http://uk.searchengine.com/cgi-bin/search"
......
......
 
I didn't paste the whole answer because it would make this essay unreadable. For those interested (and you had better be if you're going to build your bots on this :-) the whole reply is available here. You had better download that file and view it with a good editor, because your browser probably won't render it correctly.
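
Replaying a captured request byte for byte, as described above, needs nothing more than a plain socket. The sketch below is an assumption about how such a client could look, not the author's code. The helpers `replay` and `split_reply` are hypothetical names, and it relies on the HTTP/1.0 behaviour seen in the dumps, where the server closes the connection after its reply.

```python
import socket

def replay(raw_request, host, port=80, timeout=10.0):
    """Send raw_request unchanged and return the raw reply (headers + body)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(raw_request)
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # HTTP/1.0: end of reply is signalled by close
                break
            chunks.append(chunk)
    return b"".join(chunks)

def split_reply(raw_reply):
    """Split a raw reply into a list of header lines and the body bytes."""
    head, _, body = raw_reply.partition(b"\r\n\r\n")
    return head.decode("latin-1").split("\r\n"), body

# Usage, with a request previously saved to disk by a capture tool:
#   raw = open("request.bin", "rb").read()
#   headers, body = split_reply(replay(raw, "vorlon.ferretsoft.com"))
```

Working on raw bytes keeps the binary part of the request intact, and it also sidesteps the misleading image/gif content type of the reply.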

Ok, you have certainly guessed it by now: the whole lot is stored in the Windows registry. A quick search for 'ExciteGrammar' in the registry confirms it.
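
To experiment with the reply outside the registry, its text part can be read into a dictionary. The parser below is a sketch based only on what is visible in the excerpt above: sections in square brackets, the types string, strings, number and numbers, and a trailing backslash as line continuation. Treating a backslash inside quotes as an escape character is an assumption, and `parse_patch` is an invented name.

```python
import re

ENTRY = re.compile(r'^"(?P<key>[^"]+)"=(?:(?P<kind>[a-z]+):)?(?P<value>.*)$')
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

def logical_lines(text):
    """Join every line ending in a backslash with the line that follows it."""
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("\\"):
            pending += line[:-1]
        else:
            yield pending + line
            pending = ""
    if pending:
        yield pending

def unquote(value):
    """Return all quoted strings in value, with backslash escapes removed."""
    return [re.sub(r"\\(.)", r"\1", s) for s in QUOTED.findall(value)]

def convert(kind, value):
    if kind == "number":
        return int(value)
    if kind == "numbers":
        return [int(n) for n in value.split(",") if n.strip()]
    if kind == "strings":
        return unquote(value)
    if kind == "string" or (kind is None and value):
        parts = unquote(value)
        return parts[0] if parts else ""
    if not value:
        return None      # empty entries such as "NorthName"=
    return value         # unknown type: keep the raw text

def parse_patch(text):
    """Parse REGPATCH-style text into {section: {key: value}}."""
    result, section = {}, None
    for line in logical_lines(text):
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            result.setdefault(section, {})
            continue
        m = ENTRY.match(line)
        if m and section is not None:
            result[section][m.group("key")] = convert(m.group("kind"), m.group("value"))
    return result

# Abridged excerpt of the reply above (engine list shortened to two names).
sample = r'''[Web]
"RegistryVersion"=number:120
"InstalledEngines"=strings:\
  "AltaVista",\
  "WebCrawler"
"ActiveEngines"=numbers:\
  1,1
"NorthName"=
"SearchDelay"=number:3000
"SearchUKName"=string:"SearchUK"
'''
web = parse_patch(sample)["Web"]
print(web["InstalledEngines"], web["SearchDelay"])  # ['AltaVista', 'WebCrawler'] 3000
```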

So, what's left? Well, I spoke above about some binary data being sent along with your private details to the vorlon server. It becomes quickly apparent (especially when you compare that POST request with one sent by an old version -3.0200- of Webferret) that the version number, revision and patch level are included, respectively at offsets FE, FF/100 and 109 in those files. This allows the /update script to send back only the necessary updates to your current version of WebFerret. And this, as opposed to your Name, Company and Country, isn't malicious at all, quite the contrary.

Reversing

Well, this is exactly what I was looking for. In the registry I can find all the information WebFerret uses to build a query URL and to parse the results for each search engine it supports.
At first sight, it seems they use a mix of regular expressions and embedded scripts.
For example, take this: <a href=*.('http://[eh; tb; >url|*.]')*.\"> . It seems clear that this matches the result page against <a href=*.('http://[*.]')*."> and then assigns the content of [ ] to a url variable (>url), after some unknown 'eh; tb;' commands.
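
That reading of the grammar can be tried out by translating a rule into an ordinary regular expression. The sketch below is an interpretation, not WebFerret's actual algorithm. It assumes that `*.` means "any run of characters", that `[commands|pattern]` captures what the pattern matches into the variable named after `>`, and that a backslash escapes the next character. The unknown `eh` and `tb` commands are ignored, and the parentheses and quotes of the Excite rule are not handled. The sample page is invented.

```python
import re

def _convert(pattern):
    """Translate the grammar subset described above into regex source."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(re.escape(pattern[i + 1]))   # escaped literal
            i += 2
        elif pattern.startswith("*.", i):
            out.append(".*?")                       # shortest possible run
            i += 2
        elif ch == "[":
            end = pattern.index("]", i)
            commands, _, inner = pattern[i + 1:end].partition("|")
            names = [c.strip()[1:] for c in commands.split(";")
                     if c.strip().startswith(">")]
            body = _convert(inner)
            if names:
                out.append("(?P<%s>%s)" % (names[0], body))
            else:
                out.append("(?:%s)" % body)
            i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)

def grammar_to_regex(rule):
    """Split 'R:pattern' into its type letter and a compiled regex."""
    kind, _, pattern = rule.partition(":")
    return kind, re.compile(_convert(pattern), re.DOTALL)

rule = "R:<a href=[>url|*.]>[eh; tb; >title|*.]</a>"
page = '<li><a href=http://example.org/>An example title</a></li>'  # made-up page
kind, regex = grammar_to_regex(rule)
match = regex.search(page)
print(kind, match.group("url"), match.group("title"))
# R http://example.org/ An example title
```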

I'll skip my experiments (they were quite boring, much more than what you are actually reading, which is already passably boring) and deliver you my findings on a silver plate:

Besides the regular expressions, the scripts themselves use a sort of 'vsl' (very simple language) syntax. $ represents the working value, + means "append", - means "prepend". < means to get the value of a variable while > means to set the value of a variable. So the line "$search=; <+urlquerytext; $+&c=web&start=0&showSummary=true&perPage=50; >urlquery" is actually the following script (with the explanation next to each statement):
$search=;
    value <= 'search='
<+urlquerytext;
    add the value of the variable urlquerytext to value
    (if urlquerytext='fravia', value will be 'search=fravia')
$+&c=web&start=0&showSummary=true&perPage=50;
    add the given string to value
    (value will now be 'search=fravia&c=web&start=0&showSummary=true&perPage=50')
>urlquery
    assign the current value to a variable named 'urlquery'
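
The walk-through above is small enough to turn into a toy interpreter. The sketch below covers only the four operators explained in this essay ($, <, >, with the + and - modifiers). Other statements seen in the vorlon reply, such as `#0`, `sx~[...]` or `?,:%2C`, are deliberately rejected because their meaning is not documented here. The function name `run_query_script` is invented.

```python
def run_query_script(script, variables):
    """Run a query command and return a copy of the variables after it ran."""
    env = dict(variables)
    value = ""  # the working value
    for statement in script.split(";"):
        stmt = statement.strip()
        if not stmt:
            continue
        op, arg = stmt[0], stmt[1:]
        if op == "$":            # literal text
            if arg.startswith("+"):
                value += arg[1:]
            elif arg.startswith("-"):
                value = arg[1:] + value
            else:
                value = arg
        elif op == "<":          # read a variable
            if arg.startswith("+"):
                value += env.get(arg[1:], "")
            elif arg.startswith("-"):
                value = env.get(arg[1:], "") + value
            else:
                value = env.get(arg, "")
        elif op == ">":          # store the working value
            env[arg] = value
        else:
            raise ValueError("unsupported statement: %r" % stmt)
    return env

script = "$search=; <+urlquerytext; $+&c=web&start=0&showSummary=true&perPage=50; >urlquery"
result = run_query_script(script, {"urlquerytext": "fravia"})
print(result["urlquery"])
# search=fravia&c=web&start=0&showSummary=true&perPage=50
```

Note that the sketch does no URL-encoding of the query text; whether WebFerret does that before or inside the script is not shown in this essay.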

Although I figured out the meaning of most of the functions/syntax, I'm convinced there are many more juicy things to learn inside WebFerret itself (like functions that are implemented but not yet used for any search engine). Alas! My reversing capabilities don't go that far and I'm lost in the disassembly (especially when it comes to something written in C++ with classes and so on, which is the case for WebFerret). So, if any of you has already done that work or is going to investigate this further, I would love to hear about it, as this is actually what interests me the most (I suppose you have already guessed what I'm trying to do :-)

Practical application

Ok, this is the second discovery and probably what some of you were looking for: how to add more engines to WebFerret. Well, it should be quite easy if you followed me up to now: just write a little registry patch file.
As an example, we'll add Google to the list of engines supported by WebFerret.
Here is what should be added to the registry:

[HKEY_CURRENT_USER\Software\FerretSoft\NetFerret\CurrentVersion\Web]
"InstalledEngines"=strings:\
  "AltaVista",\
  "AOLNetFind",\
  "Anzwers",\
  "CNET",\
  "EuroSeek",\
  "Excite",\
  "FindWhat",\
  "GOTO",\
  "HotBot",\
  "ICQ",\
  "LookSmart",\
  "LycosUSA",\
  "MSN",\
  "SearchUK",\
  "WebCrawler",\
  "Google"
"ActiveEngines"=numbers:\
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
"GoogleName"="Google"
"GoogleURL"="http://www.google.com/search"
"GoogleHome"="http://www.google.com/"
"GoogleQueryType"="lip"
"GoogleMethod"=dword:00000000
"GoogleQueryCommand"="$WebFerret; >httpUser-Agent; $q=;<+urlquerytext; $+&lr=&safe=off&sa=N; >urlquery"
"GoogleQueryOps"=strings:" + "," OR "
"GoogleGrammar"=strings:\
 "R:<p><A HREF=[>url|*.]>[eh;tb;>title|*.]</A><font size=-1><br>[eh;tb;>abstract|*.]<font color=green>",\
 "S:<a href=/search\?[tb; >urlquery|*.]>",\
 "N:<b>Next</b>"
First, we have to add our new engine to the list of installed ones (drawback: see below). Next we define some new Google-specific values: its Name, URL, Home URL, Query type, request Method, Query command, Query operators and finally the parsing Grammar. I won't go into details; most of those values are self-explanatory. However, some still have unknown meanings to me. The QueryType, for example, can take values like lip, lpp, sa, sap... but I have no clue what they mean, so some experiments on this would be welcome. The Method indicates whether WebFerret must use a POST (00000001) or GET (00000000) request.
The problem here is that we can't merge that directly into the registry. Some types, like strings or numbers, first need to be converted. Either you do this by hand or you write a quick script to handle the task for you.
Anyway, once the conversion is done, you should end up with something like:
REGEDIT4
[HKEY_CURRENT_USER\Software\FerretSoft\NetFerret\CurrentVersion\Web]
"InstalledEngines"=hex(7):\
  41,6C,74,61,56,69,73,74,61,00,41,4F,4C,4E,65,74,46,69,6E,64,00,41,6E,7A,77,65,72,\
  73,00,43,4E,45,54,00,45,75,72,6F,53,65,65,6B,00,45,78,63,69,74,65,00,46,69,6E,64,\
  57,68,61,74,00,47,4F,54,4F,00,48,6F,74,42,6F,74,00,49,43,51,00,4C,6F,6F,6B,53,6D,\
  61,72,74,00,4C,79,63,6F,73,55,53,41,00,4D,53,4E,00,53,65,61,72,63,68,55,4B,00,57,\
  65,62,43,72,61,77,6C,65,72,00,47,6F,6F,67,6C,65,00,00
"ActiveEngines"=hex:01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,\
  00,01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,00,\
  01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,00,01,00,00,00
"GoogleName"="Google"
"GoogleURL"="http://www.google.com/search"
"GoogleHome"="http://www.google.com/"
"GoogleQueryType"="lip"
"GoogleQueryOps"=hex(7):20,2B,20,00,20,4F,52,20,00
"GoogleQueryCommand"="$WebFerret; >httpUser-Agent; $q=;<+urlquerytext; $+&lr=&safe=off&sa=N; >urlquery"
"GoogleGrammar"=hex(7):\
  52,3A,3C,70,3E,3C,41,20,48,52,45,46,3D,5B,3E,75,72,6C,7C,2A,2E,5D,3E,5B,65,68,3B,\
  74,62,3B,3E,74,69,74,6C,65,7C,2A,2E,5D,3C,2F,41,3E,3C,66,6F,6E,74,20,73,69,7A,65,\
  3D,2D,31,3E,3C,62,72,3E,5B,65,68,3B,74,62,3B,3E,61,62,73,74,72,61,63,74,7C,2A,2E,\
  5D,3C,66,6F,6E,74,20,63,6F,6C,6F,72,3D,67,72,65,65,6E,3E,00,\
  53,3A,3C,61,20,68,72,65,66,3D,2F,73,65,61,72,63,68,3F,5B,74,62,3B,20,3E,75,72,6C,\
  71,75,65,72,79,7C,2A,2E,5D,3E,00,\
  4E,3A,3C,62,3E,4E,65,78,74,3C,2F,62,3E,00,00
"GoogleMethod"=dword:00000000
Save this registry patch under whatever name you fancy and it's ready to be merged into the registry. For your convenience, this file is available here
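
The conversion from the readable 'strings:' and 'numbers:' lists to the hex form used in a REGEDIT4 file can be scripted. The sketch below assumes what the dump above suggests: a string list becomes hex(7) data with one zero byte after each string and one more at the very end, and a number list becomes hex data made of 32-bit little-endian values. The function names and the 25-bytes-per-line width are arbitrary choices.

```python
import struct

def _hex_lines(prefix, data, width=25):
    """Format bytes as comma-separated hex, using .reg line continuations."""
    pairs = ["%02X" % b for b in data]
    rows = [",".join(pairs[i:i + width]) for i in range(0, len(pairs), width)]
    return prefix + ",\\\n  ".join(rows)

def multi_sz(strings):
    """Encode a list of strings as a hex(7) value (single-byte characters)."""
    data = b"".join(s.encode("latin-1") + b"\x00" for s in strings) + b"\x00"
    return _hex_lines("hex(7):", data)

def dword_array(numbers):
    """Encode a list of numbers as hex data, four bytes each, low byte first."""
    data = b"".join(struct.pack("<I", n) for n in numbers)
    return _hex_lines("hex:", data)

print('"InstalledEngines"=' + multi_sz(["Google"]))
# "InstalledEngines"=hex(7):47,6F,6F,67,6C,65,00,00
print('"ActiveEngines"=' + dword_array([1, 1]))
# "ActiveEngines"=hex:01,00,00,00,01,00,00,00
```

A quick sanity check on any hand-made patch: the ActiveEngines data should hold exactly four bytes per name in InstalledEngines.
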
Now the drawback. The problem is that the list of installed engines is stored in a single registry value. That means that whenever a new update is retrieved from WebFerret's home server, our modified list will be overwritten and all our new engines will be lost. One solution is simply to prevent WebFerret from retrieving any update information by pointing the update server to 127.0.0.1 in your hosts file. This, however, will cause trouble when some engine requires an updated grammar or whatever else. I'll leave this problem to you. There are certainly different possible solutions. You could, for example, write your own little proggie that checks from time to time whether a new update is available, or you could re-apply your patched new search engines whenever you notice that an automated vorlon update has occurred. As my primary goal wasn't to use WebFerret as an actual tool but more as a source of inspiration, I didn't go any further in this direction.
Note also that if you examine the registry you may find some other things that can help you fine-tune WebFerret to your requirements :-)
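
For the "re-apply after an update" route, the core of such a proggie is a merge of the two lists. The function below is a sketch of that one step only. Reading the values from the registry and writing them back is left out, and `merge_engines` is a hypothetical name.

```python
def merge_engines(installed, active, extra):
    """Return new (installed, active) lists with missing engines appended.

    Engines already present keep their current on/off flag; engines that
    are added get flag 1 (enabled).
    """
    installed, active = list(installed), list(active)
    for name in extra:
        if name not in installed:
            installed.append(name)
            active.append(1)
    return installed, active

installed, active = merge_engines(["AltaVista", "WebCrawler"], [1, 0], ["Google"])
print(installed, active)  # ['AltaVista', 'WebCrawler', 'Google'] [1, 0, 1]
```

Running the merge a second time changes nothing, so it can safely be applied after every update.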

Conclusion

First let me be clear: I'm not stating that you should use WebFerret, nor that adding a new engine to WebFerret is something really worth doing per se. I personally never used WebFerret before, nor will I probably ever use it in the future. The purpose of this essay was simply to show you, first, that even without any software reversing knowledge you can tweak software to do what you want it to do. Second, I tried to show you that there is a lot to learn by studying some interesting targets. If I hadn't studied WebFerret I would probably still be trying to figure out how to write a universal parsing script. WebFerret gave me much inspiration on this topic.
I can now apply what I have learned in this context to what was my original primary target: writing a sort of universal parser. I now know that some regular expressions plus some 'very simple language' scripts could be very helpful. If everything goes fine, I could end up with something worth publishing again very soon. So stay tuned :-)
As always, but here more than ever, feedback, criticism and suggestions on this topic are really welcome. You can reach me at phplab@2113.ch.
Thank you for reading this essay, I hope it was worth it.

(c) Laurent 2001


(c) 1952-2032: [fravia+], all rights reserved