
Search Engines Anti-Optimization

Get your own stop words!

Seeing beyond the surface.  

Published @ Searchlores in January 2003
Version 0.04 June 2007 | By Nemo

It's a tough world! Users simply HAVE TO defend themselves in a world where some commercial minions are allowed to write that "Twenty out of the 30 links Google is presenting on each page is not earning them money. That's an ad break of only 33%"... as if Google would continue to predominate among search engines by selling more crap, poor idiot.


Introduction | Our quarry | Word statistics | Stop words | Examples | Conclusion

Introduction

Searchers usually have a fairly good idea of what they want to find, but words often have several meanings or are used in several contexts. Spammers and SEOs take advantage of that to push their crap by building pages tangentially related to your search terms, hoping that you will click on the ads crammed onto those pages. Of course we can exclude the main keywords pertaining to the contexts we do not want, although that is easier said than done.

The purpose of this essay is to give you the means of finding those excluding keywords in a fairly easy way, so that you can get rid of the spammed results on your own, without any need to trust the anti-spamming algos used by the search engines themselves. The idea is to reverse the SEO-spammers' approach and build a list of the most common terms appearing in the spammed search results, with their relative frequency, so that you can spot at once the most spammed, and hence the most unwanted, keywords for your query.

You can do just that at the Seekers' Oracle, which builds this list by using web or image search APIs (for each image, Yahoo offers a snippet of text around the image). For those interested in knowing how the tool was made, see my essay JSON for the masses; for the others, just use it.
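For readers who prefer to see the idea in code, here is a minimal sketch of the counting step in Python. It is not the Seekers' Oracle itself: the function name `term_frequencies`, the tiny `COMMON_WORDS` list and the sample snippets are all made up for illustration, and a real run would feed in the snippets returned by a search API.

```python
import re
from collections import Counter

# Hypothetical mini list of function words; a real one would be much longer.
COMMON_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
                "for", "on", "with", "your", "our", "is"}

def term_frequencies(snippets, query_terms=(), min_count=1):
    """Count how often each word appears across result snippets.

    The query terms themselves and common function words are skipped,
    so only the words clinging to the query remain.
    """
    skip = COMMON_WORDS | {t.lower() for t in query_terms}
    counts = Counter()
    for snippet in snippets:
        for word in re.findall(r"[a-z]+", snippet.lower()):
            if len(word) > 2 and word not in skip:
                counts[word] += 1
    # Most frequent first, keeping only words seen at least min_count times.
    return [(w, n) for w, n in counts.most_common() if n >= min_count]

# Made-up snippets standing in for what a search API would return.
snippets = [
    "Cheap web hosting and web design services for your business",
    "Web hosting reviews: compare hosting prices",
    "Affordable design services and hosting solutions",
]

for word, n in term_frequencies(snippets, query_terms=["web"], min_count=2):
    print(word, n)
```

The `min_count` threshold plays the same role as only listing words that appear a minimum number of times.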

Our quarry

I guess that you are reading this essay because you want to learn web search strategies. To get better results, it helps to expand our query by adding some terms. You can do it by yourself, by using our tool to get term frequency tables, or by using online tools to get synonyms.

Given that, our query can be expanded to the following one:

(search OR searching OR seek OR seeking) AND (web OR internet OR document OR documents OR file OR files OR webpage OR webpages OR "web page" OR "web pages") AND (tips OR hints OR strategies)
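Writing such a query by hand gets tedious as the synonym lists grow. The following Python sketch assembles it from groups of alternatives; the helper names `or_group` and `expanded_query` are mine, and the AND/OR syntax is simply the one used in the query above, which not every engine accepts.

```python
def or_group(terms):
    """Join alternatives with OR, quoting multi-word phrases."""
    quoted = ['"%s"' % t if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def expanded_query(groups):
    """AND together one OR-group per concept."""
    return " AND ".join(or_group(g) for g in groups)

groups = [
    ["search", "searching", "seek", "seeking"],
    ["web", "internet", "document", "documents", "file", "files",
     "webpage", "webpages", "web page", "web pages"],
    ["tips", "hints", "strategies"],
]

print(expanded_query(groups))
```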

In the good ol' days, before AltaVista's demise, it was easy to refine this query, because we also have a fairly good idea of search term proximity. For grammatical reasons, (search, searching, seek, seeking) and (web, internet, document, documents, file, files, webpage, webpages, web page, web pages) should be quite close to one another, let's say at most 2 or 3 words apart, whereas the context terms (tips, hints, strategies) should gravitate in the neighborhood of the previous two sets, let's say at most 50 words away. Thanks to AltaVista's flawless support of boolean operators, a distributive NEAR operator and variable-size proximity search, it was possible to use the following query:

(search OR searching OR seek OR seeking) within 3 (web OR internet OR document OR documents OR file OR files OR webpage OR webpages OR "web page" OR "web pages") within 50 (tips OR hints OR strategies)
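To make the proximity idea concrete, here is a rough Python sketch that applies the same test to a text you already have on disk. It is only an approximation under my own assumptions: distance is measured as the difference between word positions (how AltaVista counted is not stated here), multi-word phrases such as "web page" are ignored, and the function names are hypothetical.

```python
import re

def positions(words, terms):
    """Indexes in `words` where any of the single-word terms occurs."""
    wanted = {t.lower() for t in terms}
    return [i for i, w in enumerate(words) if w in wanted]

def within(pos_a, pos_b, max_distance):
    """Pairs (i, j), i from pos_a and j from pos_b, at most max_distance apart."""
    return [(i, j) for i in pos_a for j in pos_b if abs(i - j) <= max_distance]

def matches_cloud(text, group_a, group_b, group_c, near=3, context=50):
    """True if an A term sits within `near` words of a B term, and a
    C term sits within `context` words of that pair."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = within(positions(words, group_a), positions(words, group_b), near)
    pos_c = positions(words, group_c)
    return any(abs(i - k) <= context or abs(j - k) <= context
               for i, j in pairs for k in pos_c)

group_a = ["search", "searching", "seek", "seeking"]
group_b = ["web", "internet", "document", "documents", "file", "files",
           "webpage", "webpages"]
group_c = ["tips", "hints", "strategies"]

print(matches_cloud("Some hints for searching the web more effectively",
                    group_a, group_b, group_c))
print(matches_cloud("Search our store for web hosting",
                    group_a, group_b, group_c))
```

The first sample text satisfies both constraints; the second fails because "search" and "web" are four words apart and no context term is present.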

Nowadays Exalead (variable-size proximity search) and Yahoo (fixed(?)-size proximity search) come closest, but they still have flaws concerning the distributive NEAR operator or variable-size proximity search... so there's no way of using the cloud search strategy. That is quite unfortunate, because nowadays search results are polluted by big documents which, by chance or design, contain our search terms. The best we can do is exclude terms.

Word statistics

We start our analysis by spotting the most troublesome terms for our expanded query: we feed each search term, one by one, into our tool to get the most unwanted terms clinging to it, with their respective frequencies. I only show words appearing ten or more times:

search:
services 61, resources 52, business 43, real 41, jobs 40, contact 38, estate 37, offers 34, products 34, marketing 33, job 32, service 32, optimization 28, product 25, reviews 21, property 20, travel 20, health 19, sale 18, homes 17, buy 14, career 14, businesses 13, medical 13, properties 13, bible 12, hosting 12, seo 11, solutions 11, storage 11, store 11, catholic 10, deals 10, employment 10, legal 10, shopping 10
seek:
god 85, reviews 71, game 57, games 50, business 41, job 41, play 40, product 34, jobs 33, products 33, services 31, review 30, shop 30, buy 29, bible 28, cheats 25, prices 25, classifieds 23, compare 22, health 22, shopping 22, solutions 22, treatment 21, christian 20, employment 20, counseling 18, marketing 18, price 18, store 18, shipping 17, stock 17, playstation 16, sports 16, contact 15, medical 15, careers 14, toys 14, jesus 13, church 12, lord 12, career 11, christ 10, design 10, discount 10
web:
hosting 477, design 468, services 221, development 204, offers 118, business 81, marketing 69, solutions 61, service 59, products 53, developer 43, designers 42, designing 40, reviews 33, developers 32, offering 27, providing 27, seo 27, designer 23, reservations 23, businesses 22, optimization 21, hotel 20, health 18, designed 17, shopping 17, store 17, hosts 16, hotels 16, solution 16, advertising 14, designs 14, leading 14, price 14, consulting 13, prices 13, travel 13, airlines 12, games 12, shop 12, medical 10, reviewed 10
internet:
marketing 150, services 132, business 118, service 117, offers 61, hotel 59, hosting 58, design 54, products 47, buy 43, product 43, solutions 43, shop 40, law 38, reviews 33, development 32, advertising 28, store 27, prices 25, games 21, hotels 21, jobs 20, offering 20, discount 18, inn 18, order 18, leading 17, shopping 16, consulting 15, seo 15, legal 13, optimization 13, shipping 13, businesses 12, compare 12, contact 12, promotion 12, deals 11, developer 11, price 11, travel 11, cost 10, dating 10, health 10, sales 10
document:
services 167, solutions 128, products 82, business 62, legal 56, product 50, service 45, travel 36, delivery 34, shop 32, solution 32, buy 27, design 27, law 27, review 22, job 15, sales 15, shopping 15, compare 14, hosting 14, development 13, items 12, purchase 12, jobs 12, marketing 11, reviews 11, stores 11, developer 10, order 10
file:
product 67, products 67, furniture 64, shop 59, buy 44, services 41, store 38, nail 33, service 28, prices 26, hosts 25, business 24, reviews 22, solutions 22, compare 20, legal 20, order 20, shopping 18, shipping 14, hosting 13, purchase 13, offer 11, design 10, games 10
webpage:
products 93, services 45, health 38, product 38, offers 34, design 32, service 31, business 30, solutions 30, game 29, games 26, hotel 26, court 25, sports 24, store 24, dispute 23, development 21, reviews 18, designer 15, hosting 14, contact 13, mediation 13, resort 13, leading 12
tips:
travel 83, products 62, buy 50, health 49, design 47, business 46, shop 44, product 41, offers 40, services 36, marketing 33, buying 31, reviews 31, contact 27, service 26, recipes 24, fitness 23, shipping 23, career 22, diet 22, prices 22, dating 21, gardening 20, medical 20, mortgage 20, purchase 19, sales 19, adsense 18, sports 18, store 18, compare 17, discount 17, loans 17, order 17, job 16, loan 16, shopping 14, furniture 12, game 12, refinance 12, solutions 12, equity 11, estate 11
hints:
cheats 304, games 173, game 162, recipes 50, reviews 50, playstation 49, xbox 48, health 46, cooking 43, baking 27, kitchen 27, product 24, products 22, design 20, nintendo 20, service 18, shop 18, solutions 18, gaming 17, contact 16, review 16, buy 15, prices 15, recipe 15, job 14, shopping 14, solution 13, sports 13, store 13, business 12, garden 11, optimization 11, services 11, order 10
strategies:
business 147, marketing 126, services 71, health 46, design 43, consulting 39, solutions 37, game 33, career 30, jobs 24, offers 24, products 24, job 23, games 21, service 19, contact 18, prices 17, developing 16, leading 16, estate 15, poker 15, product 15, real 15, sales 15, shopping 14, review 13, advertising 12, compare 12, shop 12, reviews 12, price 11, buy 10, careers 10

Let's organize these stopwords by subject to get a better grasp of our enemies:

search:
seek:
web:
internet:
document:
file:
webpage:
tips:
hints:
strategies:

It's interesting to compare the idiosyncrasies of each keyword... words are not created equal, even synonyms! Let's squeeze all this information and get the list of our own stopwords.
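One possible way of doing the squeezing mechanically is sketched below in Python. The split rule is my own assumption, not the hand-picking used for this essay: a word counts as a main offender if it shows up under at least `min_spread` different query terms, and as a query-specific offender otherwise. The `tables` dictionary holds only the top three entries of four of the tables above, to keep the sketch short.

```python
from collections import defaultdict

# Excerpt of the frequency tables above (top three entries only).
tables = {
    "search": {"services": 61, "resources": 52, "business": 43},
    "web": {"hosting": 477, "design": 468, "services": 221},
    "hints": {"cheats": 304, "games": 173, "game": 162},
    "strategies": {"business": 147, "marketing": 126, "services": 71},
}

def rank_offenders(tables, min_spread=2):
    """Split unwanted words into main and query-specific offenders.

    total[w]  = summed frequency of w over all query terms
    spread[w] = number of query terms under which w appears
    """
    total = defaultdict(int)
    spread = defaultdict(int)
    for freq in tables.values():
        for word, n in freq.items():
            total[word] += n
            spread[word] += 1

    def by_total(word):
        # Highest total first; alphabetical order breaks ties.
        return (-total[word], word)

    main = sorted((w for w in total if spread[w] >= min_spread), key=by_total)
    specific = sorted((w for w in total if spread[w] < min_spread), key=by_total)
    return main, specific

main, specific = rank_offenders(tables)
print("main offenders:", main)
print("query-specific offenders:", specific)
```

Ranking by total frequency is a crude proxy for "biggest bang"; it says nothing about collateral damage, which still has to be judged by eye.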

Stop words

With the exception of Exalead, all other search engines only allow a more or less limited number of keywords per query, so our list is sorted and split into main offenders and quarry-specific offenders. For each category I have chosen the keywords which offered the biggest bang with the least amount of gunpowder and collateral damage.

Main offenders
Quarry specific offenders

With it we can do a deep cleaning on our search results, as we are going to see in the following examples.
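As a small illustration of fitting the exclusions into an engine's keyword limit, the Python sketch below appends only the top offenders to a query. The `-term` exclusion syntax and the `max_terms` limit are assumptions for the example; check what each engine actually accepts. The function name and the sample query are hypothetical.

```python
def exclusion_query(base_query, offenders, max_terms=10):
    """Append the first max_terms offenders to the query as -term exclusions.

    `offenders` is expected to be sorted with the worst offenders first.
    """
    excluded = ["-" + word for word in offenders[:max_terms]]
    return " ".join([base_query] + excluded)

offenders = ["hosting", "design", "services", "marketing"]
print(exclusion_query('"web search" tips', offenders, max_terms=3))
```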

Examples

Let's see some search queries pertaining to web search:

Yahoo
Google
Ask
Exalead
Live search
Gigablast

It is remarkable how full of crap the web is: more than 99% of the search results are weeded out by excluding these troublesome keywords. It is worth mentioning that these offenders are quite ubiquitous, and excluding them in other search queries often cuts the mustard too. Let's see an example from some research I did a long time ago concerning The History of French Louisiana, where some keywords were not excluded, as they might cause significant collateral damage (buy, price, travel, god, church, legal, law):

Yahoo
Google
Exalead
Ask
Live Search
Gigablast

Although we have not used our tool to get the query-specific offenders (genealogy, for instance), the previous set is sufficiently powerful to wipe out more than 99% of commercial crap.

Conclusion

Ironically, the SEO-webmasters' obsession with optimizing their webpages, using every possible keyword associated with their content, simplifies our task: avoiding them.

(c) Nemo 2003 2007    nemo vitam meam regit@yahoo.com    replace white spaces by underscores.




(c) III Millennium: [fravia+], all rights reserved, reversed, reviled and revealed