Wednesday, February 4, 2009

Detecting web-crawlers using cookies

This is how I figured out how to split incoming traffic into three branches:
1) cache: web crawlers
2) browse: human users that just quickly browse the pages
3) active: human users that actively use the website

Let ActiveKey and BrowserKey be two fixed arbitrary strings.

This is the algorithm:
1. Place a bit of javascript in the header of every page. This javascript does the following (a sketch follows this list):
1.1 check whether the browser has a cookie named SID
1.1.1 if it doesn't, set one via javascript to the value BrowserKey+RandomNumber, and
1.1.2 reload the page with a javascript window.location.reload() call
1.2 if it does have one, just leave the page as is
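
A minimal sketch of that header script, assuming BrowserKey is embedded into the page by the server (the key value and the readCookie helper are illustrative, not part of the original recipe):

// Runs in the header of every page.
var BrowserKey = "BK"; // assumed to be written into the page by the server

function readCookie(name) {
    var parts = document.cookie.split("; ");
    for (var i = 0; i < parts.length; i++) {
        var pair = parts[i].split("=");
        if (pair[0] === name) return pair[1];
    }
    return null;
}

if (readCookie("SID") === null) {
    // No SID yet: either a crawler (which never runs this code) or a
    // human's first page view. Mark it as a browsing human and reload so
    // the server sees the cookie on the very next request.
    document.cookie = "SID=" + BrowserKey + Math.floor(Math.random() * 1e9) + "; path=/";
    window.location.reload();
}
// If the SID cookie already exists, do nothing and leave the page as is.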

This separates web crawlers from human users: web crawlers are all the clients that don't send an SID cookie whose value contains the substring BrowserKey (crawlers don't execute the javascript, so they never set one).
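
In server-side terms the crawler test is a check on the incoming Cookie header. A minimal sketch (the parsing and the key value are illustrative; after step 2 below, an SID containing ActiveKey should also count as human):

// True if the request looks like a crawler: no SID cookie at all, or an
// SID whose value does not contain BrowserKey.
function looksLikeCrawler(cookieHeader) {
    var BrowserKey = "BK"; // must match the key embedded in the pages
    var match = cookieHeader && /(?:^|;\s*)SID=([^;]*)/.exec(cookieHeader);
    return !match || match[1].indexOf(BrowserKey) === -1;
}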

2. If a user clicks a button and starts to use the website as an application (for example, puts products into a shopping cart), then every such button handler goes through the same piece of code that changes the state (sketched below):
2.1 Change the value of the SID cookie to a string of the form ActiveKey+RandomNumber
2.2 Any ajax request or page navigation will now send this cookie automatically
This step separates users in categories (2) and (3).
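
A sketch of that shared piece of code, reusing the readCookie helper from the header script (the function names and the cart action are illustrative):

var ActiveKey = "AK"; // assumed to be embedded by the server, like BrowserKey

// Called from every button handler that changes server-side state.
// After this runs, every ajax request and page navigation carries the
// new SID cookie automatically.
function markUserActive() {
    var sid = readCookie("SID");
    if (sid === null || sid.indexOf(ActiveKey) !== 0) {
        document.cookie = "SID=" + ActiveKey + Math.floor(Math.random() * 1e9) + "; path=/";
    }
}

function addToCart(productId) {
    markUserActive();
    // ... then issue the actual ajax request that adds productId to the cart ...
}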

What is the point? The point is that web crawlers can be given older, cached content. A site with many pages can serve the bots from pregenerated cache files, giving fast responses; the crawlers can even be served by another server altogether.

Also, users that just want a fast browsing experience can be served by a separate server, and so on.
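
Putting it together, a front end can classify each request from the Cookie header alone and route it to a matching backend. A sketch of that three-way split (a plain function; the key values and routing targets are placeholders):

// Classify a request as "cache", "browse" or "active" from its Cookie header.
function classifyRequest(cookieHeader) {
    var BrowserKey = "BK", ActiveKey = "AK";
    var match = cookieHeader && /(?:^|;\s*)SID=([^;]*)/.exec(cookieHeader);
    if (!match) return "cache";                               // crawler: never ran the javascript
    if (match[1].indexOf(ActiveKey) !== -1) return "active";  // uses the site as an application
    if (match[1].indexOf(BrowserKey) !== -1) return "browse"; // just clicks through pages
    return "cache";                                           // unknown SID: treat like a crawler
}

// Possible routing:
//   "cache"  -> pregenerated static files, or a separate cache server
//   "browse" -> a fast, read-mostly server
//   "active" -> the full application server with sessions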