May 15, 2005

Rewriting Requests from Google

Setting up my website to refuse any links from Google and serve "" in answer to anyone who clicks a link to my website on Google was rather difficult for me. It took me about a day to figure out, since I needed to study how the Apache web server's "rewrite engine" handles this kind of thing, and I had no idea what "http_referer" exactly means and does.

As a result I included this code in the ".htaccess" file in my web home directory:

RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} ^ [OR]
RewriteCond %{HTTP_REFERER} ^ [OR]
RewriteCond %{HTTP_REFERER} ^
RewriteRule /* [R,L]

A few comments to explain this. The whole point of this code is to take a request from someone clicking on a Google link to (for example) and "rewrite" this request as "".

I realize that this is not very friendly to a person clicking on the link, and I apologize for that. The alternatives would have been to shut out all traffic from Google as "forbidden" with a "403" error page, or to direct visitors to a page on my server explaining what I am doing here.

I decided to go with "" since I think that is the alternative Google would find least pleasing. This does leave some users confused for a short time. However, I expect any Google links to my site to disappear quickly, so this should not be a lasting problem for users. That leaves the symbolic value of pointing to Google's strongest competitor as a decisive advantage over the other alternatives.

The first line "RewriteEngine On" instructs the Apache web server to turn rewriting on, since it is off by default.

The next line "RewriteCond %{HTTP_REFERER} !^$" starts setting conditions. This particular line says that no rewriting should be done if there is no "http_referer" data in the request. That happens with some firewall or browser settings. For example, users can easily turn off sending the "http_referer" data in Firefox by typing "about:config" into the address bar and changing "network.http.sendRefererHeader" from its default of "2" to "0".

The next three lines tell the rewrite engine to check whether the request comes from a click on a link on any Google site in Germany, Japan or the U.S. The "[OR]" flag joins these three conditions, so a match on any one of them is enough.

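If one of those conditions is true, the last line gives the instruction to rewrite the request, returning "" instead. The "R" flag makes this an external redirect sent back to the browser, and the "L" flag tells the rewrite engine to stop processing any further rules.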
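
To make the structure described above easier to follow, here is a minimal sketch of such a rule set. The host names are hypothetical placeholders and are assumptions for illustration only: "search-one.example", "search-two.example" and "search-three.example" stand in for the three referring sites, and "other.example" stands in for the redirect target. The "NC" flag (case-insensitive matching) is also an addition that the code above does not use.

```apache
# Sketch only: all host names below are hypothetical placeholders.
RewriteEngine On

# Do nothing when the request carries no Referer header at all.
RewriteCond %{HTTP_REFERER} !^$

# Continue if the Referer starts with any one of these hosts.
# [OR] joins a condition to the one that follows it.
RewriteCond %{HTTP_REFERER} ^https?://www\.search-one\.example/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://www\.search-two\.example/ [NC,OR]
RewriteCond %{HTTP_REFERER} ^https?://www\.search-three\.example/ [NC]

# Match every request path and send an external redirect (R),
# then stop processing further rules (L).
RewriteRule ^ http://other.example/ [R,L]
```

Taken together, the conditions read as "the referer is not empty AND (it matches the first host OR the second OR the third)". Without other arguments the "R" flag sends a temporary (302) redirect.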

I have tested the setup by clicking on links to my site on all of the Google sites above (just entering a direct request into the address bar would not work as a test, since in that case no "http_referer" data is sent). In all cases I got "", so it seems to be working fine at the moment.

Update: In comments at someone asked me to go with the "403" alternative instead. I followed that request for the time being and will go back to the symbolic forwarding only after some time has passed and Google has removed their links to me, so that no users are confused by this measure.
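
A "403" variant could look roughly like the following sketch. Again, "search-one.example" is a hypothetical placeholder host and not the actual pattern used on this site. The "F" flag makes Apache answer with "403 Forbidden", and the "-" means that the requested URL itself is left unchanged.

```apache
# Sketch only: the host name is a hypothetical placeholder.
RewriteEngine On

# Apply the rule only when the Referer starts with the unwanted host.
RewriteCond %{HTTP_REFERER} ^https?://www\.search-one\.example/ [NC]

# "-" means no substitution; F returns 403 Forbidden and ends processing.
RewriteRule ^ - [F]
```

More hosts can be covered by adding further "RewriteCond" lines joined with "[OR]", in the same way as in the redirect version.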

Update 2: I have discontinued blocking or redirecting traffic from Google links to this site.

When I started this two days ago, I thought that by now all links to my site in Google would be gone, so any redirection or blocking would be only symbolic.

However, in the meantime I learned from Nathan Weinberg at InsideGoogle that Google keeps links to pages in their index and search results even if their robot does not crawl those pages.

I don't want to confuse and annoy users permanently. Therefore I have pulled the blocking script from my .htaccess file.

Posted by Karl-Friedrich Lenz at May 15, 2005 11:32 AM