I found something interesting on the nginx mailing list today: http://forum.nginx.org/read.php?2,202715,202715#msg-202715 . Someone asked whether it is possible to block fake user agents, such as bots pretending to be Googlebot. Sometimes a lot of bots flood our servers disguised as Googlebot or some other legitimate bot, most likely to scrape our websites' content. The real Googlebot always connects from an IP address owned by Google Inc. Many website owners complain that these bad bots do nothing but drain their bandwidth.
The first option is to use the “if” directive in nginx.
if ($http_user_agent ~* "Google Bot") { allow 66.x; allow 70.x; deny all; }
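However, the “if” directive is considered bad practice when used for anything other than “return” or “rewrite”. nginx also does not accept “allow” and “deny” inside an “if” block, so the snippet above only sketches the idea. Here’s an example from Igor Sysoev: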
    geo $not_google {
        default     1;
        66.0.0.0/8  0;
    }

    map $http_user_agent $bots {
        default      0;
        ~(?i)google  $not_google;
    }

    server {
        location / {
            if ($bots) {
                return 403;
            }
        }
    }

Here’s how it works:
geo $not_google { default 1; 66.0.0.0/8 0; }
The geo directive maps the client IP address ($remote_addr) to a value (0 or 1) according to the CIDR range it falls in, and stores the result in the variable $not_google. A fake Googlebot connecting from outside the listed range gets the default value 1, while the real Googlebot matches 66.0.0.0/8 and gets 0.
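As a rough sketch of how the same geo block could carry more than one range: the 66.0.0.0/8 entry is the one from the example, while 192.0.2.0/24 is only a placeholder (a reserved documentation range) standing in for whatever verified crawler range you want to add. Keep in mind that 66.0.0.0/8 is a very broad block, so a real setup would probably want narrower ranges.

```nginx
# Goes in the http context. geo looks at $remote_addr by default.
geo $not_google {
    default        1;   # any address not listed below: treated as "not Google"
    66.0.0.0/8     0;   # range from the example above (very broad)
    192.0.2.0/24   0;   # placeholder only, replace with a range you have verified
}
```

When several ranges overlap, geo uses the most specific match, so a narrow range can be listed alongside a broader one.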
map $http_user_agent $bots { default 0; ~(?i)google $not_google; }
When the user agent string contains “google” (matched case-insensitively), $bots takes the value of $not_google. So for a fake Googlebot $bots contains 1, and for the real Googlebot it contains 0. Any user agent string that does not contain the word “google” gets the default value 0.
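The combinations can be written out as a small table. The sketch below repeats the map block with that table as comments; it uses the “~*” prefix, which (as far as I know, from nginx 1.0.4 on) is the shorter way to write a case-insensitive regex in map and should behave like “~(?i)”.

```nginx
# User agent has "google"? | IP in 66.0.0.0/8? | $not_google | $bots
# -------------------------+-------------------+-------------+------
# yes                      | yes               | 0           | 0   (real bot, allowed)
# yes                      | no                | 1           | 1   (fake bot, blocked)
# no                       | yes or no         | 0 or 1      | 0   (normal visitor, allowed)
map $http_user_agent $bots {
    default    0;             # user agent does not mention "google"
    ~*google   $not_google;   # mentions "google": the result depends on the IP check
}
```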
if ($bots) { return 403; }
If $bots is 1, nginx returns 403; otherwise you can live your life as usual.
Cheers!
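For reference, here is a minimal sketch of where each piece lives in a full configuration. geo and map are only allowed in the http context, so they sit outside the server block. The listen port and the server name example.com are assumptions made for illustration and are not part of the original example.

```nginx
events {}

http {
    # IP check: 0 for the listed range, 1 for everything else
    geo $not_google {
        default      1;
        66.0.0.0/8   0;
    }

    # User agent check: only "google" user agents inherit the IP check result
    map $http_user_agent $bots {
        default        0;
        ~(?i)google    $not_google;
    }

    server {
        listen       80;
        server_name  example.com;   # hypothetical name

        location / {
            # "if" treats an empty string or "0" as false, anything else as true
            if ($bots) {
                return 403;
            }
        }
    }
}
```

Running nginx -t after editing is a quick way to confirm the syntax is accepted before reloading.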