Top tips for .htaccess and Robots.txt Part 1

By : Rivmedia |January 16, 2011 |Featured, General |0 Comment

BadBots and the .htaccessWe often find with clients that the .htaccess and robots.txt file is most of the time left as standard or not used at all, This can not only cause your site to be using unnecessary bandwidth but can also expose it to scrapers, crawlers and “bad bots”. In this post we will go through a couple of useful hints and tips surrounding our favourite Blogging platform “wordpress” and our favorite Forum script “vbulletin” , that is not to say however that some of these things will not work on an ordinary site.I have commented below if a certain part is specifically for wordpress or vbulletin , the rest can be used across the board. This information can be found around the web

Lets start things off with the .htaccess file :

1. Redirecting www. to a non www or visa versa

: This is to ensure your site is not seen the same in two locations and is more for SEO or more specifically URL canonicalization.

What is URL canonicalization ? i hear you asking , URL canonicalization is what has been described by Matt Cutts Webspam team as “the process of picking the best url when there are several choices

Redirect non Www to Www. :

RewriteEngine On

RewriteBase /

RewriteCond %{HTTP_HOST} ^yourdomain.com

RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]

Redirect Www to non Www :

RewriteEngine On

RewriteBase /

RewriteCond % ^www.yourdomain.com [NC]

RewriteRule ^(.*)$ http://yourdomain.com/$1 [L,R=301]

2. Filtering Out Badbots

Bad bots are generally Web Crawlers which cause you and your site problems, From spamming , scaping ( stealing your content ) finding certain file extensions ( like Mp3’s ) and harvesting email addresses. This bad bot list should help fight at least some of them . Feel free to add to this list in the comment sections :)

RewriteEngine On

RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Iria.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Stripper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Offline.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Copier.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Crawler.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Snagger.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Teleport.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Reaper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Wget.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Grabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Sucker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Downloader.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Siphon.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Collector.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mag-Net.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Widow.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pockey.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*DA.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Snake.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*BackWeb.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*gotit.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Vacuum.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SmartDownload.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pump.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HMView.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Ninja.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HTTrack.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JOC.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*likse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Memo.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*pcBrowser.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SuperBot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*leech.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mirror.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Recorder.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*GrabNet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Likse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Navroad.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*attach.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Magnet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Surfbot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Bandit.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Ants.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Buddy.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Whacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*DISCo\Pump.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Drip.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*EirGrabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*ExtractorPro.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*EyeNetIE.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FlashGet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*GetRight.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Gets.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Go!Zilla.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Go-Ahead-Got-It.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Grafula.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*IBrowse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*InterGET.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Internet\Ninja.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JetCar.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JustView.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*MIDown\tool.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mister\PiX.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*NearSite.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*NetSpider.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Offline\Explorer.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*PageGrabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Papa\Foto.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pockey.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*ReGet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Slurp.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SpaceBison.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SuperHTTP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Teleport.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebAuto.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Webcam\Watcher.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebCopier.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebFetch.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebReaper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FreeLoader.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Clint’s\Webcam.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebCam\Spy.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*CamEVU.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*iCamMaster.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Cam\Chaser\Pro.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FlashIT.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebSauger.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebStripper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebWhacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Web\Image\Collector.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Web\Sucker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Webster.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Wget.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*eCatch.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*ia_archiver.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*lftp.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*tAkeOut.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FileHound.*$
RewriteRule ^.* – [F,L]

3. Blocking access to your .htaccess file

. You really dont want people looking at your .htaccess file so use this to block them.

<Files .htaccess>

order allow,deny

deny from all </Files>

Now for the Robots.txt , The robots.txt is known as The Robots Exclusion Protocol. Basically meaning it can be used to tell web robots not to crawl a certain part of the sites structure, Most importantly places like admin folders , not only will this save you bandwidth but will also save the bots from indexing potencially harmful folders.

1. The best Robots.txt Document to use for wordpress blogs :

Note – Should only be used with WordPress installs

User-agent: *

Disallow: /wp-content/

Disallow: /wp-icludes/

Disallow: /trackback/

Disallow: /wp-admin/

Disallow: /archives/

Disallow: /category/

Disallow: /tag/*

Disallow: /tag/

Disallow: /wp-*

Disallow: /login/

Disallow: /*.js$

Disallow: /*.inc$

Disallow: /*.css$

Disallow: /*.php$

User-agent: All

Allow: /

User-agent: Googlebot-Image

Disallow: /

User-agent: ia_archiver

Disallow: /

User-agent: duggmirror

Disallow: /

2. The best Robots.txt documents to use with vbulletin Forums

: Note – Should only be used with vbulletin installs

User-agent: *

Disallow: /cgi-bin/

Disallow: /admincp/

Disallow: /announcement.php

Disallow: /calendar.php

Disallow: /cron.php

Disallow: /editpost.php

Disallow: /faq.php

Disallow: /joinrequests.php

Disallow: /login.php

Disallow: /member.php

Disallow: /misc.php

Disallow: /modcp/

Disallow: /moderator.php

Disallow: /newreply.php

Disallow: /newthread.php

Disallow: /online.php

Disallow: /printthread.php

Disallow: /private.php

Disallow: /profile.php

Disallow: /register.php

Disallow: /search.php

Disallow: /sendmessage.php

Disallow: /showgroups.php

Disallow: /showpost.php

Disallow: /subscription.php

Disallow: /subscriptions.php

Disallow: /threadrate.php

Disallow: /usercp.php

Obviously you could expand on these no end , however we feel these are at very least the basic functions that every wordpress and vbulletin install should ship with.

We’d love to hear comments on other methods or code you guys use to protect and opimise your sites, this is the end of part 1 . However Stay subscribed for Part 2 to be posted in the near future with more hints and tips including hotlink protection and how to locate the badbots useragents from your sites log files.