Top tips for .htaccess and Robots.txt Part 1
We often find with clients that the .htaccess and robots.txt file is most of the time left as standard or not used at all, This can not only cause your site to be using unnecessary bandwidth but can also expose it to scrapers, crawlers and “bad bots”. In this post we will go through a couple of useful hints and tips surrounding our favourite Blogging platform “wordpress” and our favorite Forum script “vbulletin” , that is not to say however that some of these things will not work on an ordinary site.I have commented below if a certain part is specifically for wordpress or vbulletin , the rest can be used across the board. This information can be found around the web
Lets start things off with the .htaccess file :
1. Redirecting www. to a non www or visa versa
: This is to ensure your site is not seen the same in two locations and is more for SEO or more specifically URL canonicalization.
What is URL canonicalization ? i hear you asking , URL canonicalization is what has been described by Matt Cutts Webspam team as “the process of picking the best url when there are several choices”
Redirect non Www to Www. :
RewriteBase /
RewriteCond %{HTTP_HOST} ^yourdomain.com
RewriteRule (.*) http://www.yourdomain.com/$1 [R=301,L]
Redirect Www to non Www :
RewriteBase /
RewriteCond % ^www.yourdomain.com [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [L,R=301]
2. Filtering Out Badbots
Bad bots are generally Web Crawlers which cause you and your site problems, From spamming , scaping ( stealing your content ) finding certain file extensions ( like Mp3′s ) and harvesting email addresses. This bad bot list should help fight at least some of them . Feel free to add to this list in the comment sections
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Iria.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Stripper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Offline.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Copier.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Crawler.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Snagger.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Teleport.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Reaper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Wget.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Grabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Sucker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Downloader.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Siphon.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Collector.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mag-Net.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Widow.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pockey.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*DA.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Snake.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*BackWeb.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*gotit.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Vacuum.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SmartDownload.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pump.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HMView.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Ninja.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HTTrack.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JOC.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*likse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Memo.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*pcBrowser.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SuperBot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*leech.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mirror.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Recorder.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*GrabNet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Likse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Navroad.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*attach.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Magnet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Surfbot.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Bandit.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Ants.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Buddy.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Whacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*DISCo\Pump.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Drip.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*EirGrabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*ExtractorPro.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*EyeNetIE.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FlashGet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*GetRight.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Gets.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Go!Zilla.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Go-Ahead-Got-It.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Grafula.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*IBrowse.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*InterGET.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Internet\Ninja.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JetCar.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*JustView.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*MIDown\tool.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Mister\PiX.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*NearSite.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*NetSpider.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Offline\Explorer.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*PageGrabber.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Papa\Foto.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Pockey.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*ReGet.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Slurp.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SpaceBison.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*SuperHTTP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Teleport.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebAuto.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Webcam\Watcher.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebCopier.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebFetch.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebReaper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FreeLoader.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Clint’s\Webcam.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebCam\Spy.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*CamEVU.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*iCamMaster.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Cam\Chaser\Pro.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FlashIT.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebSauger.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebStripper.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebWhacker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*WebZIP.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Web\Image\Collector.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Web\Sucker.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Webster.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Wget.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*eCatch.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*ia_archiver.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*lftp.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*tAkeOut.*$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^.*FileHound.*$
RewriteRule ^.* – [F,L]
3. Blocking access to your .htaccess file
. You really dont want people looking at your .htaccess file so use this to block them.
order allow,deny
deny from all </Files>
Now for the Robots.txt , The robots.txt is known as The Robots Exclusion Protocol. Basically meaning it can be used to tell web robots not to crawl a certain part of the sites structure, Most importantly places like admin folders , not only will this save you bandwidth but will also save the bots from indexing potencially harmful folders.
1. The best Robots.txt Document to use for wordpress blogs :
Note – Should only be used with WordPress installs
Disallow: /wp-content/
Disallow: /wp-icludes/
Disallow: /trackback/
Disallow: /wp-admin/
Disallow: /archives/
Disallow: /category/
Disallow: /tag/*
Disallow: /tag/
Disallow: /wp-*
Disallow: /login/
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.php$
User-agent: All
Allow: /
User-agent: Googlebot-Image
Disallow: /
User-agent: ia_archiver
Disallow: /
User-agent: duggmirror
Disallow: /
2. The best Robots.txt documents to use with vbulletin Forums
: Note – Should only be used with vbulletin installs
Disallow: /cgi-bin/
Disallow: /admincp/
Disallow: /announcement.php
Disallow: /calendar.php
Disallow: /cron.php
Disallow: /editpost.php
Disallow: /faq.php
Disallow: /joinrequests.php
Disallow: /login.php
Disallow: /member.php
Disallow: /misc.php
Disallow: /modcp/
Disallow: /moderator.php
Disallow: /newreply.php
Disallow: /newthread.php
Disallow: /online.php
Disallow: /printthread.php
Disallow: /private.php
Disallow: /profile.php
Disallow: /register.php
Disallow: /search.php
Disallow: /sendmessage.php
Disallow: /showgroups.php
Disallow: /showpost.php
Disallow: /subscription.php
Disallow: /subscriptions.php
Disallow: /threadrate.php
Disallow: /usercp.php
Obviously you could expand on these no end , however we feel these are at very least the basic functions that every wordpress and vbulletin install should ship with.
We’d love to hear comments on other methods or code you guys use to protect and opimise your sites, this is the end of part 1 . However Stay subscribed for Part 2 to be posted in the near future with more hints and tips including hotlink protection and how to locate the badbots useragents from your sites log files.






