Class 10 – Robots.txt: Preventing search engines from indexing your site
Thursday, July 23rd, 2009
We have talked a bit about SEO and optimizing your site so that the major search engines index it for particular keywords. We generally, but not always, want to make it easy for search engines to figure out what any given page is about. So it seems only appropriate that we should discuss the opposite procedure: how to prevent search engines from indexing your site.
The major search engines, run by Google, Yahoo, and Microsoft, send out spiders, which are automated programs that crawl the web in search of websites. Every page on every website a spider encounters is analyzed, categorized, and logged in a giant database. That database is what is used when someone does a search on a search engine for a particular term. If your site has been categorized as being related to that term, your site will show up in the search results on the search engine’s website.
Before any of the major search engines’ spiders index the contents of your site, they will look for a file on your web server called robots.txt. If you want to prevent the search engines from indexing your site and mentioning it in their search results, you should create a robots.txt file and upload it to the root folder of your website. Keep in mind that robots.txt is a request that well-behaved spiders honor, not a lock: it does not stop a person or a badly behaved program from visiting those pages.
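Because the file has to sit in the root folder, a spider can work out where to look from any page address on your site. Here is a minimal Python sketch of that step. The function name robots_url and the www.example.com address are made up for illustration, and real spiders may handle more cases than this.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the address where a spider would look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page address, used only as an example.
print(robots_url("http://www.example.com/blog/2009/07/class-10.html"))
# prints http://www.example.com/robots.txt
```

No matter how deep the page is in your folder structure, the answer is always the robots.txt at the top of the site, which is why a copy placed in a subfolder is ignored.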
To prevent spiders from indexing the entire site, put this code into your robots.txt file:
User-agent: *
Disallow: /
To prevent spiders from indexing only the subdirectory called “private”, put the following code in your robots.txt file:
User-agent: *
Disallow: /private/
To prevent spiders from indexing both the “private” folder, and another folder called “my_stuff”, use a robots.txt file with the following code:
User-agent: *
Disallow: /private/
Disallow: /my_stuff/
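And so on. You can repeat the “Disallow” line for as many folders as you want to keep out of the search engines. Each “Disallow” goes on its own line, and remember that anyone can read your robots.txt file, so it is not a way to hide secret folders.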
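If you want to check your rules before uploading the file, Python's standard library includes a robots.txt parser you can try them against. The sketch below is only a rough test: the helper name is_allowed and the www.example.com addresses are made up for illustration, and each search engine's spider may read the file a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The two-folder example, one rule per line.
RULES = """\
User-agent: *
Disallow: /private/
Disallow: /my_stuff/
"""


def is_allowed(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if the given rules let this agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical addresses, used only as examples.
print(is_allowed(RULES, "http://www.example.com/index.html"))          # True
print(is_allowed(RULES, "http://www.example.com/private/notes.html"))  # False
print(is_allowed(RULES, "http://www.example.com/my_stuff/photo.jpg"))  # False
```

Swapping RULES for the "Disallow: /" version makes every address on the site come back False.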
For more information about robots.txt, check out The Web Robots Pages.