Robots.txt

A website’s robots.txt file tells web crawlers and other robots how to handle the site. Most notably, it tells search engines which parts of the site they may crawl (and, by extension, index) and which to leave alone.

You can use it on your site to tell search engines to ignore certain pages or directories, or the entire site altogether.

Keep in mind that robots.txt is only a suggestion. It’s a widely recognized standard (the Robots Exclusion Protocol) that most search engines and other well-behaved robots follow and respect. But the file does nothing to stop spam bots and malware.

The robots.txt file is not a security or privacy tool. It is only a guide, a map for non-humans.

This file is primarily an exclusion standard: it tells robots what to ignore. For inclusion, a sitemap file is preferred.

We’ll get into how the file works, but making one and “installing” it is easy. Simply create a text file named robots.txt and upload it to the root folder of your domain (http://www.example.com/robots.txt).
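If you have shell access to your server, for example, the upload can be a single command. (A sketch only; the web-root path /var/www/html and the login details are assumptions that vary by host.)

scp robots.txt user@example.com:/var/www/html/robots.txt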

Generally, WordPress will automatically generate a virtual robots.txt, so make sure to check for one (visit your site’s /robots.txt URL) before creating your own.

Reasons for using the file include keeping private pages out of search engine results and excluding items that are irrelevant to the site and its categorization.

Just always keep in mind that excluding a page in this text file is not a guarantee. Malicious web robots usually do not pay attention to any exclusion.

Matt Cutts of Google has explained this point:

To a server, a request from a web robot looks essentially no different than one from a human, so there is no way to guarantee your site will never be visited by them. To avoid damage from things like website copiers, separate security measures should be taken.

For example, you could configure your server to target robots with specific names and return errors or alternative content.[1]
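On Apache, for instance, a minimal sketch in the spirit of the access-control approach in [1] could flag a bot by its User-Agent header and deny it. (The name “EvilBot” is a stand-in, and this assumes Apache 2.2 with mod_setenvif enabled; place it in your .htaccess or server config.)

# Flag any request whose User-Agent contains "EvilBot"
SetEnvIfNoCase User-Agent "EvilBot" bad_bot
# Deny flagged requests; let everyone else through
Order Allow,Deny
Allow from all
Deny from env=bad_bot

Unlike a robots.txt rule, this is enforced by the server, so the bot doesn’t get a vote.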

Additionally, a robots.txt file only applies to the host it is served from, not to subdomains. For example, if you have example.com and your blog is on blog.example.com, you’ll need a separate robots file at blog.example.com/robots.txt.

Of course, you should probably move your blog to example.com/blog because you’re losing SEO juice.

Example Uses

Below are some example robots.txt files. Simply copy these, or add your own variation, into a text file named robots.txt and upload it to your server. (Or you can edit/replace the old one.)

This first example welcomes all robots to visit all of your site’s files. The * denotes a wildcard, or more literally, it means “all.” A “user-agent” is the name of the robot.

User-agent: *
Disallow:

If you have no robots.txt file, the above is the default.

This next example tells all robots to stay out of your website. (No guarantee, remember.) The slash represents the root folder of your website, the one that contains all of its files.

User-agent: *
Disallow: /

Next, you can tell all robots to avoid specific folders/directories on your site; each Disallow line excludes one path. (For example, replace “junk” with the name of your folder. The second line here blocks WordPress’s admin folder.)

User-agent: *
Disallow: /junk/
Disallow: /wp-admin/

You can also disallow a specific file, but allow everything else in the folder.

User-agent: *
Disallow: /folder/secretPage.html

If you know of a bad/malicious bot, you can ask it to not crawl your site by calling it out by name.

User-agent: EvilBot
Disallow: /

Of course, you can tell specific bots to avoid specific folders (or files) too.

User-agent: EvilBot
User-agent: Googlebot
Disallow: /private/

If needed, or if you’re curious, User Agent String keeps a full list of web crawlers.

Alternative Uses

The above are the most recognized directives for search bots, but there are others that many crawlers will also honor. Support for these varies from engine to engine.

For instance, Crawl-delay asks a crawler to slow down the rate at which it requests pages from your site. The number given denotes the seconds to wait between requests. (Bing and Yandex honor it; Google ignores it.)

User-agent: *
Crawl-delay: 1

If you want to block an entire folder on your site but allow one specific file inside it, you can use the Allow directive, an extension to the original standard that the major search engines support:

User-agent: *
Allow: /directory/file.html
Disallow: /directory/

If you have multiple sitemaps that you want the bot to take notice of, you can specify their URLs.

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml

For multiple domains serving the same content, such as the http, https, www, and non-www versions of your site, you can suggest your preferred domain using the Host directive. (In practice, Yandex is the main engine that recognizes it.) Just make sure to include this at the bottom of your robots.txt file, underneath the crawl-delay.

Host: example.com
-or-
Host: www.example.com
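
Putting several of these together, a complete robots.txt might look like this (the paths and URLs are the illustrative ones from above):

User-agent: *
Disallow: /junk/
Disallow: /wp-admin/
Crawl-delay: 1

Sitemap: http://www.example.com/sitemap1.xml
Sitemap: http://www.example.com/sitemap2.xml

Host: example.com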

As an extra note, you can always use a meta tag to tell bots crawling your site not to index a page in search results. This works at the indexing stage rather than the crawling stage, so the page must not also be blocked in robots.txt, or crawlers will never see the tag.

<meta name="robots" content="noindex"/>

References
[1] Httpd.apache.org, ‘Access Control – Apache HTTP Server Version 2.2’, 2015. [Online]. Available: https://httpd.apache.org/docs/2.2/howto/access.html. [Accessed: 25-Feb-2015].
