This article is a brief introduction to robots.txt. Search engines use web crawlers (also known as web spiders or bots) to crawl the internet and index massive amounts of data. These automated scripts crawl websites and index their web pages to serve up-to-date search results.
This article assumes a WordPress blog, where the robots.txt file is placed in the root directory, but robots.txt works the same way on all platforms and websites.
What is robots.txt?
It is a tiny text file that resides in the root directory of a blog or website, and its purpose is to tell web crawlers which files and folders they may or may not crawl.
Reasons to create robots.txt
It's a good SEO practice. robots.txt is one of the first files Google's bots look for when crawling your blog, and it helps make the indexing process faster.
Your blog is live but still in development, and you don't want search engines to crawl it yet.
Directing bots not to crawl posts or pages that are not important.
Some scripts need to give special instructions to crawlers, and those instructions can be placed in this file.
It can be used to fine-tune your blog's accessibility for all types of crawlers.
According to Google, you should not use robots.txt to hide your web pages from Google Search results. They can still get indexed if another page links to them.
How to create robots.txt
Let's create one for your blog. Log in to your blog's cPanel and check whether this file already exists; if not, create one, name it robots.txt, and save it at the root level of your blog.
In WordPress, the robots.txt file lives in the public_html or www folder, which is also called the root directory.
The three main directives to use:
User-agent: defines which crawlers the rules below it apply to. Usually this is the * wildcard, which means the rules apply to all crawlers.
Disallow: tells crawlers not to crawl the path placed after this directive.
Allow: optional, but can be used to explicitly permit crawlers to access a specific path or file, even inside an otherwise disallowed directory.
See the robots.txt example below; paste the code into your robots.txt file and save it.
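A typical WordPress robots.txt might look like this (wp-admin and admin-ajax.php are standard WordPress paths; cgi-bin is shown only as an example of another directory you might block — adjust the paths to your own setup):

```
User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
Allow: /wp-admin/admin-ajax.php
```

The Allow line keeps admin-ajax.php reachable, since some WordPress themes and plugins rely on it for front-end features.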
The first line, User-agent: *, means that the following rules apply to all crawlers. The crawlers are then instructed not to crawl specific directories with the Disallow directive.
Setting robots.txt disallow all:
Disallow all prevents all bots from crawling and indexing your blog. It should be used only while your blog is live but still in development. Once development is done, delete the Disallow: / directive.
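A disallow-all robots.txt is just two lines:

```
User-agent: *
Disallow: /
```

The lone / matches every path on the site, so every well-behaved crawler will skip the entire blog.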
That's it. You have successfully created a robots.txt file for your blog. Whenever bots crawl your blog, they will crawl everything except the files and folders you specified above.
Some experts also recommend adding the WordPress folder wp-includes to the Disallow list, but this is no longer advisable: some bots, especially Google's, will generate errors if you block the wp-includes directory.
Once the file is created, it's time to test it with Google's robots.txt Tester to ensure everything is working.
Visit http://www.google.com/webmasters/tools/robots-testing-tool, log in, and enter your blog's URL to test the file.
XML sitemaps are a great way to boost your blog's SEO. You can add a Sitemap directive to robots.txt to point crawlers to your sitemap right from this file.
Sitemap: http://yourblogname.com/path-to-xml-sitemap.xml
The story doesn't end here. Some experts hold different opinions about this file. According to them:
Not an SEO practice anymore.
Use the noindex meta tag in the header of a page to keep that page out of search results.
<meta name="robots" content="noindex" />
Instead of using the Sitemap directive, submit your sitemap directly to Google at https://www.google.com/webmasters/tools/submit-url
Is it still worth it to use robots.txt?
Please tell us what you think in the comments section below.