Robots.txt files, which implement the robots exclusion protocol, are an indispensable tool for SEO. This text file tells search engine crawlers which pages they may access and subsequently index. Robots.txt files can also ask crawlers to stay away from certain parts of your website. This is useful if you want to keep non-public pages, such as pages still in development or online login pages, from being crawled. If your website is particularly extensive, Robots.txt is also helpful for ensuring your most relevant pages are indexed.
By outlining your requests in a Robots.txt file, you tell search engines which pages you want them to access. Reputable crawlers follow these instructions, although the file is a set of requests rather than an access control. This keeps low-value pages out of the crawl and helps maximise your crawl budget. Interested in learning more? Read on for an in-depth guide on why Robots.txt files are essential for SEO.
Major search engines like Google and Bing send out so-called “crawlers” to search through websites. Otherwise known as “robots” or “spiders”, these crawlers provide vital information to search engines so that your site can be properly indexed in search engine results pages (SERPs). This makes it easier for internet users to discover your site by entering queries into search engines. A Robots.txt file clearly outlines which pages can be searched and which pages robots should avoid.
Looking to block all search engine crawlers from accessing your customer login page? The following Robots.txt command can be used:
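The original example is not reproduced here, so what follows is a minimal sketch rather than the author's exact command. It assumes a hypothetical login page at /customer-login/ on www.example.com and uses Python's standard urllib.robotparser module to check that a two-line rule keeps every crawler away from that path while leaving other pages open. The helper name is_allowed is illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: "*" addresses every crawler, and the path
# /customer-login/ is an assumed location for the login page.
ROBOTS_TXT_ALL = """\
User-agent: *
Disallow: /customer-login/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


login_url = "https://www.example.com/customer-login/"
print(is_allowed(ROBOTS_TXT_ALL, "Googlebot", login_url))
print(is_allowed(ROBOTS_TXT_ALL, "Bingbot", "https://www.example.com/products/"))
```

The text inside ROBOTS_TXT_ALL is what would go into the Robots.txt file itself; the Python around it only verifies the rule.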
You can also tailor commands to focus on a particular search engine. If you only want to prevent Google crawlers from accessing your pages, the following command could be used:
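As a hedged illustration of a crawler-specific rule, the sketch below names Googlebot in the User-agent line and reuses the same hypothetical /customer-login/ path. The check with urllib.robotparser suggests that only Google's crawler is asked to stay away, while a crawler not named in the file remains unrestricted.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: only Googlebot is addressed; the path is assumed.
ROBOTS_TXT_GOOGLE = """\
User-agent: Googlebot
Disallow: /customer-login/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


login_url = "https://www.example.com/customer-login/"
print(can_crawl(ROBOTS_TXT_GOOGLE, "Googlebot", login_url))
print(can_crawl(ROBOTS_TXT_GOOGLE, "Bingbot", login_url))
```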
To make your life easier, you can add as many pages as you wish to the disallow list. Once you’ve created a Robots.txt file, it should be placed in the root directory of your website. Using the above examples as a guide, the URL of a Robots.txt file should read something like this:
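Because crawlers look for the file at the top level of the site, its address can be derived from any page URL. The sketch below assumes a hypothetical site at www.example.com; the function name robots_txt_url is illustrative and not part of any tool mentioned in this article.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site hosting page_url."""
    # An absolute path ("/robots.txt") replaces the page's own path and query.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("https://www.example.com/customer-login/"))
```

For the hypothetical site above, this gives https://www.example.com/robots.txt.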
Why Block Access to Web Pages?
Blocking access to certain web pages will help bolster your SEO efforts. As such, you’ll need to understand when to bring a Robots.txt file into play. If your website includes duplicate pages, you mustn’t allow crawlers to index them. Why? Indexing duplicate content can be detrimental to your SEO.
Although Google and other search engines won’t impose penalties on you for duplicate content, needless indexing of duplicate pages can make it more difficult for your most valuable pages to rank well.
Robots.txt files also make it easier to get the most out of your crawl budget. Bot crawling is a valuable commodity that can boost your SEO performance. However, simultaneous crawls can prove overwhelming for smaller sites. Larger sites, or those with high authority, tend to have a larger crawl allowance.
However, less established sites must work with relatively modest budgets. Installing Robots.txt means you can prioritise the most important pages of your website, ensuring your crawl budget isn’t wasted on secondary pages and superfluous content.
There may also be web pages that you don’t want every user to be able to access. If your website is offering a service or includes a sales funnel, there are numerous pages you’ll only ever want to display to customers after they’ve completed a certain action. If you’re incentivising these actions with discount codes or loyalty rewards, you’ll only want users who’ve completed a customer journey to access them. By blocking these pages, you’re preventing casual users from stumbling upon this information via search engine queries.
Robots.txt files are also useful for ensuring search engines are prevented from indexing certain material, such as private imagery. They can also be used to pinpoint the location of a sitemap, as well as prevent your servers from overloading if bots attempt to index images simultaneously.
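To show how an image directory and a sitemap can sit side by side in one file, here is a minimal sketch. The /private-images/ directory and the sitemap address are hypothetical, and reading the Sitemap line back with site_maps() assumes Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule set: the directory name and sitemap URL are assumptions.
ROBOTS_TXT_MEDIA = """\
User-agent: *
Disallow: /private-images/
Sitemap: https://www.example.com/sitemap.xml
"""

media_parser = RobotFileParser()
media_parser.parse(ROBOTS_TXT_MEDIA.splitlines())

# The Sitemap line is independent of any User-agent group.
sitemaps = media_parser.site_maps()
image_allowed = media_parser.can_fetch(
    "Googlebot", "https://www.example.com/private-images/team.jpg"
)

print(sitemaps)
print(image_allowed)
```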
How to Create a Robots.txt File
Now that we’ve explored the reasons why you may need a Robots.txt file, we can investigate how to create one. The easiest way to create a Robots.txt file is to use Google Webmaster Tools (since renamed Google Search Console). Once you’ve created an account, click on ‘crawler access’ and then head to ‘site configuration’. Once you’ve accessed this part of the menu, click on ‘generate robots.txt’. This tool makes quick work of creating a Robots.txt file.
To block crawler access to pages, simply select the ‘block’ option. You can then select ‘User-Agent’ to specify which search engine crawlers you want to block. Now, you can type in the site directories that you want to restrict access to. Rather than typing the entire URL of the target page, you only need to add the path that follows your domain name into ‘directories and files’. In other words, if you want to block crawler access to your customer login page, you’d simply type:
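The original entry is not reproduced here. As a rough illustration of the idea, the sketch below strips a full URL down to the path that a Disallow rule expects. The URL and the function name disallow_rule are hypothetical, and the function deliberately refuses a bare domain, since a rule of "Disallow: /" would ask crawlers to skip the whole site.

```python
from urllib.parse import urlparse


def disallow_rule(page_url: str) -> str:
    """Build a Disallow line from the path portion of page_url."""
    path = urlparse(page_url).path
    if path in ("", "/"):
        # A root path would block every page, which is rarely intended.
        raise ValueError("page_url has no path; refusing to block the whole site")
    return f"Disallow: {path}"


print(disallow_rule("https://www.example.com/customer-login/"))
```

For the hypothetical login page, only /customer-login/ would be entered in the tool.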
Once you’ve finalised which pages you wish to block, you can click on ‘add rule’ to generate Robots.txt. The Robots.txt that is generated will also give you the option to ‘Allow’ exceptions, which is useful if you only want to restrict certain search engines from indexing your site.
With everything completed, you can now click the download icon to produce a final Robots.txt file.
How Do I Install a Robots.txt File?
Now that all the hard work is taken care of, it’s time to install your Robots.txt file. You can do this yourself by uploading your file with an FTP solution. However, if there are a few gaps in your programming knowledge, it might be best to bring in the services of an expert. If you’re assigning the task to a programmer, make sure you outline exactly which pages you want to be blocked and specify any exceptions.
Robots.txt Files: Key Things to Remember
To ensure you’re making the best use of Robots.txt files, there are some best practices to keep in mind. It may seem obvious, but make sure you’re taking stock of your pages and not blocking access to high-value pages you want to be crawled and indexed.
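One way to take stock is to run a list of high-value URLs against the rules before publishing them. The following is a sketch under stated assumptions: the rule set, the URLs, and the helper name blocked_pages are all hypothetical, and the /products/ rule is included on purpose as the kind of mistake such a check is meant to catch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical draft rules; blocking /products/ is the deliberate mistake.
DRAFT_RULES = """\
User-agent: *
Disallow: /checkout/
Disallow: /products/
"""

# Hypothetical pages that should stay crawlable.
HIGH_VALUE_PAGES = [
    "https://www.example.com/products/widget",
    "https://www.example.com/about/",
]


def blocked_pages(robots_txt: str, urls: list[str], user_agent: str = "*") -> list[str]:
    """Return the URLs that the rules would stop user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


print(blocked_pages(DRAFT_RULES, HIGH_VALUE_PAGES))
```

Any URL the function returns is a page the draft file would hide from crawlers, so the matching Disallow line should be reviewed before the file goes live.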
Although many users turn to Robots.txt to block sensitive information from being displayed on search engine results pages, it’s not the best way to keep such material out of the public eye. If other pages link to the ones you’ve blocked, there’s always a chance they may end up being indexed. Use an alternative approach to keep sensitive information hidden from view.
To ensure your Robots.txt file isn’t negatively impacting your SEO, you must keep it updated. Every time you add new pages, directories, or files to your website, you’ll need to update your Robots.txt file accordingly. Although this is only necessary if you’re adding content that needs to be restricted, revising your Robots.txt file is good practice. It not only helps keep restricted content away from crawlers but can also benefit your SEO strategy.
By implementing Robots.txt effectively, you can maximise your crawl budget and prioritise your most important pages, prevent indexing of duplicate content, and minimise the chance of simultaneous crawls forcing your servers into a standstill.
Greg Tuohy is the Managing Director of Docutec, a business printer and office automation software provider. Greg was appointed Managing Director in June 2011 and is the driving force behind the team at the Cantec Group. Immediately after completing a Science degree at UCC in 1995, Greg joined the family copier/printer business. Docutec also makes printers for family homes, such as multifunction printers.