robots.txt: What This File Is For

The index file (robots.txt) and the XML sitemap carry some of the most important information about a website: search engine bots use them to learn how to "read" a particular site, identify its important pages, and decide which ones can be skipped. When traffic drops, the robots.txt file is the first thing to check.

Robots.txt: What Is It?

Robots.txt, also known as the "index file," is a plain text file encoded in UTF-8 (other encodings may be processed incorrectly). It gives search engine bots "guidance" on what should be crawled first on a website. The file works over the FTP, HTTP, and HTTPS protocols, and everything specified in robots.txt applies only to the location (port, protocol, host) where it is placed.

Robots.txt is placed in the root directory and, after publication, must be accessible at this address: https://site.com.ru/robots.txt.

The file may begin with a BOM (Byte Order Mark), the Unicode character U+FEFF that indicates byte order when reading data; search engines ignore it at the start of the file, but a BOM anywhere else can cause directives to be misread.

The size of robots.txt must not exceed 500 KB; anything beyond that is ignored. This limitation is set by Google.

When processing robots.txt data, search engine bots receive one of three access instructions:

  1. "Partial" – the bot may crawl only specific elements and pages of the site;
  2. "Full" – the bot has full access to all site content;
  3. "Prohibition" – the bot is denied crawling access entirely.
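These three access levels correspond directly to robots.txt directives. A minimal sketch (the paths are illustrative, and only one of these setups would appear in a real file for a given bot):

```text
# Full: an empty Disallow opens the whole site
User-agent: *
Disallow:

# Partial: only /private/ is off-limits
User-agent: *
Disallow: /private/

# Prohibition: the whole site is closed to crawling
User-agent: *
Disallow: /
```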

Here are the response options that search bots receive during scanning:

  • 2xx – crawling succeeded;
  • 3xx – the bot follows the redirect until it receives a different response; after five failed redirect hops, the file is treated as a 404;
  • 4xx – the bot interprets this as "no restrictions": the entire site with all its content may be crawled;
  • 5xx – treated as a temporary server error and a complete ban on crawling. The bot will keep returning to the file until the response changes. If Google determines that a site returns 5xx instead of 404 for its missing pages, it treats those 5xx responses as 404s.

There is currently no data on how search bots handle a robots.txt file that is unavailable due to server connectivity issues.

Robots.txt: What It's For

There are situations and website pages that search engine bots don't need to see or visit:

  • Admin pages;
  • User personal information;
  • Search results;
  • Site mirrors.

Robots.txt works as a filter that directs search engine bots away from files that shouldn't be visible to everyone. Without an index file, this confidential information could end up in search engine results. However, there's a small but important caveat.

Important! There's a possibility that content listed in robots.txt might still appear in search results if links to it are found within the site or on external resources.

Robots.txt: Writing Algorithm

Robots.txt can be written in any text editor, but it's important to follow the rules. User-agent and Disallow are the main directives; the others (and there are many) are secondary.

User-agent specifies which search bot the rules that follow apply to. There are over 300 different bots, something to keep in mind when creating robots.txt, though the file is often written only for the main search bot.

The primary bot for Google is Googlebot.

Specialized Google bots:

  • For the Google AdSense service – Mediapartners-Google;
  • For evaluating page quality (landing pages) – AdsBot-Google;
  • For images – Googlebot-Image;
  • For video – Googlebot-Video;
  • For the mobile version – Googlebot-Mobile.
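A single file can address several of these bots with separate rule groups; a bot that is not named falls back to the * group. A short sketch (the paths are illustrative):

```text
# Rules for Google's main crawler
User-agent: Googlebot
Disallow: /drafts/

# Rules for the image crawler
User-agent: Googlebot-Image
Disallow: /photos/private/

# Fallback for every other bot
User-agent: *
Disallow: /admin/
```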

Disallow – tells bots what should not be crawled on the site. With it, you can leave the site fully open for crawling or close it off entirely.

Important! This rule is typically used when a site is under development and shouldn't be indexed by search engines. Disallow should be "turned off" immediately after site work is completed, when it's ready for user visits. Webmasters often forget to do this.

Allow – the permissive rule. Used when search engine bots need to be pointed at specific pages (e.g. /catalog) while the rest of the content remains closed to them.
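The /catalog case mentioned above can be sketched like this:

```text
User-agent: *
Allow: /catalog
Disallow: /
```

Here /catalog stays open because its Allow rule matches it more specifically, while every other path falls under Disallow: /.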

Disallow and Allow are evaluated together, sorted by URL prefix length from shortest to longest. If multiple rules match a page, the rule with the longest matching prefix wins; when an Allow and a Disallow match with equal specificity, Google applies the less restrictive one (Allow).
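This precedence can be sketched in a few lines of Python. It is a simplified model that ignores wildcards, and the rules and paths below are illustrative:

```python
def is_allowed(rules, path):
    """Decide whether `path` may be crawled.

    rules: list of (directive, path_prefix) tuples, e.g. ("Disallow", "/catalog").
    The longest matching prefix wins; on a tie, Allow beats Disallow,
    as Google's documentation describes.
    """
    best = None  # (prefix_length, 1 for Allow / 0 for Disallow)
    for directive, prefix in rules:
        if path.startswith(prefix):
            candidate = (len(prefix), 1 if directive == "Allow" else 0)
            if best is None or candidate > best:
                best = candidate
    return best is None or best[1] == 1  # no matching rule means "allowed"

rules = [("Disallow", "/catalog"), ("Allow", "/catalog/shoes")]
print(is_allowed(rules, "/catalog/shoes/42"))  # True: the longer Allow prefix wins
print(is_allowed(rules, "/catalog/bags"))      # False: only Disallow matches
```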

Sitemap – informs search engine bots that the content to be indexed is listed at an address like https://site.ru/sitemap.xml. When a search bot's regular "crawl" detects changes in this file, it promptly updates the information in its database. It's important to create the sitemap file itself correctly.
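In robots.txt this directive takes a single line; the URL below mirrors the example above:

```text
Sitemap: https://site.ru/sitemap.xml
```

Several Sitemap lines may be listed if the site has more than one sitemap file.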

Crawl-delay – a timer directive that sets the minimum interval between the bot's page requests, so that crawling doesn't overload the server.

Important! This rule is for weak servers and applies to all search engines except Google.

Clean-param (supported by Yandex) – helps avoid duplicate content, which can exist at addresses containing "?" symbols. Such addresses appear with different session IDs, sorting options, and the like.
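Yandex's syntax for this directive lists the parameters to ignore and, optionally, a path prefix they apply to (the parameter names here are illustrative):

```text
# Treat URLs differing only in sid or sort under /catalog/ as one page
Clean-param: sid&sort /catalog/
```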

"/", "*", "$", "#" – main robots.txt symbols

When creating (writing) robots.txt, a special set of symbols is used:

"/" – slash. With it, the webmaster indicates what is closed to the bot. A single slash in Disallow (Disallow: /) prohibits crawling of the entire site; a path between two slashes (Disallow: /catalog/) closes a specific directory.

"*" – asterisk. A wildcard that matches any sequence of characters (including none); by default, every rule behaves as if it ended with an asterisk.

"$" – dollar sign. Restricts the asterisk by anchoring a rule to the end of the URL (for example, Disallow: /*.pdf$ blocks only URLs that end in .pdf).

"#" – hash. Marks a comment that general users needn't read and that the bot skips entirely.
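Put together, the special characters look like this (the paths are illustrative):

```text
User-agent: *        # applies to every bot
Disallow: /*/tmp/    # any URL with /tmp/ anywhere in its path
Disallow: /*.pdf$    # only URLs that end exactly in .pdf
```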

Checking robots.txt

After the robots.txt file is written, its correctness must be verified. This is done through Google's webmaster tools: enter the file's source code into the form provided and specify the site being checked.
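The file can also be sanity-checked locally with Python's standard library. A quick sketch using urllib.robotparser, with illustrative rules and URLs:

```python
from urllib.robotparser import RobotFileParser

# Feed the rules in directly instead of fetching them over the network
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

print(rp.can_fetch("Googlebot", "https://site.ru/admin/users"))  # False
print(rp.can_fetch("Googlebot", "https://site.ru/catalog"))      # True
```

Note that urllib.robotparser implements the original first-match convention, so its verdicts can differ from Google's longest-prefix handling when Allow and Disallow rules overlap.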

Common robots.txt mistakes to avoid:

Here are the most frequent errors made when filling out robots.txt, usually due to haste or inattention:

  • confused, mixed-up rules and instructions;
  • listing multiple directories/folders in a single Disallow instruction (each needs its own line);
  • naming the file incorrectly: only lowercase "robots.txt" is allowed; variants such as "Robots.txt" or "ROBOTS.TXT" will not be recognized;
  • adding pages to robots.txt that shouldn't be there;
  • leaving User-agent empty, when it must always be filled in;
  • stray extra symbols, which cause errors when search engine bots scan the file.

Unconventional uses of robots.txt

Besides its primary function, the index file can serve as a platform for recruiting new employees (especially SEO specialists and creative professionals), and even for placing advertising blocks.

Summary:

Robots.txt, beyond its main function of providing instructions for search engine bots, allows a resource to recruit new employees, promote its company, experiment, and continuously improve. The key is to avoid making mistakes.
