No more noindex in robots.txt, REP to become standard

A pedant’s wet dream: after 25 years of widespread use of REP, the standard responsible for communicating with crawlers via robots.txt, Google has submitted a proposal to the IETF (Internet Engineering Task Force) to standardize the protocol. There are also a few minor changes for us gray SEOs.

A little REP history

REP, or Robots Exclusion Protocol, also known as the robots exclusion standard and most commonly referred to simply as robots.txt, is a standard for communication between websites and the crawlers buzzing around the Internet.

We have a whole host of crawlers scanning the web: Googlebot, the bots of SEO tools (Ahrefs, Majestic, etc.), the Internet Archive, and a whole swarm of smaller and larger robots designed to run analytics, rob us of valuable data, or simply check whether our WordPress installation has gone unpatched since 2008 and is ripe for hacking.

To get a grip on all this junk, some crazy Dutchman named Martijn Koster created REP, a standard that allows a webmaster to ask one, several or all robots not to scan a single subpage, part of a site, or its entirety.

REP is thus a standard for excluding robot activity, and it can be used alongside Sitemaps, a standard for pointing robots at the individual pages we do want them to visit.
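For a feel of how the exclusion rules behave in practice, here is a minimal sketch using Python's standard-library robots.txt parser; the rules, paths and sitemap URL are made-up examples:

    import urllib.robotparser

    # Hypothetical rules: block everything under /admin/ for all robots,
    # and point crawlers at the sitemap (the URL is a placeholder).
    rules = [
        "User-agent: *",
        "Disallow: /admin/",
        "",
        "Sitemap: https://example.com/sitemap.xml",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    # Ask whether a given crawler may fetch a given URL.
    print(parser.can_fetch("Googlebot", "https://example.com/admin/settings"))  # False
    print(parser.can_fetch("Googlebot", "https://example.com/blog/"))           # True
    print(parser.site_maps())  # ['https://example.com/sitemap.xml'] (Python 3.8+)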

And in the image below you can see the creator of robots.txt eating a cake to celebrate the standard's 25th anniversary:

What happens to robots.txt?

REP, despite its widespread use, has never become an official Internet standard. The lack of official documentation and of a steward has led to a certain arbitrariness in interpreting the directives, and some robots, such as the Internet Archive's, have stopped listening to robots.txt altogether.

Despite this, millions of webmasters still use the standard, so Google decided it was high time to legalize this 25-year-old relationship and took two steps to do so:

  1. Submitting a request to the IETF, a nonprofit Internet standards organization, to standardize the protocol.
  2. Open-sourcing the robots.txt parser that Google uses to interpret the directives contained in robots.txt files.

While the first step is of organizational and formal importance, it is the second that matters to webmasters, because specific lessons flow from it.

What is changing in robots.txt and what will change in the future?

Noindex in robots.txt

First of all: as of September 1, 2019, there is no more noindex directive in robots.txt, even though, God knows why, it was recommended here and there. The only correct ways of passing noindex are the robots meta tag in the page's code or the X-Robots-Tag HTTP header; removing a page from the index is also possible through 404 and 410 status codes, password protection, or the removal tool in Search Console. End of story, period.
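Below is a minimal, hypothetical sketch of those two supported forms: a tiny Python WSGI app that sends noindex both as an X-Robots-Tag response header and as a robots meta tag in the markup (in practice one of the two is enough):

    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        # Form 1: noindex via the X-Robots-Tag HTTP response header.
        start_response("200 OK", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("X-Robots-Tag", "noindex"),
        ])
        # Form 2: noindex via the robots meta tag in the page's <head>.
        return [b"<html><head><meta name='robots' content='noindex'></head>"
                b"<body>Page we want kept out of the index</body></html>"]

    if __name__ == "__main__":
        make_server("localhost", 8000, app).serve_forever()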

Note that in order to pass the noindex directive to Google, the site must not be blocked by robots.txt. How would a robot read a directive contained in a page that we prohibit it from accessing? 🙂

Not just http

What else will change? Robots.txt is set to become a protocol available not only over HTTP: it will also be usable, for example, on CoAP or FTP servers.

Cache for robots.txt

The default maximum caching time for the robots.txt file will be 24 hours, and caching directives (such as the Cache-Control header) are also to be respected. This will save transfer and server resources spent on the robots.txt file itself.
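As a quick illustration, the snippet below simply fetches a robots.txt file and prints the caching header a crawler is expected to honour; the URL is a placeholder, so swap in your own domain:

    import urllib.request

    ROBOTS_URL = "https://example.com/robots.txt"   # placeholder URL

    with urllib.request.urlopen(ROBOTS_URL, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        # e.g. "max-age=86400" would let the file be cached for a full day.
        print("Cache-Control:", resp.headers.get("Cache-Control"))
        print(body[:200])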

Unreachable robots.txt file

A very important change, one that will need to be watched in the long run, is the rule that if a robots.txt file that was previously available becomes unreachable, e.g. due to repeated server failures, the robot is to stop trying to retrieve it. The last known disallow rules will continue to be respected.

What is disturbing is that the robot is expected to stop checking for a long time; the draft only says a “reasonably long period of time”. In practice this means that if we have prolonged server problems, or, for example, accidentally delete the robots.txt file, later changes to that file may be picked up very slowly.
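To make the mechanism concrete, here is a minimal sketch (not Google's actual logic) of such fallback behaviour in Python; the 24-hour back-off and the URL are assumptions, since the draft only speaks of a “reasonably long” period:

    import time
    import urllib.request
    import urllib.robotparser

    ROBOTS_URL = "https://example.com/robots.txt"   # placeholder URL
    RETRY_AFTER = 24 * 3600                         # assumed back-off window

    _cached_rules = None   # last successfully parsed robots.txt
    _next_attempt = 0.0

    def robots_rules():
        """Return the current rules, falling back to the last known copy
        when the live file cannot be fetched."""
        global _cached_rules, _next_attempt
        if time.time() < _next_attempt:
            return _cached_rules
        _next_attempt = time.time() + RETRY_AFTER
        try:
            with urllib.request.urlopen(ROBOTS_URL, timeout=10) as resp:
                parser = urllib.robotparser.RobotFileParser(ROBOTS_URL)
                parser.parse(resp.read().decode("utf-8", errors="replace").splitlines())
                _cached_rules = parser
        except OSError:
            pass   # unreachable: keep respecting the previously known rules
        return _cached_rules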

To sum up

We will describe more changes after the release of the full specifications of the new standard, which will certainly be discussed before approval. As for the discussion, I’d be happy to engage in it with anyone who needs help optimizing their site, or any topic I cover on the blog. Feel free to contact me.
