
I am about to create a robots.txt file.

I am using Notepad.

How should I save the file: UTF-8, ANSI, or something else?

Also, should the file name have a capital R?

And in the file, I am specifying a sitemap location. Should this be written with a capital S?

  User-agent: *
  Sitemap: http://www.domain.se/sitemap.xml

Thanks

Jacco
  • ANSI means [American National Standards Institute](http://en.wikipedia.org/wiki/American_National_Standards_Institute). I guess you mean US-ASCII instead. – Gumbo Sep 28 '10 at 20:37
  • Unfortunately, Microsoft used to use "ANSI" to mean a code page (which in turn depends on a setting, so it is not well-defined in the best of worlds). Historically, the company was striving to get ANSI to actually accept these as a standard, but luckily, that never happened. In practice, this option in Notepad usually means the current 8-bit code page (so probably something like code page 1252). – tripleee Sep 20 '22 at 05:43
  • You are doing something wrong if you populate this file with anything which isn't straight 7-bit US-ASCII, regardless of what the standards say. – tripleee Sep 20 '22 at 05:45

7 Answers

Since the file should consist only of ASCII characters, it normally doesn't matter whether you save it as ANSI or UTF-8.

However, you should choose ANSI if you have the choice, because when you save a file as UTF-8, Notepad adds the Unicode byte order mark (BOM) to the front of the file, which may make the file unreadable for parsers that only understand ASCII.
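
If you want to check whether your editor added a BOM, here is a minimal Python sketch (it assumes a file named robots.txt in the current directory):

  # Check whether the file starts with the UTF-8 byte order mark (EF BB BF).
  with open("robots.txt", "rb") as f:
      first_bytes = f.read(3)

  if first_bytes == b"\xef\xbb\xbf":
      print("File starts with a UTF-8 BOM")
  else:
      print("No UTF-8 BOM found")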

Roland Illig

I believe robots.txt "should" be UTF-8 encoded.

"The expected file format is plain text encoded in UTF-8. The file consists of records (lines) separated by CR, CR/LF or LF."

(from https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)

But Notepad and other programs will insert a 3-byte BOM (byte order mark) at the beginning of the file, which prevents Google from reading the first line (it shows an "invalid syntax" error).

Either remove the BOM or, much easier, add a line break at the top so that the first line of instructions ends up on line two.

The "invalid syntax" error caused by the BOM will then only affect the first line, which is now empty.

The rest of the lines will be read successfully.
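
If you prefer to strip the BOM instead, a small Python sketch (assuming the file is named robots.txt) can do it; the utf-8-sig codec drops a leading BOM if one is present:

  # Rewrite robots.txt without a leading UTF-8 BOM.
  with open("robots.txt", "r", encoding="utf-8-sig") as f:
      content = f.read()  # the BOM, if any, is consumed by the codec

  with open("robots.txt", "w", encoding="utf-8") as f:
      f.write(content)  # plain UTF-8 is written back, with no BOM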

Max

As for the encoding: @Roland already nailed it. The file should contain only URLs. Non-ASCII characters in URLs are illegal, so saving the file as ASCII should be just fine.

If you need to serve UTF-8 for some reason, make sure this is specified correctly in the content-type header of the text file. You will have to set this in your web server's settings.
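
To verify what your server actually sends, you can fetch the file and inspect the header, for example with Python's standard library (the domain below is the placeholder from the question):

  # Print the Content-Type header the server reports for robots.txt.
  from urllib.request import urlopen

  with urlopen("http://www.domain.se/robots.txt") as response:
      print(response.headers.get("Content-Type"))
      # e.g. "text/plain; charset=UTF-8" if the server is configured that way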

As to case sensitivity:

  • According to robotstxt.org, the robots.txt file needs to be lowercase:

    Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

  • The keywords are probably case-insensitive - I can't find a reference on that - but I would tend to do what everyone else does: use capitalized versions (Sitemap).

Pekka

I recommend encoding robots.txt either in UTF-8 without a BOM or in ASCII.

For URLs that contain non-ASCII characters, I suggest either using UTF-8, which is fine in most cases, or URL-encoding them so that every character is represented in ASCII.
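
As an illustration of the URL-encoding option, Python's urllib.parse can percent-encode a non-ASCII path (the path below is a made-up example):

  # Percent-encode non-ASCII characters so the URL is pure ASCII.
  from urllib.parse import quote

  path = "/sidor/björn"  # hypothetical path containing a non-ASCII character
  print(quote(path))     # -> /sidor/bj%C3%B6rn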

Take a look at Wikipedia's robots.txt file - it's UTF-8 encoded.


Ron Klein

I think you're overthinking this. I always use lowercase, just because it's easier.

You can view SO's robots.txt: https://stackoverflow.com/robots.txt

Robert
  • Ok, also: if I place a sitemap inside a directory on the server, can that sitemap contain URLs at higher levels, like the root? Or does the sitemap have to be at the top level? –  Sep 28 '10 at 20:34
  • @Camran that's an entirely separate question. I'd suggest asking it as one. – Pekka Sep 28 '10 at 20:35

I suggest you use ANSI, because if your robots.txt is saved as UTF-8, it will be marked as faulty in Google's Search Console due to the Unicode byte order mark added to its beginning (as mentioned by Roland Illig above).

dario

Most answers seem to be outdated. As of 2022, Google specifies the robots.txt format as follows (source):

File format

The robots.txt file must be a UTF-8 encoded plain text file and the lines must be separated by CR, CR/LF, or LF.

Google ignores invalid lines in robots.txt files, including the Unicode Byte Order Mark (BOM) at the beginning of the robots.txt file, and uses only valid lines. For example, if the content downloaded is HTML instead of robots.txt rules, Google will try to parse the content and extract rules, and ignore everything else.

Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.

Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content which is after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating directives that would result in an oversized robots.txt file. For example, place excluded material in a separate directory.
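
As a rough illustration of these constraints, here is a short Python sketch (the 500 KiB limit and the BOM handling are taken from the quote above; the checks themselves are just an example, not part of Google's docs):

  # Check a local robots.txt file against the constraints quoted above.
  MAX_SIZE = 500 * 1024  # 500 KiB

  with open("robots.txt", "rb") as f:
      raw = f.read()

  if len(raw) > MAX_SIZE:
      print("Over 500 KiB: content past the limit will be ignored")
  if raw.startswith(b"\xef\xbb\xbf"):
      print("Leading BOM found: Google ignores it, other parsers may not")
  try:
      raw.decode("utf-8")
  except UnicodeDecodeError:
      print("Not valid UTF-8: some characters may be ignored")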

TL;DR to answer the question:

  • You can use Notepad to save a robots.txt file. Just use UTF-8 encoding.
  • It may or may not contain a BOM; it will be ignored either way.
  • The file has to be named robots.txt exactly. No capital "R".
  • Field names are not case sensitive (source). Therefore, both sitemap and Sitemap are fine (see the sketch below).
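
For illustration, a case-insensitive comparison is all a parser needs for field names; a minimal Python sketch (the input line is just an example):

  # Parse one robots.txt line, comparing the field name case-insensitively.
  line = "SITEMAP: http://www.domain.se/sitemap.xml"

  field, _, value = line.partition(":")
  if field.strip().lower() == "sitemap":
      print("Sitemap URL:", value.strip())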

Keep in mind that robots.txt is just a de facto standard. There is no guarantee that any crawler will read this file the way Google proposes, nor is any crawler forced to respect the rules it defines.

stackprotector