
What does the 'rb:bom|utf-8' mean in:

CSV.open(csv_name, 'rb:bom|utf-8', headers: true, return_headers: true) do |csv|

I can understand that:

  1. r means read
  2. bom is a file format with \xEF\xBB\xBF at the start of a file to indicate endianness.
  3. utf-8 is a file format

But:

  1. I don't know how they fit together, or why all of this is necessary just to read a CSV file
  2. I'm struggling to find the documentation for this. It doesn't seem to be documented in
    https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html

Update:

Found some very useful documentation: https://ruby-doc.org/core-2.6.3/IO.html#method-c-new-label-Open+Mode

Henry Yang
  • See @matt's answer [here](https://stackoverflow.com/questions/20717902/parsing-a-csv-file-using-different-encodings-and-libraries). – Cary Swoveland Jul 15 '19 at 02:36

2 Answers


(The accepted answer is not incorrect, but it is incomplete.)

Translated into a human-readable sentence, rb:bom|utf-8 means:

Open the file for reading (r) in binary mode (b) and look for a Unicode BOM (bom) to detect the encoding; if a BOM is found, it is stripped from the data, and if no BOM is found, assume UTF-8 encoding (utf-8).

A BOM (byte order mark) can be used to detect whether a file is UTF-8 or UTF-16 and, if it is UTF-16, whether it is little-endian or big-endian. There is also a BOM for UTF-32. A BOM is just a special reserved byte sequence in the Unicode standard, used only to signal the encoding of a file, and it must be the first "character" of that file. It is recommended and typically used for UTF-16, since that encoding exists in two byte-order variants; it is optional for UTF-8, and a Unicode file without a BOM is usually assumed to be UTF-8.
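To make the mode string concrete, here is a minimal sketch that writes three small scratch files (UTF-8 with a BOM, UTF-16LE with a BOM, and no BOM at all) and opens each with 'rb:bom|utf-8'. The helper names write_demo_file and read_bom_aware, the DEMO_DIR constant, and the file names are made up for this illustration; only File.open, IO#external_encoding and String#encode are Ruby's own.

```ruby
require 'tmpdir'

# Scratch directory for the demo files (hypothetical name, illustration only).
DEMO_DIR = Dir.mktmpdir('bom-demo')

# Hypothetical helper: writes raw bytes to a scratch file and returns its path.
def write_demo_file(name, bytes)
  path = File.join(DEMO_DIR, name)
  File.binwrite(path, bytes)
  path
end

# Hypothetical helper: opens the file with the mode string from the question and
# returns the encoding Ruby settled on plus the content transcoded to UTF-8.
def read_bom_aware(path)
  File.open(path, 'rb:bom|utf-8') do |f|
    text = f.read
    [f.external_encoding, text.encode('UTF-8')]
  end
end

utf8_bom  = write_demo_file('utf8_bom.txt', "\xEF\xBB\xBF".b + 'name'.b)
utf16_bom = write_demo_file('utf16le_bom.txt', "\xFF\xFE".b + 'name'.encode('UTF-16LE').b)
no_bom    = write_demo_file('no_bom.txt', 'name'.b)

# Each file should come back as the same four characters, with no BOM in the text.
[utf8_bom, utf16_bom, no_bom].each do |path|
  encoding, text = read_bom_aware(path)
  puts "#{File.basename(path)}: #{encoding} #{text.inspect}"
end

# For contrast: a plain 'r:utf-8' read keeps the BOM as the first character (U+FEFF).
puts File.read(utf8_bom, mode: 'r:utf-8').start_with?("\uFEFF")
```

The UTF-16LE case is also why the b matters here: the IO documentation notes that UTF-16 encodings require the file to be opened in binary mode.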

Mecki
  • Could you please provide a reference? I could not find any official documentation explaining the above. I am unsure why reading as `binary` instead of `text` works in this case. – rellampec Sep 10 '20 at 14:44
  • @rellampec It's all documented on the Ruby IO documentation page https://docs.ruby-lang.org/en/2.1.0/IO.html Look for "Open Mode" and keep reading. – Mecki Sep 10 '20 at 14:53

When you read a text file in Ruby, you need to specify the encoding; otherwise Ruby falls back to the default external encoding, which might be wrong for that file.

If you're reading CSV files that start with a BOM, then this is how you need to open them.

A plain UTF-8 read doesn't skip the BOM, so it ends up as a stray character at the start of the first field. The BOM has to be read and skipped before the rest of the data is treated as UTF-8. That notation is how Ruby expresses that requirement.
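As a rough illustration of why this matters for CSV in particular, the sketch below writes a small CSV file that starts with a UTF-8 BOM and compares the headers seen with an explicit 'r:utf-8' mode against those seen with 'rb:bom|utf-8'. The csv_headers helper, the file name and the sample data are invented for this example.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: returns the header row as parsed with the given open mode.
def csv_headers(path, mode)
  CSV.open(path, mode, headers: true) { |csv| csv.read.headers }
end

# Sample file (made-up data): a UTF-8 BOM followed by two CSV lines.
path = File.join(Dir.mktmpdir('csv-bom-demo'), 'people.csv')
File.binwrite(path, "\xEF\xBB\xBF".b + "name,age\nAda,36\n".b)

# With an explicit plain UTF-8 mode the BOM stays glued to the first header,
# so a lookup such as row['name'] would not find the column.
p csv_headers(path, 'r:utf-8')

# With the mode from the question the BOM is consumed before parsing starts.
p csv_headers(path, 'rb:bom|utf-8')
```

The first call should show the first header with a leading "\uFEFF", while the second should show it as a plain "name".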

tadman