14

I saved an Excel table as text (*.txt). Unfortunately, Excel doesn't let me choose the encoding, so I need to open the file in Notepad (which opens it as ANSI) and save it as UTF-8. Then, when I read it in R:

data <- read.csv("my_file.txt",header=TRUE,sep="\t",encoding="UTF-8")

it shows the name of the first column beginning with "X.U.FEFF.". I know these are the bytes used to tell a program that the file is in UTF-8 format, so they shouldn't appear as text! Is this a bug? Or am I missing some option? Thanks in advance!

smci
Rodrigo
    try it with the `read.csv` argument `check.names=FALSE`. Note that if you use this, you will not be able to directly reference columns with the `$` notation. – Matthew Plourde Nov 12 '13 at 18:11
    UTF-8 files are **not** supposed to contain a byte order mark, see [RFC 3629](http://www.ietf.org/rfc/rfc3629.txt) for explanation. – zwol Nov 12 '13 at 18:17
  • Thanks @Matthew. It works partially. The X.U.FEFF is gone, but I can't refer to the first column by name anymore (the others still work, though). I still think this is a bug to be solved in future versions of R. – Rodrigo Nov 12 '13 at 18:44
  • You can refer to them by name if you put them in quotes, e.g., `yourdf$"first col"` – Matthew Plourde Nov 12 '13 at 18:45
  • @Zack, I've seen some UTF-8 files with these first bytes, so I thought it was a rule. Not a big problem, as I can always rename the first column, just think it should be solved someday. – Rodrigo Nov 12 '13 at 18:45
  • @Matthew, this second trick didn't work here. – Rodrigo Nov 12 '13 at 18:47
  • I found a solution at https://stackoverflow.com/questions/24568056/rs-read-csv-prepending-1st-column-name-with-junk-text/24568505 – mqpasta Jan 19 '21 at 10:48
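The symptom is easy to reproduce without Excel or Notepad. A minimal sketch, using a temp file in place of `my_file.txt` (the column name `COLECAO` is taken from the hex dump given later in this thread; `VALOR` and the data row are made up for the demo):

```r
# Write a small tab-separated file that starts with the EF BB BF
# byte order mark, mimicking what Notepad's "UTF-8" save produces.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)           # the UTF-8 BOM
writeBin(charToRaw("COLECAO\tVALOR\nabc\t1\n"), con) # header + one data row
close(con)

# Reading it the same way as in the question: the BOM character leaks
# into the first column name, which make.names() then mangles
# (e.g. "X.U.FEFF.COLECAO", the exact form depending on your locale).
data <- read.csv(tmp, header = TRUE, sep = "\t", encoding = "UTF-8")
names(data)[1]
```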

4 Answers

17

So I was going to give you instructions on how to manually open the file and check for and discard the BOM, but then I noticed this (in ?file):

As from R 3.0.0 the encoding "UTF-8-BOM" is accepted and will remove a Byte Order Mark if present (which it often is for files and webpages generated by Microsoft applications).

which means that if you have a sufficiently new R interpreter,

read.csv("my_file.txt", fileEncoding="UTF-8-BOM", ...other args...)

should do what you want.

zwol
    hmmmm almost there. Now the "X.U.FEFF." became "ï.." – Rodrigo Nov 12 '13 at 19:23
    That looks like the file isn't actually UTF-8. Is there any way you can show us a hex dump of the first line of the file? (On most Unix systems, `head -1 my_file.txt | hexdump -C` will get you a nice hex dump, but I have no idea about a Windows equivalent.) – zwol Nov 12 '13 at 20:24
  • In DOS Prompt, debug does this. The first three bytes are EF BB BF. (I saved the file in Notepad 5.1 build 2600, Windows XP SP3, and it says the format is UTF-8). The rest of the line is the ASCII for the column names. – Rodrigo Nov 12 '13 at 20:37
  • I need to see the dump for the entire line (or at least the entire first field, i.e. up to and including the first `09`), not just the first three bytes. – zwol Nov 12 '13 at 20:52
    EF BB BF 43 4F 4C 45 43 41 4F 09 – Rodrigo Nov 13 '13 at 14:54
  • Huh. After stripping the BOM, the first field is all ASCII uppercase letters, which should go into a data frame colname just fine. Do you in fact have R 3.x? This is starting to look like a bug in the interpreter. – zwol Nov 13 '13 at 15:10
    Yes, I have R 3.0.1. I downloaded Notepad++, and it gives me the option to save with and without the BOM. It seems R just can't handle the BOM. – Rodrigo Nov 13 '13 at 15:24
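If `fileEncoding="UTF-8-BOM"` is unavailable (R older than 3.0.0) or misbehaves as in the comments above, one locale-independent workaround is to strip the three BOM bytes by hand before parsing. A sketch, not from the thread itself; the helper name `read_csv_no_bom` is made up:

```r
# Read the whole file as raw bytes, drop a leading EF BB BF if present,
# then hand the cleaned text to read.csv via its text= argument.
read_csv_no_bom <- function(path, ...) {
  bytes <- readBin(path, what = "raw", n = file.info(path)$size)
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom))
    bytes <- bytes[-(1:3)]
  read.csv(text = rawToChar(bytes), ...)
}

# Usage with the file from the question:
# data <- read_csv_no_bom("my_file.txt", header = TRUE, sep = "\t")
```

Because the BOM is removed at the byte level, this sidesteps any encoding-argument quirks in `read.csv` itself.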
4

Most of the arguments in read.csv are dummy args -- including fileEncoding.

Use read.table instead:

 read.table("my_file.txt", header=TRUE, sep="\t", fileEncoding="UTF-8")
Ricardo Saporta
With read.table I get an error: "Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 9191 did not have 25 elements". My read command is actually more complicated; it is: data <- read.table("my_file.txt",header=TRUE,sep="\t",stringsAsFactors=FALSE,strip.white=TRUE,encoding="UTF-8",quote="") – Rodrigo Nov 12 '13 at 18:49
  • great!! Then it worked. Now you just need to clean up your source file ;) Open it up in a plain text editor (I like sublime text 3), get down to line 9191 and inspect it – Ricardo Saporta Nov 12 '13 at 20:11
  • Thanks, @Ricardo. I only needed the comment.char="". But now it behaves exactly the same as read.csv... :( – Rodrigo Nov 12 '13 at 20:33
1

I had the same issue loading a CSV file, whether using read.csv (with fileEncoding="UTF-8-BOM"), read.table, or read_csv from the readr package. None of these attempts proved successful.

I definitely could not work with the BOM tag in place, because when subsetting my data (using either subset() or df[df$var=="value",]), the first row was not taken into account.

I finally found a workaround that makes the BOM tag vanish: using the read.csv function, I simply supplied a character vector of column names in the col.names = ... argument. This works like a charm and I can subset my data without issues.

I use R version 3.5.0.
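A sketch of this workaround; the column names `COLECAO`/`VALOR` are assumptions standing in for the real header, and a temp file stands in for the real data. Note that when you supply col.names for a file that still contains a header row, you also need header = FALSE and skip = 1, otherwise the header line is read in as data:

```r
# Demo file standing in for my_file.txt: a BOM, a header row, one data row.
f <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("COLECAO\tVALOR\nabc\t1\n")), f)

# Skip the BOM-tainted header line entirely and supply clean names ourselves.
data <- read.csv(f, header = FALSE, skip = 1, sep = "\t",
                 col.names = c("COLECAO", "VALOR"))
names(data)   # clean names, no BOM anywhere
```

Since the BOM sits on the skipped line, it never reaches the parser at all.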

Mesozoik
0

Possible solution from the comments:

Try it with the read.csv argument check.names=FALSE. Note that if you use this, you will not be able to directly reference columns with the $ notation, unless you surround the name in quotes. For instance: yourdf$"first col".
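A sketch of this approach, with a temp file and the column names `COLECAO`/`VALOR` as stand-ins for the question's real file. With check.names=FALSE the invisible BOM character remains part of the first name, so plain $ access on it fails unless the BOM is spelled out:

```r
# Demo file standing in for my_file.txt: a BOM, a header row, one data row.
f <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("COLECAO\tVALOR\nabc\t1\n")), f)

# check.names = FALSE stops make.names() from mangling the header,
# but the BOM character is still the first character of the first name.
data <- read.csv(f, header = TRUE, sep = "\t",
                 fileEncoding = "UTF-8", check.names = FALSE)
data$VALOR             # other columns work normally with $
data$"\ufeffCOLECAO"   # the first one needs its BOM spelled out (quoted)
```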

Matthew Plourde