
My data set testdata has 2 variables named PWGTP and AGEP

The data are in a .csv file.

When I do:

> head(testdata)

The variables show up as

    ï..PWGTP AGEP
          23   55
          26   56
          24   45
          22   51
          25   54
          23   35

So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.

HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:

Error: id variables not found in data: ï..PWGTP

Similarly, when I use some function to refer to the variable PWGTP, I get the message:

Error: id variables not found in data: PWGTP

2 Questions:

  1. Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?

  2. It should be trivial to rename ï..PWGTP to something else -- but R is unable to find a variable named as such. Your thoughts on how one should try to repair the variable name?

thanks_in_advance
    If you know how many columns you are reading and the order of names, you can just use `names(testdata) <- c("PWGTP", "AGEP", ...)` – Tim Biegeleisen Jun 14 '16 at 04:10
    Looks to me like a possible encoding issue... would your input file be UTF-8 with BOM? – Dominic Comtois Jun 14 '16 at 04:31
  • @DominicComtois It is probably a `.csv` encoding issue. I have a larger data set where the variable names show up fine. I created `testdata` by copying and pasting the first few hundred rows (and the header row) of the larger data set. Something went wrong during that process. On examining `testdata` in a text editor or in `Excel`, however, it seems normal. So I was curious to find a fix in case this happens in a serious situation in future. – thanks_in_advance Jun 14 '16 at 04:35
  • perhaps http://stackoverflow.com/questions/16838613/cannot-read-unicode-csv-into-r ? – leerssej Jun 14 '16 at 04:37
    I've reproduced it using a file encoded with UTF-8 with BOM... Using `fileEncoding = "UTF-8-BOM"` in the `read.table` should resolve the issue if you run into it again. – Dominic Comtois Jun 14 '16 at 04:39
    @TimBiegeleisen I was able to fix it by doing `names(testdata)[1] <- "PWGTP"` thanks to your suggestion – thanks_in_advance Jun 14 '16 at 04:40
  • @DominicComtois That worked, thanks so much! If you post in the answer area I'll mark it as correct since this was the simplest correct answer. (I also was able to solve it another way, have a look at the previous comment, although that solution isn't as clean and elegant as yours). – thanks_in_advance Jun 14 '16 at 04:49
  • @user1883050 Excellent, I've added it as an answer. :) – Dominic Comtois Jun 14 '16 at 05:43
  • Show the exact `read.csv/read.table` command you used, since that's what's causing the mangling. – smci May 19 '18 at 02:14

2 Answers


This is a BOM (Byte Order Mark) UTF-8 issue.

To prevent this from happening, 2 options:

  1. Save your file as UTF-8 without BOM / signature -- or --
  2. Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv

Example:

mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
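If you want to confirm the diagnosis before re-reading the file, you can check whether it starts with the three UTF-8 BOM bytes (EF BB BF). A minimal base-R sketch (the filename is just a placeholder):

```r
# Read the first three raw bytes of the file and compare them to the
# UTF-8 byte order mark (EF BB BF).
first_bytes <- readBin("myfile.txt", what = "raw", n = 3L)
has_bom <- identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
has_bom  # TRUE if the file carries a BOM
```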

Dominic Comtois

It is possible that the column name in the file is something like 1 PWGTP, i.e. with a space between a number (or some other character) and the name; such characters get converted to . when reading into R. One way to prevent this is to use check.names = FALSE in read.csv/read.table:

d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)

However, it is better not to have a name that starts with a number or contains spaces.

So, if the OP read the data with the default options, i.e. with check.names = TRUE, we can use sub to strip the mangled prefix from the column names:

names(d1) <- sub(".*\\.+", "", names(d1))

As an example

sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"
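For context on where the dots come from: with the default check.names = TRUE, read.csv passes the header through make.names, which replaces characters that are not valid in an R name with dots. If the BOM bytes are decoded as latin1, the header becomes "ï»¿PWGTP", and make.names then dots out the punctuation. A sketch (the exact result is locale-dependent, so no output is shown):

```r
# The BOM bytes EF BB BF, decoded as latin1, show up as "ï»¿" in front of
# the first column name; make.names() then replaces the characters that
# are invalid in an R name with dots.
make.names("\u00EF\u00BB\u00BFPWGTP")
```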
akrun