
My data set testdata has 2 variables named PWGTP and AGEP

The data are in a .csv file.

When I do:

> head(testdata)

The variables show up as

    ï..PWGTP AGEP
          23   55
          26   56
          24   45
          22   51
          25   54
          23   35

So, for some reason, R is reading PWGTP as ï..PWGTP. No biggie.

HOWEVER, when I use some function to refer to the variable ï..PWGTP, I get the message:

Error: id variables not found in data: ï..PWGTP

Similarly, when I use some function to refer to the variable PWGTP, I get the message:

Error: id variables not found in data: PWGTP

2 Questions:

  1. Is there anything I should be doing to the source file to prevent mangling of the variable name PWGTP?

  2. It should be trivial to rename ï..PWGTP to something else -- but R is unable to find a variable named as such. Your thoughts on how one should try to repair the variable name?

thanks_in_advance
    If you know how many columns you are reading and the order of names, you can just use `names(testdata) <- c("PWGTP", "AGEP", ...)` – Tim Biegeleisen Jun 14 '16 at 04:10
    Looks to me like a possible encoding issue... would your input file be UTF-8 with BOM? – Dominic Comtois Jun 14 '16 at 04:31
  • @DominicComtois It is probably a `.csv` encoding issue. I have a larger data set where the variable names show up fine. I created `testdata` by copying and pasting the first few hundred rows (and the header row) of the larger data set. Something went wrong during that process. On examining `testdata` in a text editor or in `Excel`, however, it seems normal. So I was curious to find a fix in case this happens in a serious situation in future. – thanks_in_advance Jun 14 '16 at 04:35
  • perhaps http://stackoverflow.com/questions/16838613/cannot-read-unicode-csv-into-r ? – leerssej Jun 14 '16 at 04:37
    I've reproduced it using a file encoded with UTF-8 with BOM... Using `fileEncoding = "UTF-8-BOM"` in the `read.table` should resolve the issue if you run into it again. – Dominic Comtois Jun 14 '16 at 04:39
    @TimBiegeleisen I was able to fix it by doing `names(testdata)[1] <- "PWGTP"` thanks to your suggestion – thanks_in_advance Jun 14 '16 at 04:40
  • @DominicComtois That worked, thanks so much! If you post in the answer area I'll mark it as correct since this was the simplest correct answer. (I also was able to solve it another way, have a look at the previous comment, although that solution isn't as clean and elegant as yours). – thanks_in_advance Jun 14 '16 at 04:49
  • @user1883050 Excellent, I've added it as an answer. :) – Dominic Comtois Jun 14 '16 at 05:43
  • Show the exact `read.csv/read.table` command you used, since that's what's causing the mangling. – smci May 19 '18 at 02:14

2 Answers


This is a BOM (Byte Order Mark) UTF-8 issue.

To prevent this from happening, 2 options:

  1. Save your file as UTF-8 without BOM / signature -- or --
  2. Use fileEncoding = "UTF-8-BOM" when using read.table or read.csv

Example:

mydata <- read.table(file = "myfile.txt", fileEncoding = "UTF-8-BOM")
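If you want to confirm the diagnosis before re-reading the file, you can check whether it starts with the three UTF-8 BOM bytes (EF BB BF). A minimal base-R sketch (the filename is just a placeholder):

```r
# Read the first three raw bytes of the file and compare them to the
# UTF-8 byte order mark (EF BB BF).
first_bytes <- readBin("myfile.txt", what = "raw", n = 3L)
has_bom <- identical(first_bytes, as.raw(c(0xEF, 0xBB, 0xBF)))
has_bom  # TRUE if the file carries a BOM
```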

Dominic Comtois

It is possible that the column name in the file is something like 1 PWGTP, i.e. with a space between a number (or some other character) and the name; such characters get converted to . when reading into R. One way to prevent this is to use check.names = FALSE in read.csv/read.table:

d1 <- read.csv("yourfile.csv", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)

However, it is better not to have a name that starts with a number or contains spaces.

So, if the OP read the data with the default options, i.e. with check.names = TRUE, we can use sub to strip the mangled prefix from the column names:

names(d1) <- sub(".*\\.+", "", names(d1))

As an example

sub(".*\\.+", "", "ï..PWGTP")
#[1] "PWGTP"
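For context on where the dots come from: with the default check.names = TRUE, read.csv passes the header through make.names, which replaces characters that are not valid in an R name with dots. If the BOM bytes are decoded as latin1, the header becomes "ï»¿PWGTP", and make.names then dots out the punctuation. A sketch (the exact result is locale-dependent, so no output is shown):

```r
# The BOM bytes EF BB BF, decoded as latin1, show up as "ï»¿" in front of
# the first column name; make.names() then replaces the characters that
# are invalid in an R name with dots.
make.names("\u00EF\u00BB\u00BFPWGTP")
```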
akrun