2

I load a data table with fread the way I always do. The files has ~2M records and is tab delimited.

The load is successful. I can print the head of the table and the column names, so far so good.

But then either changing the name of the first column or setting it as a key fails complaining it cannot find the old column name. I am sure there is no typo in the column name, no heading or trailing space, I tried many times with copy/paste and retyping. I can change the name of apparently any other column.

The first column is long integer id's, so I had to load the bit64 package to get rid of a warning in 'fread', but it did not seem to help. Is it a clue?

Does anyone have any idea what could cause such a symptom? How to debug?

I use R 3.1.0 on Windows 64, latest version of all packages.

Edit: more details

The data load command:

txnData <- fread(txnInDataPathFileName, header=TRUE, sep="\t", na.strings="NA")

The column names:

colnames(txnData)
 [1] "txn_ext_id"  "txn_desc"       "txn_type_id"    "site_id"        "date_id"        "device_id"      "cust_id"       
 [8] "empl_id"        "txn_start_time" "txn_end_time"   "total_sales"    "total_units"    "gross_margin"

The rename column that fails (and so does setkey):

setnames(txnData, "txn_ext_id", "txnId")
Error in setnames(txnData, "txn_ext_id", "txnId") : 
  Items of 'old' not found in column names: txn_ext_id

And finally the requested dput command:

dput(head(txnData))
structure(list(`txn_ext_id` = structure(c(4.88536962440272e-311, 
1.10971996159584e-311, 9.9460266389845e-312, 1.0227644072435e-311, 
1.10329710699982e-311, 1.01930594588518e-311), class = "integer64"), 
    txn_desc = c("checkout transaction", "checkout transaction", 
    "checkout transaction", "checkout transaction", "checkout transaction", 
    "checkout transaction"), txn_type_id = c(0L, 0L, 0L, 0L, 
    0L, 0L), site_id = c(982L, 982L, 982L, 982L, 982L, 982L), 
    date_id = c("2012-12-24", "2013-11-27", "2013-04-08", "2013-06-04", 
    "2013-11-14", "2013-05-28"), device_id = c(8L, 7L, 8L, 53L, 
    8L, 5L), cust_id = structure(c(2.02600292130833e-313, 2.02572944866119e-313, 
    2.02583815970388e-313, 2.02580527009968e-313, 2.02568405005593e-313, 
    2.02736582767668e-313), class = "integer64"), empl_id = c("?", 
    "?", "?", "?", "?", "?"), txn_start_time = c("2012-12-24T08:35:56", 
    "2013-11-27T12:43:30", "2013-04-08T11:48:29", "2013-06-04T15:27:47", 
    "2013-11-14T12:57:38", "2013-05-28T11:03:21"), txn_end_time = c("2012-12-24T08:38:00", 
    "2013-11-27T12:47:00", "2013-04-08T11:49:00", "2013-06-04T15:35:00", 
    "2013-11-14T13:00:00", "2013-05-28T11:05:00"), total_sales = c(48.86, 
    69.7, 8.53, 33.46, 39.19, 35.56), total_units = c(12L, 44L, 
    3L, 4L, 14L, 17L), gross_margin = c(0, 0, 0, 0, 0, 0)), .Names = c("txn_ext_id", 
"txn_desc", "txn_type_id", "site_id", "date_id", "device_id", 
"cust_id", "empl_id", "txn_start_time", "txn_end_time", "total_sales", 
"total_units", "gross_margin"), class = c("data.table", "data.frame"
), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x00000000002c0788>)
Patrick
  • 1,561
  • 2
  • 11
  • 22
  • check for any trailing / leading spaces in the column names. – CHP Apr 22 '14 at 02:02
  • No heading and trailing spaces. I had tried retyping the column name and resaving the file, it did not help either. – Patrick Apr 22 '14 at 02:06
  • 2
    Perhaps you can add `dput(head(yourdata))` output to your question. – CHP Apr 22 '14 at 02:19
  • @ChinmayPatil Added the `dput(head(txndata))` & more info. Please let me know if you find any clue, I am puzzled! Thanks. – Patrick Apr 22 '14 at 15:43
  • Works fine for me when I use your data. :-/ Can you try it with R 3.0.3 instead? I think R 3.1.0 introduced some changes which broke something in data.table – CHP Apr 22 '14 at 15:55
  • @ChinmayPatil I tried R 3.0.0, 3.0.3, 3.1.0 all with their latest data.table & dependent packages for their corresponding version, all have the exact same error! File is encoded as UTF-8. This is driving me totally banana, particularly if you cannot reproduce! What else could be going wrong? – Patrick Apr 22 '14 at 16:51
  • A clue: I added a bogus column as the first one and it worked. When heading the data, the first column name showed a weird sign as a prefix that I don't see in my editor (notepad++). So I think the problem has to do with that. Now remains to figure out where it comes from and how to get rid of it! – Patrick Apr 22 '14 at 16:59
  • Perhaps - some non printable character in the string then. I have seen it messing things up before but I am forgetting now. – CHP Apr 22 '14 at 17:45
  • @ChinmayPatil Yes, the hidden character was the BOM. See my answer for details. Thanks for spending the time to help me investigate. – Patrick Apr 22 '14 at 17:59

1 Answers1

3

The hidden character was the Byte Order Mark (BOM) displayed as  when you get a chance of seeing it. You can in principle see it in editors set in ANSI display mode -- well I could not in Notepad++! In R, printing the head of the data table does show it as well using RStudio but it does not show it using Eclipse StatET that I use by default, explaining why I did not notice it immediately.

See the following links re. how to get rid of the BOM character: SO1, SO2, yiiframework.

I loaded my file in Notepad++, Encoding -> Convert to UTF-8 without BOM, saved, and this BOM character disappeared, all went fine.

A pure R solution to this problem without touching the file is to include the BOM character as the prefix in the rename command: setnames(dataTable, "firstColumnName", "firstColumnName"). This worked in RStudio and I suppose would work in R console as well. However, it does not work in Eclipse-StatET as the BOM character remains hidden while messing up data table accesses: the first column is not accessible with or without the BOM prefix in name, the setnames fail either way.

Community
  • 1
  • 1
Patrick
  • 1,561
  • 2
  • 11
  • 22