Windows 8.1, R version 3.1.1 (2014-07-10), System x86_64, mingw32

I've got a file with a lot of observations. Here are some lines from the file:

Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
28/4/2007;00:20:00;0.492;0.208;236.240;2.200;0.000;0.000;0.000
28/4/2007;00:21:00;?;?;?;?;?;?;
21/12/2006;11:25:00;0.246;0.000;241.740;1.000;0.000;0.000;0.000
21/12/2006;11:26:00;0.246;0.000;241.830;1.000;0.000;0.000;0.000

The NA values are represented by "?". I'm trying to read the file with

library(data.table)

epcData <- fread(dataFile,
                 sep = ";",
                 header = TRUE,
                 na.strings = "?",
                 colClasses = c("character", "character", rep("numeric", 7)),
                 stringsAsFactors = FALSE)

I've got warnings like:

Bumped column 3 to type character on data row 10, field contains '?'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

Row 10 is

   28/4/2007;00:21:00;?;?;?;?;?;?;

epcData[10]

prints

         Date     Time Global_active_power Global_reactive_power Voltage
1: 28/4/2007 00:21:00                  NA                    NA      NA
   Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1:               NA             NA             NA             NA

But the mode of every column is "character", including columns 3:9, even though I specified colClasses = c("character", "character", rep("numeric", 7)).

What is going wrong?

nodm
  • What OS are you using? – Mike.Gahan Sep 08 '14 at 13:10
  • If it is Linux or OSX, it might be worth using `fread("sed 's/?/NA/g' yourcsv.csv")` to find and replace the question marks before fread starts reading the data at all. – Mike.Gahan Sep 08 '14 at 13:16
  • Sorry! Windows 8.1, R version 3.1.1 (2014-07-10), System x86_64, mingw32 – nodm Sep 08 '14 at 18:01
  • Thanks Mike.Gahan! I have some ideas to solve the problem. But it's very interesting to me what's wrong with my code. – nodm Sep 08 '14 at 18:07
  • IMHO the problem is related to the 'na.strings' parameter. I have tried with 'epcData <- fread(dataFile)' and got the same warning. – nodm Sep 08 '14 at 18:25
  • Perhaps it is using `?` as a regex special character. Does `\\?` work? – Mike.Gahan Sep 08 '14 at 18:27
  • No. I've got the same warning. – nodm Sep 08 '14 at 19:03
  • Just read the documentation; `na.strings` seems to only work for string vectors. Not a big deal. Just convert with `as.numeric` after you read in the data. – Mike.Gahan Sep 08 '14 at 19:25
  • @Mike.Gahan `read.table(dataFile, header = TRUE, sep = ";", na.strings = "?", colClasses = c("character", "character", rep("numeric", 7)), stringsAsFactors = FALSE)`: `read.table` **with the same parameters** works fine but is very, very slow. – nodm Sep 10 '14 at 19:17
  • Here is a [link](http://stackoverflow.com/questions/15784138/bad-interpretation-of-n-a-using-fread) to a question like mine, but there is no answer to that question either. – nodm Sep 10 '14 at 19:30
  • And [another link](http://stackoverflow.com/questions/22331552/read-in-certain-numbers-as-na-in-r-with-data-tablefread). It seems to be a bug. – nodm Sep 10 '14 at 19:34
  • On the bright side, that link seems to suggest that `data.table` maintainers @Arun and @MattDowle are looking to improve this. – Mike.Gahan Sep 10 '14 at 20:56
  • @DataNoob, did you ever figure this out? I have the same problem on the same Coursera assignment and I found this. Did you just abandon data tables altogether? I've wasted 2 hours on this! – Parseltongue May 05 '15 at 16:16
  • No, I didn't figure this out. – nodm May 07 '15 at 20:03
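
The workaround suggested in the comments (read everything as character, then convert with `as.numeric`) could look roughly like the sketch below. It is only an illustration, not the asker's code: `csv_text` and `num_cols` are made-up names, the two data rows are copied from the question, and `text =` (available in recent data.table releases) is used only to keep the example self-contained. With a real file you would pass the path, as in the question.

```r
library(data.table)

# Two rows copied from the question; the second one uses "?" for missing
# values and has an empty last field. csv_text is a hypothetical name.
csv_text <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
28/4/2007;00:21:00;?;?;?;?;?;?;"

# Read every column as character so fread never has to change a column type
epcData <- fread(text = csv_text,
                 sep = ";",
                 header = TRUE,
                 colClasses = "character")

# Convert the measurement columns in place, mapping "?" and "" to NA first
num_cols <- names(epcData)[3:9]
epcData[, (num_cols) := lapply(.SD, function(x) {
  x[x %in% c("?", "")] <- NA_character_
  as.numeric(x)
}), .SDcols = num_cols]

print(sapply(epcData, class))
```

Replacing the markers before calling `as.numeric` avoids the "NAs introduced by coercion" warning that a bare conversion of "?" would raise.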

1 Answer

As of today, with version 1.12.2 of the data.table package, this is no longer an issue: the import of the above CSV data works flawlessly and all the question marks are replaced by NAs.
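
To check this on your own installation, something like the following sketch should do. It assumes data.table 1.12.2 or later; `csv_text` is a hypothetical name holding sample rows copied from the question, and `text =` stands in for the real file path only so that the example is self-contained.

```r
library(data.table)

# Sample rows copied from the question; the fourth data row uses "?" for
# missing values and has an empty last field. csv_text is a hypothetical name.
csv_text <- "Date;Time;Global_active_power;Global_reactive_power;Voltage;Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3
16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000
16/12/2006;17:25:00;5.360;0.436;233.630;23.000;0.000;1.000;16.000
28/4/2007;00:20:00;0.492;0.208;236.240;2.200;0.000;0.000;0.000
28/4/2007;00:21:00;?;?;?;?;?;?;
21/12/2006;11:25:00;0.246;0.000;241.740;1.000;0.000;0.000;0.000"

# Same arguments as in the question, minus stringsAsFactors
epcData <- fread(text = csv_text,
                 sep = ";",
                 header = TRUE,
                 na.strings = "?",
                 colClasses = c("character", "character", rep("numeric", 7)))

# Inspect the column classes and the row that contained "?"
print(sapply(epcData, class))
print(epcData[4])
```

If columns 3:9 still come back as character, checking `packageVersion("data.table")` is a reasonable first step.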

hannes101