
I have a 5 GB data file to load. fread seems to be a fast way to load it, but it gets all my column types wrong. It looks like the quotes are causing the problem.

# Code. I don't know how to include the raw CSV data here.
library(data.table)
dt  <- fread("data.csv", header = TRUE)      # data.table reader
dt2 <- read.csv("data.csv", header = TRUE)   # base R reader, for comparison
str(dt)
str(dt2)

The output (posted as two screenshots, not reproduced here) shows that every column read by fread is character, regardless of whether it holds numbers or text.

MLE

2 Answers


It's curious that fread didn't use numeric for the id column. Maybe some entries contain non-numeric values?

The documentation suggests using the colClasses parameter.

dt <- fread("data.csv", header = TRUE, colClasses = c("numeric", "character"))

The documentation has a warning for using this parameter:

A character vector of classes (named or unnamed), as read.csv. Or a named list of vectors of column names or numbers, see examples. colClasses in fread is intended for rare overrides, not for routine use. fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.
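As a rough illustration of the "promote" case described in that passage, here is a minimal sketch. The sample rows are made up, and it assumes a data.table version that supports the `text` argument of fread; it is not the asker's data.

```r
library(data.table)

# Hypothetical sample rows: every id looks like an integer.
csv_lines <- c("id,name", "1,alice", "2,bob", "3,carol")

# Default detection picks the lowest type that fits the column.
dt_default <- fread(text = csv_lines)
class(dt_default$id)

# A named colClasses entry promotes id to a higher type (character).
dt_promoted <- fread(text = csv_lines, colClasses = c(id = "character"))
class(dt_promoted$id)
```

Naming the column in colClasses avoids depending on column order, which an unnamed vector such as c("numeric", "character") does.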

zacdav
  • I tried the colClasses = c("integer", "character"). The str is still char for the id. – MLE Apr 27 '18 at 00:10
  • That's simply the printing method. – zacdav Apr 27 '18 at 00:25
  • The read.csv(base) correctly reads the data, even for the id. The fread treats every single variable as char – MLE Apr 27 '18 at 00:32
  • use base R and tell me the output of the following, `as.integer(dt2$id)`, are there any warnings. `data.table::fread` is usually pretty clever and can figure out column types. – zacdav Apr 27 '18 at 00:34
  • Warning message: NAs introduced by coercion to integer range – MLE Apr 27 '18 at 00:41
  • OUTPUT: [1000] NA [ reached getOption("max.print") -- omitted 498974 entries ] – MLE Apr 27 '18 at 00:41
  • looks like some entries have a character to me, run this to see: `length(grep("[[:alpha:]]", dt2$id))`. Actually lets try numeric instead of integer, try the updated post. – zacdav Apr 27 '18 at 00:46
  • I have run the dt2, there are some NAs. length(grep("[[:alpha:]]", dt2$id)) [1] 200 – MLE Apr 27 '18 at 00:50
  • try `colClasses = c("numeric", "character")` – zacdav Apr 27 '18 at 00:51
  • Still everything char – MLE Apr 27 '18 at 00:55
  • Can't help anymore then, I would manually convert after read-in at your own risk. It seems as if something is odd in the data. – zacdav Apr 27 '18 at 00:59
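One way to follow up on the checks in the comments above is to list the values that fail numeric conversion, rather than only counting letters. This is a sketch in base R with invented id values, not the asker's data. Unlike a grep for letters, it treats strings in scientific notation such as "1e+05" as numeric.

```r
# Hypothetical id values as they might look when the column is read as text.
id_raw <- c("101", "102", "A103", "1e+05", "10x5", NA)

# Values that are present but cannot be parsed as numbers.
id_num <- suppressWarnings(as.numeric(id_raw))
is_bad <- !is.na(id_raw) & is.na(id_num)

which(is_bad)    # positions of the offending entries
id_raw[is_bad]   # the offending values themselves
```

Inspecting the offending values directly usually makes it clearer whether they are typos, placeholders, or identifiers that should stay as text.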

It looks as if the fread command will detect the type in a particular column and then assign the lowest type it can to that column based on what the column contains. From the fread documentation:

A sample of 1,000 rows is used to determine column types (100 rows from 10 points). The lowest type for each column is chosen from the ordered list: logical, integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a higher type in rows outside the sample. In that case, the column types are bumped mid read and the data read on previous rows is coerced.

This means that if you have a column with mostly numeric type values it might assign the column as numeric, but then if it finds any character type values later on it will coerce anything read up to that point to character type.
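A small sketch of the end result, using made-up rows and assuming a data.table version that supports fread's `text` argument. The input is far too small to show a mid-read bump, since every row falls inside the sample, but the resulting column type is the same: one non-numeric entry makes the whole column character.

```r
library(data.table)

# Hypothetical rows: the id column is numeric except for one entry.
clean_lines <- c("id,name", "1,alice", "2,bob", "3,carol")
mixed_lines <- c("id,name", "1,alice", "2,bob", "x3,carol")

dt_clean <- fread(text = clean_lines)
dt_mixed <- fread(text = mixed_lines)

class(dt_clean$id)   # lowest type that fits all values
class(dt_mixed$id)   # character is the only type that fits "x3"
```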

You can read about these type conversions here, but the long and short of it seems to be that trying to convert a character column to numeric for values that are not numeric will result in those values being converted to NA, or a double might be converted to an integer, leading to a loss of precision.

You might be okay with this loss of precision, but fread will not allow you to do this conversion using colClasses. You might want to go in and remove non-numeric values yourself.
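If you do decide to clean the column yourself, one possible approach is sketched below. The rows and the `id_num` column name are invented for illustration, and whether dropping rows is acceptable depends on your data.

```r
library(data.table)

# Hypothetical rows with one non-numeric id, so fread reads id as character.
dt <- fread(text = c("id,name", "1,alice", "2,bob", "x3,carol", "4,dave"))

# Convert into a new column so the original text stays available for inspection.
# Non-numeric entries become NA; the coercion warning is suppressed on purpose.
dt[, id_num := suppressWarnings(as.numeric(id))]

# Keep only the rows whose id converted cleanly.
dt_clean <- dt[!is.na(id_num)]
```

Keeping the original id column alongside the converted one makes it easy to review what was lost before discarding anything.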

Mihai Chelaru