16

Is there a way for fread to mimic the behaviour of read.table, whereby the class of each variable is set by the data that is read in?

I have numeric data with a few comments underneath the main data. When I use fread to read in the data, the columns are converted to character. However, by setting `nrows` in `read.table` I can stop this behaviour. Is this possible in fread? (I would prefer not to alter the raw data or make an amended copy.) Thanks.

An example

d <- data.frame(x=c(1:100, NA, NA, "fff"), y=c(1:100, NA,NA,NA)) 
write.csv(d, "test.csv",  row.names=F)

in_d <- read.csv("test.csv", nrow=100, header=T)
in_dt <- data.table::fread("test.csv", nrow=100)

Which produces

> str(in_d)
'data.frame':   100 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10 ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
> str(in_dt)
Classes ‘data.table’ and 'data.frame':  100 obs. of  2 variables:
 $ x: chr  "1" "2" "3" "4" ...
 $ y: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, ".internal.selfref")=<externalptr>

As a workaround I thought I would be able to use read.table to read in one line, get the classes and pass them to colClasses, but this does not work the way I expected.

cl <- read.csv("test.csv", nrow=1, header=T)       # read one row to learn the classes
cols <- unname(sapply(cl, class))                   # "integer" "integer"
in_dt <- data.table::fread("test.csv", nrow=100, colClasses=cols)
str(in_dt)                                          # x is still character: fread will not
                                                    # downgrade a column's type via colClasses

Using Windows 8.1, R version 3.1.2 (2014-10-31), Platform: x86_64-w64-mingw32/x64 (64-bit).

user2957945
  • Sounded like a reasonable plan but then I actually read the help page: "fread will only promote a column to a higher type if colClasses requests it. It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss." It appears that even limiting the read to 5 rows fails. I think I remember that the colClasses mechanism was a fairly recent addition, so maybe you should submit a feature request. Matthew and Arun are often very accommodating. – IRTFM Apr 07 '15 at 19:21
  • Surely there must be a DT strategy to coerce all columns to numeric? Set `.SDcols` to an appropriate vector and something like this: `DT[, .SD := lapply(.SDcols, as.numeric), .SDcols=vec]`. I'm not a DT user but I'm sure there is a minimal-typing approach of some sort and I suspect you can find it illustrated in SO answers. – IRTFM Apr 07 '15 at 19:28 [a corrected sketch of this idea appears right after these comments]
  • @BondedDust; I'm not a DT user either; it's just that read.table is having (more serious) issues with my data than fread is. I'll have a look on SO. Thanks – user2957945 Apr 07 '15 at 19:38
  • Look, stop revert-warring your title. It's not about "using nrows". Writing out a string to an integer column is intentionally writing a defective csv. Your issue is ***"Can I prevent data.table's fread column-class inference getting overridden by trailing string comment rows in my data?"*** You know that comment rows should start with a comment-char (like `#`). When we prepend that to your comment text, we get the correct behavior with `read.csv` (but not `fread`). Yes, `fread` could do with an enhancement. Meanwhile it needs a workaround. – smci Apr 07 '15 at 20:39
  • Thanks for your input. `"Writing out a string to an integer column is intentionally writing a defective csv"` - this was an MWE to reflect some data that I have, and as such it is not commented. If I added `#` to the file it would not reflect the data that I have. BTW, I didn't revert. – user2957945 Apr 07 '15 at 20:50
  • Ok. Can anyone show (with citation) that this is not an abuse of CSV? Just because `read.csv(nrow)` accepts it and gets the type inference right. – smci Apr 07 '15 at 21:04
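
As the quoted help page explains, colClasses cannot downgrade the column, so the practical route is coercion after reading. Below is a minimal sketch of the data.table idiom the coercion comment above is reaching for (the `:=` call quoted there is not quite valid syntax), assuming the offending columns can simply be detected with is.character():

library(data.table)

DT <- fread("test.csv", nrows = 100)        # x still comes back as character

# find the character columns (here just "x"); they could also be hard-coded
cols <- names(DT)[sapply(DT, is.character)]

# assign by reference: replace each of those columns with its numeric version
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

str(DT)                                     # x is now numeric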

2 Answers

18

Option 1: Using a system command

fread() allows the use of a system command in its first argument. We can use it to remove the quotes in the first column of the file.

indt <- data.table::fread("cat test.csv | tr -d '\"'", nrows = 100)
str(indt)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

The system command cat test.csv | tr -d '\"' explained:

  • cat test.csv reads the file to standard output
  • | is a pipe, using the output of the previous command as input for the next command
  • tr -d '\"' deletes (-d) all occurrences of double quotes ('\"') from the current input
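
cat and tr are typically not available from a plain Windows command prompt, which appears to be why this call errors for the asker. One hedged alternative (a sketch, not part of the original answer) is to do the same quote-stripping in base R and point fread() at a temporary copy, leaving the original csv untouched:

library(data.table)

tmp <- tempfile(fileext = ".csv")                       # scratch copy; the raw file is not modified
writeLines(gsub('"', "", readLines("test.csv")), tmp)   # drop every double quote, like tr -d '"'
indt <- fread(tmp, nrows = 100)
str(indt)

Like the tr pipe, this removes every double quote, so it assumes no quoted field itself contains a comma.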

Option 2: Coercion after reading

Since option 1 doesn't seem to be working on your system, another possibility is to read the file as you did, but convert the x column with type.convert().

library(data.table)
indt2 <- fread("test.csv", nrows = 100)[, x := type.convert(x)]
str(indt2)
# Classes ‘data.table’ and 'data.frame':    100 obs. of  2 variables:
#  $ x: int  1 2 3 4 5 6 7 8 9 10 ...
#  $ y: int  1 2 3 4 5 6 7 8 9 10 ...
#  - attr(*, ".internal.selfref")=<externalptr> 

Side note: I usually prefer to use type.convert() over as.numeric() to avoid the "NAs introduced by coercion" warning triggered in some cases. For example,

x <- c("1", "4", "NA", "6")
as.numeric(x)
# [1]  1  4 NA  6
# Warning message:
# NAs introduced by coercion 
type.convert(x)
# [1]  1  4 NA  6

But of course you can use as.numeric() as well.
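
For instance, a hedged equivalent of the one-liner above using as.numeric() might look like the following (the suppressWarnings() wrapper only matters if non-numeric strings such as a literal "NA" survive the nrows cut; indt3 is just an illustrative name):

indt3 <- fread("test.csv", nrows = 100)[, x := suppressWarnings(as.numeric(x))]
str(indt3)   # x is now numeric (double) rather than integer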


Note: This answer assumes data.table dev v1.9.5

Rich Scriven
  • Thanks Richard, however, this throws an error on my system. I am using Windows 8.1 – user2957945 Apr 07 '15 at 19:21
  • Thanks for the extra info. I've installed the dev version but get the same error on Windows (it works, with a helpful warning, on Linux). Could you explain the system call command please? – user2957945 Apr 07 '15 at 19:33
  • @user2957945 - Okay good to know. I added a second option that should be sufficiently efficient – Rich Scriven Apr 07 '15 at 19:34
  • Huh... I may have spoken too soon about the error using the system call. It actually looks like `fread` doesn't work at all since the dev version. I'll load from CRAN and see if that fixes it. Thanks again for your help. – user2957945 Apr 07 '15 at 19:45
  • @user2957945 - when you re-do it, use a fresh R session to avoid conflicts with other objects – Rich Scriven Apr 07 '15 at 19:49
  • Yup, interesting. When I upgrade to the dev version on my system, fread stops working with the error `Error in fread("test.csv") : showProgress must be 0 or 1, currently`. Downgrading gets it working again. Your second solution works (always on a fresh session ;)) – user2957945 Apr 07 '15 at 19:50
-2

OK, the customer is abusing the CSV format by intentionally writing out trailing string rows to an integer column, without those rows starting with a comment.char (#).

Then you somehow expect you can override fread()'s type inference to read those as integer, by using nrows to try to limit it to just the integer rows. read.csv(..., nrows) will accept this; however, fread() always uses all rows for type inference (not just the ones specified by nrows, skip, header), even if they start with comment.char (that's a bug).

  1. Sounds like an abuse of CSV. Your comment rows should be prepended with #
  2. Yes, fread() needs a fix/enhancement to ignore comment rows for type inference.
  3. For now, you can work around this with fread() by post-processing the data.table after reading it in.
  4. It's arguable whether fread() should be changed to support the behavior you want: using nrows to limit what gets exposed to type-inference. It might fix your (pretty unique) case and break some others.

I don't see why you (EDIT: the customer) can't write your comments to a separate .txt/README/data-dictionary file to accompany the .csv. The practice of using a separate data-dictionary file is pretty well-established. I've never seen someone do this to a CSV file. At least move the comments to the header, not a footer.

smci
  • The csv doesn't have a multi-row header. It has unwanted stuff at the bottom of the file – Rich Scriven Apr 07 '15 at 20:21
  • Ok, then trailing (string) comment rows, without any leading '#'. Why did you not just manually append them to the write.csv? – smci Apr 07 '15 at 20:23
  • It's not my question, but I'm guessing that OP already has the file and just used `write.csv()` to create an example. – Rich Scriven Apr 07 '15 at 20:24
  • Yes, as @RichardScriven said, this was a small example of my data and it does not have extra headers or cells with comment characters. It has additional information at the bottom that changes the class of my variables. We generally store data in non-writable directories (although of course it is not impossible to get permissions), so I would prefer not to amend and save a new file. – user2957945 Apr 07 '15 at 20:38
  • Re your update: 1) the data is given to me, hence it is not commented. 3) yes, adding this extra information to the bottom is not helpful, but I regularly receive data like this, where clients add small summaries or text below the main data (unfortunately it is not a unique case for me). – user2957945 Apr 07 '15 at 20:57
  • I think your arguments are a little beyond the scope of the question. Can't we stipulate that this is a bad data format and then ask about how the OP can deal with it? (I know there's a *reductio ad absurdum* here somewhere, e.g. "my customer is embedding their data as a screen capture in a .mp4 file: how can I use `fread` to read it?", but this case seems far from that.) – Ben Bolker Apr 07 '15 at 21:12
  • PS https://tools.ietf.org/html/rfc4180 says nothing whatsoever about comment lines/metadata (except for the header line) – Ben Bolker Apr 07 '15 at 21:13
  • @Ben Bolker: yes the RFC says nothing whatsoever about comments. The issue of how/whether CSV files can contain comments, and if so in what format, is a longstanding open issue: [Can a CSV file have a comment?](http://stackoverflow.com/questions/1961006/can-a-csv-file-have-a-comment) – smci Apr 07 '15 at 21:18
  • @user2957945 if you regularly get such files, then how about writing a script to automatically extract any trailing comment rows (esp. when any of the fields are numeric) and move them to a separate `.comment` file? – smci Apr 07 '15 at 21:23
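
A minimal sketch of what such a pre-processing script might look like, assuming the trailing comment rows can be recognised by a first field that is neither a number nor NA; the helper name and the output file names are made up for illustration:

split_trailing_comments <- function(path, data_out, comments_out) {
  lines <- readLines(path)
  body  <- lines[-1]                                  # everything after the header row
  first <- gsub('"', "", sub(",.*$", "", body))       # unquoted first field of each row
  looks_like_data <- first == "NA" |
    !is.na(suppressWarnings(as.numeric(first)))
  last_data_row <- max(which(looks_like_data))        # last row that still parses as data
  writeLines(c(lines[1], body[seq_len(last_data_row)]), data_out)   # clean csv
  writeLines(body[-seq_len(last_data_row)], comments_out)           # trailing comment rows
  invisible(data_out)
}

# hypothetical usage
split_trailing_comments("test.csv", "test_clean.csv", "test.comments.txt")
indt <- data.table::fread("test_clean.csv")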