I am not sure I understand the behavior of fread regarding empty strings. For instance:

library(data.table)

rawdata <- 'a,b\n"",""\nabc,2020-12-31 00:00:00'
fread(rawdata, na.strings = c("", "NA"))
##      a                   b
## 1:                        
## 2: abc 2020-12-31 00:00:00

I was expecting NA in the first row. Are my assumptions flawed?

Along the same lines, is it possible to have full control over colClasses and na.strings at the same time?

Say I want to read columns a and b as character.

rawdata <- 'a,b\n"",""\n1,2020-12-31 00:00:00'
fread(rawdata,na.strings=c("","NA"),
      colClasses=c(a="character",
                   b="character"))

I'm using data.table_1.13.6

Update

Part of the question has already been answered here. It seems that fread uses a different parser than read.csv, which might result in unexpected behavior.

One solution could be to replace all empty strings with NA; see here. But I am not sure this process is faster than read_csv.
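
A minimal sketch of that post-processing step, assuming every column is read as character and the empty strings are replaced only after `fread` returns. The names `dt` and `char_cols` are illustrative and not taken from the linked answer, and no claim is made here about how its speed compares with other readers.

```r
library(data.table)

rawdata <- 'a,b\n"",""\nabc,2020-12-31 00:00:00'

# Read both columns as character; quoted empty fields arrive as "".
dt <- fread(rawdata, colClasses = c(a = "character", b = "character"))

# Find the character columns, then overwrite "" with NA by reference.
char_cols <- names(dt)[vapply(dt, is.character, logical(1))]
for (col in char_cols) {
  # which() drops NA comparisons, so existing NAs are left untouched.
  set(dt, i = which(dt[[col]] == ""), j = col, value = NA_character_)
}

print(dt)
```

Because `set()` modifies the table in place, no copy of the data is made; columns that need another type can be converted afterwards.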

Henrik
DJJ
  • `rawdata <- 'a,b\n,\n1,2020-12-31 00:00:00'` then empty will become `NA` – s_baldur Mar 18 '21 at 10:15
  • thanks for the suggestion. I don't have control on `rawdata` directly, I can control it only after `fread` – DJJ Mar 18 '21 at 10:18
  • The first part is answered here: [fread: empty string (“”) in na.strings is not interpreted as NA](https://stackoverflow.com/questions/64798564/fread-empty-string-in-na-strings-is-not-interpreted-as-na) – Henrik Mar 18 '21 at 10:53
  • @Henrik many thanks for pointing this out. The post you refer to had escaped me. – DJJ Mar 18 '21 at 10:59

1 Answer

Once the fread parsing issue with quoted empty strings is set aside, it is clear that colClasses and na.strings can be used simultaneously.

Note that adding a quoted empty string (`""`) to na.strings does not do the job:

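```
rawdata <- 'a,b\n"",""\n1,2020-12-31 00:00:00'
fread(rawdata, na.strings = c('\"\"', "", "NA"),
      colClasses = c(a = "character",
                     b = "character"))
```

When the empty fields are unquoted, they are read as NA and the requested column classes are applied:
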
```
rawdata <- 'a,b\n,2020-12-31 00:00:00\n1,'
fread(rawdata,na.strings=c("","NA"),
      colClasses=c(a="numeric",
                   b="character"))
```

    ##     a                   b
    ## 1: NA 2020-12-31 00:00:00
    ## 2:  1                <NA>

DJJ