9

I know this issue has been raised in several places and I have been trying to find out a possible good solution for hours but failed. That's why I'm asking this.

So, I have a huge data file (~5GB) and I used fread() to read this

library(data.table)
df<- fread('output.txt', sep = "|", stringsAsFactors = TRUE)
head(df, 5)
       age            income homeowner_status_desc marital_status_cd gender
1:         $35,000 - $49,999                                               
2: 35 - 44 $35,000 - $49,999                  Rent            Single      F
3:         $35,000 - $49,999                                               
4:                                                                         
5:         $50,000 - $74,999 
str(df)
Classes ‘data.table’ and 'data.frame':  999 obs. of  5 variables:
 $ age                  : chr  "" "35 - 44" "" "" ...
 $ income               : chr  "$35,000 - $49,999" "$35,000 - $49,999" "$35,000 - $49,999" "" ...
 $ homeowner_status_desc: chr  "" "Rent" "" "" ...
 $ marital_status_cd    : chr  "" "Single" "" "" ...
 $ gender               : chr  "" "F" "" "" ...
 - attr(*, ".internal.selfref")=<externalptr> 

There are missing data(where it's blank). In the original data, there are lots of columns and thus I need to find a way to make columns Factor whenever columns include strings. Could anyone suggest what is the best practice to get this done? I was considering changing it to data frame and do this. But is it possible to do this while it's a data.table?

Arun
  • 116,683
  • 26
  • 284
  • 387
hmi2015
  • 831
  • 3
  • 10
  • 22
  • 2
    From my understanding, data.table does not store anything as a factor by default in order to reduce the amount of storage. You will have to change everything to factors by yourself. I have a line of code that does this for me: `df[,(names(df)):=lapply(.SD, as.factor),.SDcols=names(df)]`. Edit: if you want only character columns, use this: `types <- data.frame(sapply(df, class)); char_list <- row.names(types)[types[[1]] == 'character'] ` and then replace `names(df)` with `char_list` – Michal Jul 10 '15 at 21:09
  • @Michal but factors take less memory to store – rawr Jul 10 '15 at 21:12
  • Have you tried specifying `"factor"` where necessary in the `colClasses` argument to `fread`? – MichaelChirico Jul 10 '15 at 21:14
  • @MichaelChirico, that's not the best way to do in my case since in the original data I have ~70 columns. – hmi2015 Jul 10 '15 at 21:16
  • 1
    I made a little csv file and I can confirm the same behavior where `stringsAsFactors=TRUE` doesn't result in factor columns. Additionally specifying `colClasses` as factor doesn't seem to work either. – Dean MacGregor Jul 11 '15 at 04:55
  • Just implemented the `stringsAsFactors` argument... Will commit soon, and should be available in 1.9.5. And on CRAN as 1.9.6. – Arun Jul 11 '15 at 08:43
  • @rawr Often, but not always. `object.size(as.factor(sample(1:10,1e2,T))) # 1408 bytes` vs `object.size(as.character(sample(1:10,1e2,T))) # 1320 bytes`. Probably relevant: http://stackoverflow.com/questions/18304760 – Frank Jul 11 '15 at 13:32
  • @Frank well of course one can find a counter example, but all I have to do is change that to `1e3` and factors take the lead by orders of magnitude. Plus I assumed that since this was a thread about data table, we would be talking about more than 100 numerics (which take less memory anyway `object.size('1') - object.size(1)`) – rawr Jul 11 '15 at 17:14

4 Answers4

11

Just implemented stringsAsFactors argument for fread in v 1.9.6+

From NEWS:

  1. Implemented stringsAsFactors argument for fread(). When TRUE, character columns are converted to factors. Default is FALSE. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.
Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387
0

This is basically a comment, but it's long, so here goes.

You may want to use colClasses to specify which columns are factors.

If you've got a lot of columns, something I've done to simplify is use the following function I wrote:

abbr_to_colClass<-function(inits,counts){
  x<-substring(inits,1:nchar(inits),1:nchar(inits))
  types<-ifelse(x=="c","character",
                ifelse(x=="f","factor",
                       ifelse(x=="i","integer",
                              "numeric")))
  rep(types,substring(counts,1:nchar(counts),1:nchar(counts)))
}

Say you've got a .csv with columns of classes:

character 3
factor    2
integer   1
numeric   5
character 6

Then you could use my function to set

colClasses=abbr_to_colClass("cfinc","32156")

This will particularly save space if you have long strings of one type consecutively.

(I know it's not the most robust function, but it's served me quite well many times when there are a lot fields to be read in)

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
0

I made a little csv file and I can confirm the same behavior where stringsAsFactors=TRUE doesn't result in factor columns. Additionally specifying colClasses as factor doesn't seem to work either.

If you run this after fread it'll convert all your character columns to factors

for (j in which(sapply(df, class)=='character')) set(df, i=NULL, j=j, value=as.factor(df[[j]]))
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
0

Try the new readr package, it has been optimized to be 10x faster and not leak memory. Instead of stringsAsFactors, you can now specify col_types argument where you can specify a collector (a custom parser function). Look at the documentation, esp. col_factor/parse_factor.

require(readr)
read_csv(..., col_types=...)
smci
  • 32,567
  • 20
  • 113
  • 146
  • @Arun so maybe you have to make a trial first pass to compile the list of levels. Possibly using the `select` column argument. – smci Jul 11 '15 at 14:24
  • @Arun: it seems pretty performant, Hadley did a big rewrite, people are using it widely, let's hear from the OP how it worked... yes it requires a little effort to specify your levels – smci Jul 11 '15 at 19:16
  • @Arun: typically you might only need to load the first n lines to capture all factor levels, where say n ~ 100,000. It would be good for OP to post head-to-head performance numbers for `readr` vs `fread`. – smci Jul 11 '15 at 19:31
  • @Arun I don't see any point in us discussing further in the abstract; the OP can supply details of their factor levels. Inferring all factor levels can be a task varying from trivial (marital status, age cohort, zipcodes, salary ranges) to hard to impossible - it depends entirely on their dataset. As to performance, let's see the OP (or anyone else) post some numbers. `base::read.csv` leaked memory, we can avoid it now. – smci Jul 11 '15 at 19:55