I know this issue has been raised in several places, and I have spent hours looking for a good solution without success, so I'm asking here.
I have a huge data file (~5 GB), which I read with fread():
library(data.table)
df <- fread('output.txt', sep = "|", stringsAsFactors = TRUE)
head(df, 5)
       age            income homeowner_status_desc marital_status_cd gender
1:         $35,000 - $49,999
2: 35 - 44 $35,000 - $49,999                  Rent            Single      F
3:         $35,000 - $49,999
4:
5:         $50,000 - $74,999
str(df)
Classes ‘data.table’ and 'data.frame': 999 obs. of 5 variables:
$ age : chr "" "35 - 44" "" "" ...
$ income : chr "$35,000 - $49,999" "$35,000 - $49,999" "$35,000 - $49,999" "" ...
$ homeowner_status_desc: chr "" "Rent" "" "" ...
$ marital_status_cd : chr "" "Single" "" "" ...
$ gender : chr "" "F" "" "" ...
- attr(*, ".internal.selfref")=<externalptr>
The data contains missing values (the blank cells). The original data has many columns, so I need a way to convert every column that contains strings to a factor. What is the best practice for doing this? I was considering converting it to a data.frame first, but is it possible to do this while it is still a data.table?
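To make the goal concrete, here is a minimal sketch of the kind of in-place conversion I have in mind, run on a toy table rather than the real file. The toy values, the extra integer column `n`, and the decision to turn blanks into NA before building the factor are my own assumptions, and I'm not sure this is the recommended approach.

```r
library(data.table)

# Toy stand-in for the real file (hypothetical values mirroring the sample above)
dt <- data.table(
  age    = c("", "35 - 44", "", "", ""),
  income = c("$35,000 - $49,999", "$35,000 - $49,999", "$35,000 - $49,999",
             "", "$50,000 - $74,999"),
  gender = c("", "F", "", "", ""),
  n      = 1:5  # hypothetical non-character column, to show it is left alone
)

# Names of the columns currently stored as character
chr_cols <- names(dt)[vapply(dt, is.character, logical(1))]

# Assumption: blanks should become NA rather than a "" factor level
blank_to_factor <- function(x) {
  factor(replace(x, which(x == ""), NA_character_))
}

# Convert only the character columns, by reference (the table is not copied)
if (length(chr_cols) > 0) {
  dt[, (chr_cols) := lapply(.SD, blank_to_factor), .SDcols = chr_cols]
}

str(dt)
```

This stays a data.table throughout, since `:=` modifies the selected columns by reference. If the blanks should be kept as their own level instead, the helper could be replaced with plain `factor`.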