I have the following line of code:
train <- read.csv("avito_train.tsv", sep='\t', stringsAsFactors = F)
The training file is around 3 GB. It takes a really long time to load all of that data.
My question is: would a proper data scientist load all of the data, or only use a subset? I notice I could use the nrows parameter to specify a maximum number of rows to read.
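For example, I assume I could read just a sample of the file like this (100,000 is an arbitrary number I picked for illustration):

rows_to_read <- 100000  # arbitrary sample size, just for experimentation
train_sample <- read.csv("avito_train.tsv", sep = '\t',
                         stringsAsFactors = FALSE, nrows = rows_to_read)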
I also believe that loading all of this data into a corpus (as I have to do) will probably be very time-consuming. Is there a general consensus on the recommended strategy for writing machine learning programs with large training and testing data sets?
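For reference, this is roughly what I expect the corpus step to look like (a sketch using the tm package; the text column name is just a placeholder, not the actual column in my file):

library(tm)

# Build a corpus from the ad text column ("description" is a placeholder name)
corpus <- VCorpus(VectorSource(train$description))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)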