I have the following line of code:
train <- read.csv("avito_train.tsv", sep='\t', stringsAsFactors = F)
The training file is around 3 GB. It takes a really long time to load all of that data.
My question is: would a proper data scientist load all of the data, or only use a subset? I notice I could use the nrows parameter to specify a maximum number of rows to read.
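For example, I assume I could read just a sample of the file like this (100,000 is an arbitrary number I picked for illustration):

rows_to_read <- 100000  # arbitrary sample size, just for experimentation
train_sample <- read.csv("avito_train.tsv", sep = '\t',
                         stringsAsFactors = FALSE, nrows = rows_to_read)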
I also believe that loading all of this data into a corpus (as I have to do) will probably be very time-consuming. Is there a general consensus on the recommended strategy for writing machine learning programs with large training and testing data sets?
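For reference, this is roughly what I expect the corpus step to look like (a sketch using the tm package; the text column name is just a placeholder, not the actual column in my file):

library(tm)

# Build a corpus from the ad text column ("description" is a placeholder name)
corpus <- VCorpus(VectorSource(train$description))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)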