
Obligatory system setup in case it helps: Windows 10, R 3.2.3, Intel Core i7-2600K, 16 GB RAM. R is set to have access to as much RAM as it wants.

Hello!

I have a few hundred files, each with a data frame of size Nx29 or Nx31.

I am combining the ~4 columns that overlap across these data frames: ID1, ID2, Date, Text. I am sorting by date.

The problem is that these N's can be a few dozen rows in some files and a few million rows in others.

My original idea was:

Create a blank data frame
Loop -->
    Open a file
    Add that file's rows to the data frame
    Remove any duplicate rows (they happen semi-frequently)
    Open the next file
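
In rough R terms (simplified; files here stands in for my vector of file paths, and I'm assuming CSV-like files just for illustration):

cols <- c('ID1','ID2','Date','Text')
combined <- data.frame()
for (f in files) {
    part <- read.csv(f)                      # open a file
    combined <- rbind(combined, part[cols])  # add its shared columns to the data frame
    combined <- unique(combined)             # remove any duplicate rows
}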

However, even doing that results in my data frame getting too large and running out of memory.

I would consider writing to file, but that could be problematic, because let's say I

Read file
Write to file
Open next file
Repeat.

Then, once I have everything in one massive file, I likely won't be able to load it into R: if the individual components won't all fit in memory at once, the combined file almost certainly won't either.

However, I want it all sorted by date, and each file can contain roughly 45 days' worth of data, so it's hard to see a way to do this.

So how can I effectively get all of my data together to be able to sort it without blowing R up?

Thanks!

Jibril
  • It would be helpful to see a few sample excerpts from these files, as well as the code you've tried so far. – bgoldst Apr 09 '16 at 19:55
  • Maybe you could `file.append()` everything to a single file, then import it. – Vincent Bonhomme Apr 09 '16 at 20:24
  • I had a similar problem, except that the individual datasets were too large to fit into memory. I ended up using data.table's `fread` to read chunks of the files; I then processed these smaller chunks and saved the temporary files to disk. In a separate R script, and a new R session, I combined and processed these intermediate files. Not sure if the intermediate processing is possible in your case, but it may be worth thinking about. In my own case, I had to use `fread` because read.table would not release the memory from reading rows 1:(N-1), even though I told it to start at row N. – lmo Apr 09 '16 at 22:16
  • [This post](http://stackoverflow.com/q/35763574/559784) might be of help. – Arun Apr 10 '16 at 09:14

2 Answers


One possibility might be to use an approach along the following lines, in which I assume that you already have a list/vector my_files that specifies all the files you want to open.

  1. Create a helper function my_inspect that reads a file and returns a two-column matrix: one column with the Date information, and one with the line numbers, with the twist that -1 marks lines you want to skip (e.g. duplicated rows). It could also be worth adding an extra step that keeps only the lines to be included later on, sorted by date.

  2. Use lapply on my_files with the function my_inspect, and you will have a list with the date information you need in order to sort everything at the end.

  3. If the matrices created by my_inspect have been sorted by date, it should be a rather simple task to identify the file(s) containing the earliest date, and a loop construction can then work through the files, extract the lines matching the date under investigation, and save those lines to new files (a rough sketch follows below).
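
A minimal sketch of steps 1 and 2 (the file format, the column names, and the my_files vector are assumptions on my part; I use a data frame rather than a matrix so the Date column keeps its class):

my_inspect <- function(fn) {
    d <- read.table(fn, header=TRUE, stringsAsFactors=FALSE)
    info <- data.frame(Date=as.Date(d$Date), line=seq_len(nrow(d)))
    ## mark duplicated rows with -1 so later passes skip them
    info$line[duplicated(d[c('ID1','ID2','Date','Text')])] <- -1L
    info[order(info$Date), ]
}
date_info <- lapply(my_files, my_inspect)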

This approach might not be the most efficient one, but I think it should be possible to get it done in this manner. You might still need to divide the result into several files, and if that's the case it could be an idea to add a bookkeeping file that tells your code which date ranges the different files contain.


It's hard to provide a more specific analysis without more information about your data files and code, but I'll give a demonstration of how I would approach this problem.

First, with respect to reading in data files into an R session, I've found on a few occasions that the fread() function from the data.table package is much more efficient and flexible than the read.table()/read.csv()/read.csv2() family provided by base R. Furthermore, since using data.table objects usually provides performance benefits over plain data.frame objects for many types of operations, especially when you're working with big data, I highly recommend moving into the data.table world when coding in R.

Here I generate some test data, specifically 5 data.tables:

library(data.table);

## generate data
set.seed(0L);
NR <- 10L; NF <- 5L;
for (f in seq_len(NF)) {
    n <- paste0('dt',f);
    assign(n,cbind(
        data.table(
            ID1=sample(1:100,NR),
            ID2=sample(1:100,NR),
            Date=sample(seq(as.Date('2016-01-01'),as.Date('2016-12-31'),1L),NR),
            Text=n
        ),
        replicate(sample(10:15,1L),sample(seq_len(NR))) ## varying column set
    ));
};
dt1; ## just generated 5 of these
##     ID1 ID2       Date Text V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
##  1:  90   7 2016-10-11  dt1  5  9  5 10  4  5  3  7 10  10   6   7
##  2:  27  21 2016-12-07  dt1  6  6  8  3  8  7  1  4  7   5   1   6
##  3:  37  18 2016-03-18  dt1  4  7  4  4  3  4  6  3  3   4   9   2
##  4:  56  67 2016-08-24  dt1  2  4  2  9  9  3  7 10  4   2   5   8
##  5:  88  37 2016-02-15  dt1 10  8  1  8  7 10  5  9  1   9  10   5
##  6:  20  74 2016-04-06  dt1  8 10  6  2  5  2  4  2  6   3   3   3
##  7:  85  47 2016-05-19  dt1  9  1  7  5 10  9  2  1  8   6   7   1
##  8:  96  97 2016-01-05  dt1  1  2  9  7  2  1  8  5  5   1   2   9
##  9:  61  92 2016-05-16  dt1  7  3  3  1  6  8 10  8  9   8   4   4
## 10:  58  35 2016-11-06  dt1  3  5 10  6  1  6  9  6  2   7   8  10

Now write the data.table objects to disk files:

## write files
for (f in seq_len(NF)) {
    n <- paste0('dt',f);
    write.table(get(n),paste0(n,'.txt'),row.names=F);
};

The following code reads the disk files into a single data.table object. I'm using the select argument of fread() to exclude all but the 4 columns we're interested in, which is important for efficiency and for lightening the load on RAM. It also ensures that the rbind() call will not fail due to inconsistent column sets among its arguments.

## read files into single data.table
fns <- list.files(pattern='^dt.*\\.txt$');
fns;
## [1] "dt1.txt" "dt2.txt" "dt3.txt" "dt4.txt" "dt5.txt"
sel <- c('ID1','ID2','Date','Text');
dt <- do.call(rbind,lapply(fns,function(fn) fread(fn,select=sel)));
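
Incidentally, data.table's rbindlist() is the more idiomatic (and typically faster) way to express the same combine:

dt <- rbindlist(lapply(fns,function(fn) fread(fn,select=sel)));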

Finally, we can sort the data.table by the Date column as follows:

dt[order(Date)];
##     ID1 ID2       Date Text
##  1:  96  97 2016-01-05  dt1
##  2:  74  10 2016-01-13  dt5
##  3:  11  26 2016-01-20  dt4
##  4:  86   7 2016-01-24  dt4
##  5:  24  65 2016-01-28  dt5
##  6:  10  17 2016-01-31  dt3
##  7:  88  37 2016-02-15  dt1
##  8:  54  61 2016-02-21  dt4
##  9:   4  89 2016-02-28  dt4
## 10:  14  10 2016-03-11  dt3
## 11:  37  18 2016-03-18  dt1
## 12:  39  88 2016-03-27  dt5
## 13:  20  74 2016-04-06  dt1
## 14:  50  37 2016-04-06  dt3
## 15:  12  77 2016-05-07  dt3
## 16:  38  51 2016-05-15  dt4
## 17:  61  92 2016-05-16  dt1
## 18:  49  85 2016-05-17  dt2
## 19:  85  47 2016-05-19  dt1
## 20:  44  76 2016-05-22  dt2
## 21:  69  52 2016-05-31  dt5
## 22:  27  41 2016-06-18  dt4
## 23:  66  96 2016-06-28  dt3
## 24:  62  40 2016-07-10  dt5
## 25:  15  33 2016-07-17  dt5
## 26:  72  36 2016-07-28  dt4
## 27:  26  67 2016-08-04  dt2
## 28:  46  12 2016-08-05  dt4
## 29:  90  82 2016-08-05  dt5
## 30:  86  97 2016-08-11  dt3
## 31:  62  29 2016-08-18  dt2
## 32:  50   6 2016-08-19  dt2
## 33:  56  67 2016-08-24  dt1
## 34:  62  13 2016-08-25  dt3
## 35:  78  42 2016-08-26  dt4
## 36:  91  70 2016-09-12  dt5
## 37:  33  46 2016-09-19  dt3
## 38:   8  32 2016-09-23  dt2
## 39:  23  59 2016-10-04  dt5
## 40:  90   7 2016-10-11  dt1
## 41:  56  89 2016-11-02  dt2
## 42:  98  49 2016-11-03  dt3
## 43:  58  35 2016-11-06  dt1
## 44:  33  44 2016-11-07  dt2
## 45:  20  31 2016-11-18  dt2
## 46:  18  40 2016-11-19  dt2
## 47:  27  21 2016-12-07  dt1
## 48:   5   7 2016-12-13  dt3
## 49:  27  81 2016-12-13  dt5
## 50:   5  90 2016-12-22  dt4
##     ID1 ID2       Date Text
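
If the copy made by dt[order(Date)] becomes a concern at this scale, data.table's setorder() reorders the table by reference instead:

setorder(dt,Date); ## sorts dt in place by Date, without allocating a sorted copy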
bgoldst