It's hard to provide a more specific analysis without more information about your data files and code, but I'll give a demonstration of how I would approach this problem.
First, for reading data files into an R session, I've found on a few occasions that the fread() function from the data.table package is much more efficient and flexible than the read.table()/read.csv()/read.csv2() family provided by base R. Furthermore, since data.table objects usually perform better than plain data.frame objects for many types of operations, especially on large data sets, I highly recommend moving into the data.table world when coding in R.
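As a rough illustration of how the two readers are called (a minimal sketch, not a benchmark), the snippet below reads the same file both ways. The temporary file and its two columns are made up for the example and have nothing to do with your actual data.

```r
library(data.table)

# Hypothetical sample file, created only so the sketch is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = letters[1:5]), tmp, row.names = FALSE)

via_base  <- read.csv(tmp)  # base R reader: returns a plain data.frame
via_fread <- fread(tmp)     # data.table reader: returns a data.table

class(via_base)
class(via_fread)  # a data.table is also a data.frame
```

On a file this small there is no meaningful speed difference; the gap tends to show up on files with many rows or columns.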
Here I generate some test data, specifically 5 data.tables:
library(data.table);
## generate data
set.seed(0L);
NR <- 10L; NF <- 5L;
for (f in seq_len(NF)) {
    n <- paste0('dt',f);
    assign(n,cbind(
        data.table(
            ID1=sample(1:100,NR),
            ID2=sample(1:100,NR),
            Date=sample(seq(as.Date('2016-01-01'),as.Date('2016-12-31'),1L),NR),
            Text=n
        ),
        replicate(sample(10:15,1L),sample(seq_len(NR))) ## varying column set
    ));
};
dt1; ## just generated 5 of these
##     ID1 ID2       Date Text V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
##  1:  90   7 2016-10-11  dt1  5  9  5 10  4  5  3  7 10  10   6   7
##  2:  27  21 2016-12-07  dt1  6  6  8  3  8  7  1  4  7   5   1   6
##  3:  37  18 2016-03-18  dt1  4  7  4  4  3  4  6  3  3   4   9   2
##  4:  56  67 2016-08-24  dt1  2  4  2  9  9  3  7 10  4   2   5   8
##  5:  88  37 2016-02-15  dt1 10  8  1  8  7 10  5  9  1   9  10   5
##  6:  20  74 2016-04-06  dt1  8 10  6  2  5  2  4  2  6   3   3   3
##  7:  85  47 2016-05-19  dt1  9  1  7  5 10  9  2  1  8   6   7   1
##  8:  96  97 2016-01-05  dt1  1  2  9  7  2  1  8  5  5   1   2   9
##  9:  61  92 2016-05-16  dt1  7  3  3  1  6  8 10  8  9   8   4   4
## 10:  58  35 2016-11-06  dt1  3  5 10  6  1  6  9  6  2   7   8  10
Now write the data.table objects to disk files:
## write files
for (f in seq_len(NF)) {
    n <- paste0('dt',f);
    write.table(get(n),paste0(n,'.txt'),row.names=FALSE);
};
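If you also control how the files are written, data.table's fwrite() is a possible alternative to write.table(). The following is only a sketch: the small table and the temporary file name are hypothetical stand-ins, and it uses fwrite()'s default comma separator rather than the space-separated layout produced above.

```r
library(data.table)

# Hypothetical stand-in for one of the generated tables
demo <- data.table(
    ID1  = 1:3,
    Date = as.Date("2016-01-01") + 0:2,
    Text = "demo"
)

out <- tempfile(fileext = ".txt")
fwrite(demo, out)   # comma-separated by default, no row names written

back <- fread(out)  # fread() detects the separator and header on its own
back
```

Either writer produces files that fread() can read back, so this choice only matters if writing speed is a concern.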
The following code reads the disk files into a single data.table object. I'm using the select argument of fread() to exclude all but the 4 columns we're interested in, which is important for efficiency and for lightening the load on RAM. It also ensures that the rbind() call will not fail due to inconsistent column sets in its arguments.
## read files into single data.table
fns <- list.files(pattern='^dt.*\\.txt$');
fns;
## [1] "dt1.txt" "dt2.txt" "dt3.txt" "dt4.txt" "dt5.txt"
sel <- c('ID1','ID2','Date','Text');
dt <- do.call(rbind,lapply(fns,function(fn) fread(fn,select=sel)));
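If you ever do want to keep the extra columns, one option is data.table's rbindlist() with fill=TRUE, which pads missing columns with NA instead of failing. This is a sketch on two made-up tables (the names a, b and the source column are assumptions for illustration), not a replacement for the select approach above.

```r
library(data.table)

# Two hypothetical tables with different column sets
a <- data.table(ID1 = 1:2, Text = "a", V1 = c(10L, 20L))
b <- data.table(ID1 = 3:4, Text = "b")

# fill = TRUE matches columns by name and fills the gaps with NA;
# idcol adds a column recording which list element each row came from
combined <- rbindlist(list(a = a, b = b), fill = TRUE, idcol = "source")
combined
```

With real files, the list passed to rbindlist() could be the result of the same lapply() over file names, in which case select is no longer needed to make the column sets agree.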
Finally, we can sort the data.table by the Date column as follows:
dt[order(Date)];
##     ID1 ID2       Date Text
##  1:  96  97 2016-01-05  dt1
##  2:  74  10 2016-01-13  dt5
##  3:  11  26 2016-01-20  dt4
##  4:  86   7 2016-01-24  dt4
##  5:  24  65 2016-01-28  dt5
##  6:  10  17 2016-01-31  dt3
##  7:  88  37 2016-02-15  dt1
##  8:  54  61 2016-02-21  dt4
##  9:   4  89 2016-02-28  dt4
## 10:  14  10 2016-03-11  dt3
## 11:  37  18 2016-03-18  dt1
## 12:  39  88 2016-03-27  dt5
## 13:  20  74 2016-04-06  dt1
## 14:  50  37 2016-04-06  dt3
## 15:  12  77 2016-05-07  dt3
## 16:  38  51 2016-05-15  dt4
## 17:  61  92 2016-05-16  dt1
## 18:  49  85 2016-05-17  dt2
## 19:  85  47 2016-05-19  dt1
## 20:  44  76 2016-05-22  dt2
## 21:  69  52 2016-05-31  dt5
## 22:  27  41 2016-06-18  dt4
## 23:  66  96 2016-06-28  dt3
## 24:  62  40 2016-07-10  dt5
## 25:  15  33 2016-07-17  dt5
## 26:  72  36 2016-07-28  dt4
## 27:  26  67 2016-08-04  dt2
## 28:  46  12 2016-08-05  dt4
## 29:  90  82 2016-08-05  dt5
## 30:  86  97 2016-08-11  dt3
## 31:  62  29 2016-08-18  dt2
## 32:  50   6 2016-08-19  dt2
## 33:  56  67 2016-08-24  dt1
## 34:  62  13 2016-08-25  dt3
## 35:  78  42 2016-08-26  dt4
## 36:  91  70 2016-09-12  dt5
## 37:  33  46 2016-09-19  dt3
## 38:   8  32 2016-09-23  dt2
## 39:  23  59 2016-10-04  dt5
## 40:  90   7 2016-10-11  dt1
## 41:  56  89 2016-11-02  dt2
## 42:  98  49 2016-11-03  dt3
## 43:  58  35 2016-11-06  dt1
## 44:  33  44 2016-11-07  dt2
## 45:  20  31 2016-11-18  dt2
## 46:  18  40 2016-11-19  dt2
## 47:  27  21 2016-12-07  dt1
## 48:   5   7 2016-12-13  dt3
## 49:  27  81 2016-12-13  dt5
## 50:   5  90 2016-12-22  dt4
##     ID1 ID2       Date Text
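Note that dt[order(Date)] returns a sorted copy and leaves dt itself unchanged. If you'd rather sort in place, setorder() is one way to do it. The sketch below uses a small made-up table, and it assumes the Date column arrived as character, which can happen depending on your data.table version; in that case converting it with as.Date() first makes the sort operate on real dates.

```r
library(data.table)

# Hypothetical stand-in for the combined table, with Date stored as character
demo <- data.table(
    ID1  = c(3L, 1L, 2L),
    Date = c("2016-12-07", "2016-01-05", "2016-05-19")
)

demo[, Date := as.Date(Date)]  # convert to a true Date column by reference
setorder(demo, Date)           # sort in place; no copy, no reassignment
demo
```

For ISO-formatted dates like these, sorting the character values gives the same order, so the conversion mainly matters if you go on to do date arithmetic or filtering.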