Filtering data in R (complex)

Question

I have a dataset with 7 million records.

I need to filter the data to only show about 9000 of these.

The first field dmg is effectively the primary key and take the format 1-Apr-123456. There are about 12 occurrences of each dmg value.

Another column is O_Y and takes the value of 0 or 1. It is most often 0, but 1 on about 900 occasions.

I would like to return all the rows with the same dmg value, where at least one of those records has and O_Y value of 1.

Can you read the data into R (RAM)? If not, package `sqldf` could be up to the task. — Roman Luštrik, May 14 '13 at 16:22
[Read this](http://stackoverflow.com/q/5963269) in order to learn how to make a good question — Jilber Urbina, May 14 '13 at 16:23
Like this? Assume `0_Y` is the j-th column. `foo<-readLines(datafile,n=1); if foo[j]==1 my.data<-rbind(my.data,foo)` , and loop on that for as many records as you have. — Carl Witthoft, May 14 '13 at 18:13
Roman, thanks - yes, it can read in ok to R. Takes a few minutes but is ok. — threeisles, May 14 '13 at 19:25
Thanks, Carl, I'll have a go at that,and let you know how I get on. — threeisles, May 14 '13 at 19:28

asb · Answer 1 · 2013-05-14T22:45:58.723

I recommend using data.table for doing this (fread in data.table will be quite handy in reading in the large data set too as you say you have enough RAM).

I am not sure that the following is the best way to do this in data.table but, at least, it should get you started. Hopefully, someone else will come along and list the most idiomatic data.table way for this. But this is what I can think of right now:

Assuming your data.table is called DT and has two columns dmg and O_Y. Use O_Y as the index key for DT and subset DT for O_Y == 1 (DT[.(1)] in data.table syntax). Now find the corresponding dmg values. The unique of these dmg values is your keys.with.ones. All this is succinctly done as follows:

setkey(DT, O_Y)
keys.with.ones <- unique(DT[.(1), dmg][["dmg"]])

Next, we need to extract rows corresponding to these values of dmg. For this we need to change the key for DT to dmg and extract the rows corresponding to the keys above:

setkey(DT, dmg)
DT.filtered <- DT[.(keys.with.ones)]

And we are done. :)

Please refer to ?data.table to figure out a better method if possible and let us know.

Filtering data in R (complex)

1 Answers1