0

I have a dataset with 7 million records.

I need to filter the data to only show about 9000 of these.

The first field dmg is effectively the primary key and take the format 1-Apr-123456. There are about 12 occurrences of each dmg value.

Another column is O_Y and takes the value of 0 or 1. It is most often 0, but 1 on about 900 occasions.

I would like to return all the rows with the same dmg value, where at least one of those records has and O_Y value of 1.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
threeisles
  • 301
  • 2
  • 8

1 Answers1

0

I recommend using data.table for doing this (fread in data.table will be quite handy in reading in the large data set too as you say you have enough RAM).

I am not sure that the following is the best way to do this in data.table but, at least, it should get you started. Hopefully, someone else will come along and list the most idiomatic data.table way for this. But this is what I can think of right now:

Assuming your data.table is called DT and has two columns dmg and O_Y. Use O_Y as the index key for DT and subset DT for O_Y == 1 (DT[.(1)] in data.table syntax). Now find the corresponding dmg values. The unique of these dmg values is your keys.with.ones. All this is succinctly done as follows:

setkey(DT, O_Y)
keys.with.ones <- unique(DT[.(1), dmg][["dmg"]])

Next, we need to extract rows corresponding to these values of dmg. For this we need to change the key for DT to dmg and extract the rows corresponding to the keys above:

setkey(DT, dmg)
DT.filtered <- DT[.(keys.with.ones)]

And we are done. :)

Please refer to ?data.table to figure out a better method if possible and let us know.

asb
  • 4,392
  • 1
  • 20
  • 30