In the given dataset, case_control
indicates whether a row is a case
or control
, id
is an identifier which is unique for case
but it can be repeated for control
and group
indicates cluster. I need to select one control per case within each group
but if a control is previous selected for a case, it cannot be selected for the next case, based on the id
variable. If there are no available controls, the case will have to be dropped.
How can I achieve this to work quickly in a very large dataset with ~10 million rows (with 2 mil cases and 8 mil controls)?
Dataset looks like this(https://docs.google.com/spreadsheets/d/1MpjKv9Fm_Hagb11h_dqtDX4hV7G7sZrt/edit#gid=1801722229)
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_1 control 22
cluster_1 control 23
cluster_2 case 12
cluster_2 control 21
cluster_2 control 22
cluster_2 control 24
cluster_3 case 13
cluster_3 control 21
cluster_3 control 22
cluster_3 control 25
Expected output must look like this
group case_control id
cluster_1 case 11
cluster_1 control 21
cluster_2 case 12
cluster_2 control 22
cluster_3 case 13
cluster_3 control 25