I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. I thought about the following simplifying example to illustrate my prob:
test <- read.table(
text = "
A 1900 1 10 45tz tztime1 01.06.1900
A 1900 2 9 45tz tztime1 01.06.1900
A 1900 3 8 45tz tztime1 01.06.1900
A 1900 4 7 45tz tztime1 01.06.1900
A 1900 5 6 45tz tztime1 01.06.1900
A 1900 6 5 45tz tztime1 01.06.1900
A 1900 7 4 45tz tztime1 01.06.1900
A 1900 7 10 45tz tztime1 01.06.1900
A 1900 7 9 45tz tztime1 01.06.1900
A 1900 8 3 45tz tztime1 01.06.1900
A 1900 8 10 45tz tztime1 01.06.1900
A 1900 8 9 45tz tztime1 01.06.1900
A 2000 1 10 45tz tztime2 01.06.2000
A 2000 2 9 45tz tztime2 01.06.2000
A 2000 3 8 45tz tztime2 01.06.2000
A 2000 3 10 45tz tztime2 01.06.2000
A 2000 3 9 45tz tztime2 01.06.2000
B 1800 1 10 52fd tztime0 01.06.1800
B 1800 2 9 52fd tztime0 01.06.1800
B 1800 3 8 52fd tztime0 01.06.1800
B 1800 3 10 52fd tztime0 01.06.1800
B 1800 3 9 52fd tztime0 01.06.1800
B 1800 4 7 52fd tztime0 01.06.1800
B 1900 1 10 52fd tztime1 01.06.1900
B 1900 2 9 52fd tztime1 01.06.1900
B 1900 2 10 52fd tztime1 01.06.1900
B 1900 2 9 52fd tztime1 01.06.1900
",header=TRUE)
library(data.table)
setDT(test)
names(test) <- c("ID", "Year", "Count", "value", "A","B","C")
In this simplified dataset, I have two individuals (A and B), for different but possibly overlapping years. A Count is given, as well as a value.
I would like to delete the observations for each ID within each YEAR and Count group, that are duplicates and fullfill a certain condition (see below). For example for the group:
A 1900 7 4
A 1900 7 10
A 1900 7 9
I would like to delete all observations, whose value is larger than the minimum value within each group. In this case I would like to have only
A 1900 7 4
as a remainder.
Note that my real dataset is very large and has many more columns. Therefore if possible, I am looking for a solution which is memory-efficient.
I hope that was clear enough. If not, feel free to ask for any information that is missing.
Edit: my real dataset has a lot more columns than displayed here, so in the end I am looking for a solution which displays the information of all the columns (for example, assume in this case there are also column A, B and C as part of the dataset, which I have added in the latest edit. They are not really needed for the grouping/filtering, but still should be part of the final result). The currently proposed solution does not account for this.