
I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. I came up with the following simplified example to illustrate my problem:

test <- read.table(
text = "
A   1900    1   10  45tz    tztime1 01.06.1900
A   1900    2   9   45tz    tztime1 01.06.1900
A   1900    3   8   45tz    tztime1 01.06.1900
A   1900    4   7   45tz    tztime1 01.06.1900
A   1900    5   6   45tz    tztime1 01.06.1900
A   1900    6   5   45tz    tztime1 01.06.1900
A   1900    7   4   45tz    tztime1 01.06.1900
A   1900    7   10  45tz    tztime1 01.06.1900
A   1900    7   9   45tz    tztime1 01.06.1900
A   1900    8   3   45tz    tztime1 01.06.1900
A   1900    8   10  45tz    tztime1 01.06.1900
A   1900    8   9   45tz    tztime1 01.06.1900
A   2000    1   10  45tz    tztime2 01.06.2000
A   2000    2   9   45tz    tztime2 01.06.2000
A   2000    3   8   45tz    tztime2 01.06.2000
A   2000    3   10  45tz    tztime2 01.06.2000
A   2000    3   9   45tz    tztime2 01.06.2000
B   1800    1   10  52fd    tztime0 01.06.1800
B   1800    2   9   52fd    tztime0 01.06.1800
B   1800    3   8   52fd    tztime0 01.06.1800
B   1800    3   10  52fd    tztime0 01.06.1800
B   1800    3   9   52fd    tztime0 01.06.1800
B   1800    4   7   52fd    tztime0 01.06.1800
B   1900    1   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
B   1900    2   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
",header=TRUE)
library(data.table)
setDT(test)
names(test) <-  c("ID", "Year", "Count", "value", "A","B","C")

In this simplified dataset, I have two individuals (A and B), observed over different but possibly overlapping years. For each row, a Count is given as well as a value.

Within each ID, Year and Count group, I would like to delete the observations that are duplicates and fulfill a certain condition (see below). For example, for the group:

A   1900    7   4
A   1900    7   10
A   1900    7   9

I would like to delete all observations whose value is larger than the minimum value within the group. In this case I would like to keep only

A   1900    7   4

as a remainder.

Note that my real dataset is very large and has many more columns. Therefore, if possible, I am looking for a memory-efficient solution.

I hope that was clear enough. If not, feel free to ask for any information that is missing.

Edit: my real dataset has many more columns than displayed here, so in the end I am looking for a solution that returns all of the columns. For example, assume that columns A, B and C (added in the latest edit) are also part of the dataset; they are not needed for the grouping/filtering, but should still be part of the final result. The currently proposed solution does not account for this.

user3032689
  • You have tagged the question with "sql". What database are you using? – Gordon Linoff Sep 08 '16 at 13:38
  • Is this an SQL question? Please tag your DBMS if so. – massko Sep 08 '16 at 13:44
  • You say "I would like to delete all observations, whose value is larger than the minimum value within each group."...so basically, keep the min value(s) of each group?? – Sotos Sep 08 '16 at 13:47
  • In R, maybe `test[,list(Value=min(Value)),by=list(ID,YEAR,Count)]`. – nicola Sep 08 '16 at 13:49
  • Sorry, I tagged it as `sql` because I am also using `sqldf` in R. I should have used that tag instead. Edited. @Sotos yes, exactly. – user3032689 Sep 08 '16 at 14:10
  • Fyi, you could `library(data.table); test = fread("text text text")` instead. Also, it would be good if you could write an example as a function of `n`, the number of observations, making it easier to test out the performance of various options (since you say your real data is large). – Frank Sep 08 '16 at 15:30
  • @Frank Alright, but I am not familiar with creating large datasets with millions of rows. There would have to be a lot of random sampling involved, and there would still have to be duplicates every now and then. Edit: if you are just asking about the dataset, it has about 2.5 million rows and ~30 columns. – user3032689 Sep 08 '16 at 15:36
  • The usual guidance for that is here: http://stackoverflow.com/a/28481250/ Yeah, you'd have to think about what data-generating process did a satisfactory job of mimicking your true use-case. Anyway, if you're happy with the answers you have, then it's just something to keep in mind next time you have a question about performance. Anyway, my recommendation for what to try out: `test[test[, min(Value), by=.(ID, YEAR, Count)], on=c("ID", "YEAR", "Count", Value="V1")]` (if I understood correctly). – Frank Sep 08 '16 at 15:38
  • @Frank ok,ty, but what if I want to keep columns A, B and C as part of the resulting rows (edited in the question, but it should simply be random columns)? – user3032689 Sep 09 '16 at 12:39
  • @user3032689 The code from my last comment still works, after you change the capitalization to match. (You switched YEAR to Year, Value to value, etc...) – Frank Sep 09 '16 at 14:07
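
For reference, here is the join from Frank's comments written out with capitalization matching the column names defined above (a sketch, not benchmarked on the full data). It keeps every column, including A, B and C, of the rows that hit the per-group minimum:

library(data.table)   # already loaded above
# per-group minima, then join back so that all columns of the matching rows are kept
mins <- test[, .(value = min(value)), by = .(ID, Year, Count)]
test[mins, on = c("ID", "Year", "Count", "value")]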

2 Answers


In R, you can answer this with the following: `test[, .(value = min(value)), by = .(ID, Year, Count)]`

Here we are going through the data and finding the minimum value for each combination of ID, Year, and Count. This uses the grouping syntax from the data.table package.
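
As a quick check on the example data from the question (using the lower-case column name `value` set there):

# one row per (ID, Year, Count) group, keeping only the minimum value;
# note that the other columns (A, B, C) are dropped by this aggregation
res <- test[, .(value = min(value)), by = .(ID, Year, Count)]
res[ID == "A" & Year == 1900 & Count == 7]   # a single row: A 1900 7 4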

Austin
  • Works perfectly. Thanks. – user3032689 Sep 08 '16 at 14:19
  • One question: In a larger datatable, the other columns which are not important for the sorting are dropped. How can I prevent this in your solution, assuming the other columns names are called A B and C? – user3032689 Sep 08 '16 at 15:53
  • It depends on the columns that you have. You will need to pick a summarizing function for them and add them to the first list. Alternatively, if you want unique rows with these values, you would add the column names to the second list (after the by) . `test[,.(Value=min(Value), newVar=sum(newVar), newVar2=max(newVar2)), by=.(ID, Year, Count)]` – Austin Sep 08 '16 at 17:11
  • Ok I tried to set up the code to get unique rows with these values, but it doesn't work. What is the right syntax? Feel free to edit in your answer as well. Thanks! – user3032689 Sep 08 '16 at 20:51
  • `test[,.(Value=min(Value), newVar=sum(newVar), newVar2=max(newVar2)), by=.(ID, Year, Count, A, B, C)]` – Austin Sep 08 '16 at 21:15
  • Ah ok I understand. So the order in the `by` command is important? So when the other columns are listed **after** ID, Year and Count, it is harmless for the sorting mechanism, correct? – user3032689 Sep 09 '16 at 08:41
  • The order of the variables in the by command won't change any of the values in the rows; it will only change the order of the columns in the new table. I don't think it affects the order of the rows. – Austin Sep 09 '16 at 14:16
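
Another way to keep every column without listing A, B and C in the `by` (a sketch using the common data.table `.I` idiom, not part of the answer above) is to locate the row indices of the per-group minima and subset once:

# row numbers of the minimum-value row(s) within each (ID, Year, Count) group
idx <- test[, .I[value == min(value)], by = .(ID, Year, Count)]$V1
test[idx]   # all original columns kept; duplicates above the group minimum removed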

There are two possibilities, in my view, depending on the volume of data that needs to be deleted.

If you are looking to delete more than, let's say, 20% of the data, your best option is to create a temp table and insert only the rows that you want to keep.

Otherwise you can run a DELETE query, which will be slow anyway.

For the SQL part, you seem to want to keep the minimum value of each group, so:

DELETE FROM my_table A
WHERE EXISTS (
  SELECT 1 FROM (
    SELECT ID, YEAR, Count, MIN(value) AS min_value
    FROM my_table
    GROUP BY ID, YEAR, Count) TMP
  WHERE TMP.ID = A.ID AND TMP.YEAR = A.YEAR AND TMP.Count = A.Count
    AND A.value > TMP.min_value)

Given the query logic, the DBMS will do a full table scan to populate a hash table. If your data isn't that big (in size), you might want to try creating an index on all the columns: ID, YEAR, COUNT, VALUE.

Compare the SQL plans for the two cases.
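
Since the question also mentions `sqldf`, roughly the same keep-the-minimum logic can be written as a SELECT from R (a sketch assuming sqldf's default SQLite backend; sqldf works on a copy of the data frame, so there is no in-place DELETE):

library(sqldf)
# keep only rows whose value equals the minimum of their (ID, Year, Count) group;
# all other columns (A, B, C, ...) are carried along by t.*
kept <- sqldf('
  SELECT t.*
  FROM test t
  JOIN (SELECT s.ID, s.Year, s.Count, MIN(s.value) AS min_value
        FROM test s
        GROUP BY s.ID, s.Year, s.Count) m
    ON t.ID = m.ID AND t.Year = m.Year AND t.Count = m.Count
   AND t.value = m.min_value')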

Nemeros
  • Thanks for the entry, also upvoted this because I wanted to know how the solution looks in sqldf. – user3032689 Sep 08 '16 at 15:04