
I had asked a question here: I had a simple dataframe from which I was attempting to remove duplicates. Very basic question.

Akrun gave a great answer, which was to use this line:

df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]  

I went ahead and did this, which worked great on the dummy problem. But I have 3.5 million records that I'm trying to filter.

In an attempt to see where the bottleneck is, I broke the code into steps.

step1 <- apply(df1[1:2], 1, sort)
step2 <- t(step1)
step3 <- data.frame(step2, df1$location)
step4 <- !duplicated(step3)
final <- df1[step4, ,]  
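
A rough sketch of how each step can be timed with base R's system.time (df1 as above):

step1_t <- system.time(step1 <- apply(df1[1:2], 1, sort))        # row-wise sort of the two id columns
step2_t <- system.time(step2 <- t(step1))                        # transpose back to rows
step3_t <- system.time(step3 <- data.frame(step2, df1$location))
step4_t <- system.time(step4 <- !duplicated(step3))
rbind(step1_t, step2_t, step3_t, step4_t)                        # compare elapsed times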

Step 1 took quite a long time, but it wasn't the worst offender.
Step 2, however, is clearly the culprit.

So I'm in the unfortunate situation where I'm looking for a way to transpose 3.5 million rows in R. (Or maybe not in R. Hopefully there is some way to do it somewhere).

Looking around, I saw a few ideas:

  • install the WGCNA library, which has a transposeBigData function. Unfortunately this package is no longer being maintained, and I can't install all the dependencies.

  • write the data to a csv, then read it in line by line, and transpose each line one at a time. For me, even writing the file ran overnight without completing.

This is really strange. I just want to remove duplicates. For some reason, I have to transpose a dataframe in this process. But I can't transpose a dataframe this large.

So I need a better strategy for either removing duplicates, or for transposing. Does anyone have any ideas on this?

By the way, I'm using Ubuntu 14.04, with 15.6 GiB RAM, for which cat /proc/cpuinfo returns

model name : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
cpu MHz : 1200.000
cache size : 6144 KB

Thanks.

For reference, the dummy dataframe from the original question:

df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10),
                 location = c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'),
                 comment = c('cold', 'freezing!', 'nice', 'boring', 'cold'))

2 Answers


A faster option would be using pmin/pmax with data.table

library(data.table)
setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2)))]
#     id1 id2   location comment
#1:   1   2     Alaska    cold
#2:   3   4 California    nice
#3:   4   5     Kansas  boring
#4:   9  10     Alaska    cold

If 'location' also needs to be included to find the unique rows:

setDT(df)[!duplicated(data.table(pmin(id1, id2), pmax(id1, id2), location))]
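
For intuition (an illustration, not part of the original answer): pmin and pmax operate on whole vectors at once, so the row-wise min and max of 'id1' and 'id2' are computed in a single vectorized pass, with no apply, sort, or t() over 3.5 million rows.

pmin(c(1, 2, 3, 4, 9), c(2, 1, 4, 5, 10))   # 1 1 3 4 9
pmax(c(1, 2, 3, 4, 9), c(2, 1, 4, 5, 10))   # 2 2 4 5 10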
  • Got this error -- have you seen it before? I'm googling it now. `Error in duplicated.default(pmin(id1, id2), pmax(id1, id2), : 'fromLast' must be TRUE or FALSE` – Monica Heddneck Jul 30 '16 at 02:26
  • @MonicaHeddneck I am using your example and there is no error for me. I am using R 3.3.0 and data.table_1.9.7, but 1.9.6 should also work – akrun Jul 30 '16 at 02:51
  • @MonicaHeddneck Please check on a random subset of rows in your dataset to see where exactly you get the error and then use `dput` to show that small dataset – akrun Jul 30 '16 at 02:56
  • @MonicaHeddneck Do you have NA elements in 'id1', 'id2'? – akrun Jul 30 '16 at 02:59
  • @MonicaHeddneck Can you try with `setDT(df)[!duplicated(pmin(id1, id2), pmax(id1, id2), location, fromLast = FALSE)]` (though we don't have to specify for default cases, but based on your error) – akrun Jul 30 '16 at 03:01
  • I guess the `duplicated` call does not seem valid -- `pmax(id1, id2)` is passed as "incomparables" and `location` is passed as "nmax" in the current case. I think it should be a separate `duplicated` for each component and, then `"&"` the results. – alexis_laz Jul 30 '16 at 17:09
  • @alexis_laz I think you are right. I need to construct a data.table or `cbind`. – akrun Jul 30 '16 at 18:31
  • I don't know exactly what you did, I'll have to look at the change you made. But it worked. Amazingly fast. This is pretty shocking. Thank you for putting in so much effort. – Monica Heddneck Jul 30 '16 at 18:59

So after struggling with this for most of the weekend (grateful for plenty of selfless help from the illustrious @akrun), I realized that I would need to go about this in a completely different manner.

Since the dataframe was simply too large to process in memory, I ended up using a strategy where I pasted together a (string) key and column-bound it onto the dataframe. Next, I collapsed the key and sorted its characters. Then I could use which to get the indices of the rows that contained non-duplicate keys, and with that I could filter my dataframe.

df_with_key <- within(df,  key <- paste(boxer1, boxer2, date, location, sep=""))

strSort <- function(x)
  sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")

df_with_key$key <- strSort(df_with_key$key)

idx <- which(!duplicated(df_with_key$key))

final_df <- df[idx,]
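
As an illustration only: the same key idea applied to the dummy dataframe from the question, reusing strSort from above (id1, id2 and location stand in for the real boxer1, boxer2, date and location columns):

df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10),
                 location = c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'),
                 comment = c('cold', 'freezing!', 'nice', 'boring', 'cold'))
df_with_key <- within(df, key <- paste(id1, id2, location, sep=""))
df_with_key$key <- strSort(df_with_key$key)             # sort the characters of each key
final_df <- df[which(!duplicated(df_with_key$key)), ]   # keeps rows 1, 3, 4, 5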
  • Can you check if the comment posted by alexis_laz is the root of the problem. I updated the post. – akrun Jul 30 '16 at 18:32