I asked a question here: I had a simple dataframe from which I was attempting to remove duplicates. Very basic question.
Akrun gave a great answer, which was to use this line:
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
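To see what that line does, here it is applied to the dummy dataframe given at the bottom of this post: each id pair is sorted so that (1, 2) and (2, 1) become identical, and duplicated() then flags the repeated (pair, location) combination.

# the dummy data, repeated from the bottom of the post for convenience
df <- data.frame(id1 = c(1, 2, 3, 4, 9),
                 id2 = c(2, 1, 4, 5, 10),
                 location = c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'),
                 comment = c('cold', 'freezing!', 'nice', 'boring', 'cold'))

# Sort each id pair, attach the location, and keep only the first occurrence.
# Row 2 ((2, 1), Alaska) sorts to ((1, 2), Alaska), so it is dropped.
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)), ]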
I went ahead and did this, which worked great on the dummy problem. But I have 3.5 million records that I'm trying to filter.
In an attempt to see where the bottleneck is, I broke the code into steps.
step1 <- apply(df1[1:2], 1, sort)
step2 <- t(step1)
step3 <- data.frame(step2, df1$location)
step4 <- !duplicated(step3)
final <- df1[step4, ]
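Each step can be timed on its own, for example by wrapping it in base R's system.time() (just a sketch; the exact numbers will depend on the machine, I only care about the relative cost of each step):

system.time(step1 <- apply(df1[1:2], 1, sort))        # sort each id pair, row by row
system.time(step2 <- t(step1))                        # flip the 2 x n result back to n x 2
system.time(step3 <- data.frame(step2, df1$location)) # attach the location column
system.time(step4 <- !duplicated(step3))              # flag repeated (pair, location) rows
system.time(final <- df1[step4, ])                    # keep the first occurrence of each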
Step 1 took quite a long time, but it wasn't the worst offender.
Step 2, however, is clearly the culprit.
So I'm in the unfortunate situation where I'm looking for a way to transpose 3.5 million rows in R. (Or maybe not in R. Hopefully there is some way to do it somewhere).
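For what it's worth, the transpose is only there because apply() over rows returns its results column-wise: sorting the two id columns of an n-row dataframe produces a 2 x n matrix, and t() flips it back to n x 2 before duplicated() can compare rows.

dim(apply(df1[1:2], 1, sort))      # 2 x nrow(df1): one column per original row
dim(t(apply(df1[1:2], 1, sort)))   # nrow(df1) x 2, ready for duplicated()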
Looking around, I saw a few ideas:

- Install the WGCNA library, which has a transposeBigData function. Unfortunately, this package is no longer being maintained, and I can't install all of its dependencies.
- Write the data to a CSV, then read it back in line by line, transposing one line at a time. For me, even writing the file ran overnight without completing.
This is really strange. I just want to remove duplicates, yet along the way I have to transpose a dataframe, and I can't transpose a dataframe this large.
So I need a better strategy for either removing duplicates, or for transposing. Does anyone have any ideas on this?
By the way, I'm using Ubuntu 14.04 with 15.6 GiB of RAM, and cat /proc/cpuinfo reports:
model name : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
cpu MHz : 1200.000
cache size : 6144 KB
Thanks.
The dummy data:

df <- data.frame(id1 = c(1, 2, 3, 4, 9),
                 id2 = c(2, 1, 4, 5, 10),
                 location = c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'),
                 comment = c('cold', 'freezing!', 'nice', 'boring', 'cold'))