Identifying duplicate columns in a dataframe

Question

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete continuous and categorical variables.

My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find rows in the list that are duplicates, as follows:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))

This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):

summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))

If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?

score 34 · Answer 1 · answered Nov 05 '15 at 19:04

34

How about:

testframe[!duplicated(as.list(testframe))]

answered Nov 05 '15 at 19:04

Mostafa Rezaei

599
5
9

2

This is by far the fastest method I've used for testing for dups on a data.frame – Zelazny7 Apr 14 '16 at 01:10
this answer is amazing! – stats_noob Dec 09 '21 at 22:35

score 24 · Accepted Answer · answered Mar 22 '12 at 07:58

24

You can do with lapply:

testframe[!duplicated(lapply(testframe, summary))]

summary summarizes the distribution while ignoring the order.

Not 100% but I would use digest if the data is huge:

library(digest)
testframe[!duplicated(lapply(testframe, digest))]

answered Mar 22 '12 at 07:58

kohske

65,572
8
165
155

2

In addition to @kohske 's suggestion to use `digest`, it might suffice to use `c` instead of `summary` as the `lapply` function. – BenBarnes Mar 22 '12 at 08:18
3

It should be noted that summary for character vectors will produce the same summary even though they are different. This is because summary on a character vector only outputs the vector's length – hshihab Mar 09 '16 at 09:09

score 11 · Answer 3 · answered Mar 09 '16 at 09:33

11

A nice trick that you can use is to transpose your data frame and then check for duplicates.

duplicated(t(testframe))

answered Mar 09 '16 at 09:33

hshihab

416
5
16

Henry · Answer 4 · 2012-03-22T08:20:42.580

unique(testframe, MARGIN=2)

does not work, though I think it should, so try

as.data.frame(unique(as.matrix(testframe), MARGIN=2))

or if you are worried about numbers turning into factors,

testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]

which produces

   age height gender
1   18   76.1      M
2   19   77.0      F
3   20   78.1      M
4   21   78.2      M
5   22   78.8      F
6   23   79.7      F
7   24   79.9      M
8   25   81.1      M
9   26   81.2      F
10  27   81.8      M
11  28   82.8      F
12  29   83.5      M

score 2 · Answer 5 · answered Aug 27 '19 at 20:22

2

It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc... To find the duplicate columns:

names(df)[duplicated(names(df))]

answered Aug 27 '19 at 20:22

Matt Elgazar

707
1
8
21

score 1 · Answer 6 · answered Oct 20 '19 at 17:47

1

What about just:

unique.matrix(testframe, MARGIN=2)

answered Oct 20 '19 at 17:47

Vitali Avagyan

1,193
1
7
17

score 0 · Answer 7 · answered Jan 04 '17 at 09:33

0

Actually you just would need to invert the duplicated-result in your code and could stick to using subset (which is more readable compared to bracket notation imho)

require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))

answered Jan 04 '17 at 09:33

Holger Brandl

10,634
3
64
63

score 0 · Answer 8 · answered Mar 09 '18 at 11:46

0

Here is a simple command that would work if the duplicated columns of your data frame had the same names:

testframe[names(testframe)[!duplicated(names(testframe))]]

answered Mar 09 '18 at 11:46

Fabio Natalini

197
2
2

Can you share your code? Then we could have a look and try to find a solution. Happy to help – Fabio Natalini Jun 06 '21 at 08:38

score 0 · Answer 9 · answered Sep 26 '19 at 16:36

If the problem is that dataframes have been merged one time too many using, for example:

    testframe2 <- merge(testframe, testframe, by = c('age'))

It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:

    testframe2 <- testframe2[!duplicated(as.list(testframe2))]
    names(testframe2) <- gsub('.x','',names(testframe2))

Emmanuel-Lin · Answer 10 · 2021-12-09T12:11:43.973

Since this Q&A is a popular Google search result but the answer is a bit slow for a large matrix I propose a new version using exponential search and data.table power.

This a function I implemented in dataPreparation package.

The function

dataPreparation::which_are_bijection

which_are_in_double(testframe)

Which return 3 and 4 the columns that are duplicated in your example

Build a data set with wanted dimensions for performance tests

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)

for (i in 1:12){
  testframe = rbind(testframe,testframe)  
}
# Result in 49152 rows
for (i in 1:5){
  testframe = cbind(testframe,testframe)  
}
# Result in 160 columns

The benchmark To perform the benchmark, I use the library rbenchmark which will reproduce each computations 100 times

benchmark(
  which_are_in_double(testframe, verbose=FALSE),
  duplicated(lapply(testframe, summary)),
  duplicated(lapply(testframe, digest))
)

                                             test replications elapsed
3           duplicated(lapply(testframe, digest))          100  39.505
2          duplicated(lapply(testframe, summary))          100  20.412
1 which_are_in_double(testframe, verbose = FALSE)          100  13.581

So which are bijection 3 to 1.5 times faster than other proposed solutions.

NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.

NB 2: Please note, the way this data set is constructed we have a lot of duplicated columns which reduce the advantage of exponential search. With just a few duplicated columns, one would have much better performance for which_are_bijection and similar performances for other methods.

Identifying duplicate columns in a dataframe

10 Answers10

Linked

Related