Filter data frame based on duplicates in columns

Question

I have a very large data frame that looks something like this:

Gene Sample1 Sample2 
   A       1       0
   A       0       1
   A       1       1
   B       1       1
   C       0       1 
   C       0       0

I want to only keep rows where there is a duplicate in the Gene column.

So the table would become:

Gene Sample1 Sample2 
   A       1       0
   A       0       1
   A       1       1    
   C       0       1 
   C       0       0

I've tried using subset(df, duplicated(df$Genes)) in R, but I think it left some non-duplicates in, as the naming is more involved than A/B/C (for example WASH11, KANSL-1, etc.). Can this be done in R or the Linux shell?

Rich Scriven · Accepted Answer · 2015-09-06T23:26:10.287

In R, you could double-up on duplicated(), going from both directions.

df[with(df, duplicated(Gene) | duplicated(Gene, fromLast = TRUE)), ]
#   Gene Sample1 Sample2
# 1    A       1       0
# 2    A       0       1
# 3    A       1       1
# 5    C       0       1
# 6    C       0       0
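
As a minimal sketch of why a single duplicated() call, as tried in the question, falls short: duplicated() flags only the second and later occurrences, so the first row of each repeated gene is dropped. The data frame below is rebuilt by hand from the question's example, so its construction is an assumption, as are the helper names fwd, bwd and keep.

```r
# Rebuild the example data from the question (assumed construction)
df <- data.frame(
  Gene    = c("A", "A", "A", "B", "C", "C"),
  Sample1 = c(1, 0, 1, 1, 0, 0),
  Sample2 = c(0, 1, 1, 1, 1, 0)
)

# Forward pass: TRUE for the 2nd, 3rd, ... occurrence of each gene
fwd <- duplicated(df$Gene)
# Backward pass: TRUE for every occurrence except the last one
bwd <- duplicated(df$Gene, fromLast = TRUE)

# The forward pass alone misses the first row of each repeated gene
rownames(df[fwd, ])   # rows 2, 3 and 6 only

# Combining both passes keeps every row whose gene occurs more than once
keep <- fwd | bwd
rownames(df[keep, ])  # rows 1, 2, 3, 5 and 6
```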

You could also use a table of the first column.

tbl <- table(df$Gene)
df[df$Gene %in% names(tbl)[tbl > 1], ]
#   Gene Sample1 Sample2
# 1    A       1       0
# 2    A       0       1
# 3    A       1       1
# 5    C       0       1
# 6    C       0       0

Other options, which may or may not work depending on the real data, are:

df[(table(df$Gene) > 1)[df$Gene], ]  ## credit to Pierre Lafortune
## or
df[with(df, (tabulate(Gene) > 1)[Gene]), ]
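
Whether these last two work seems to depend mainly on how Gene is stored. The following is a sketch under the assumption that Gene is a character column (the default for data.frame() since R 4.0.0). tabulate() accepts only numeric or factor input, so converting to a factor first is one way to make it work. The data frame is rebuilt by hand from the question, and the names by_name, by_code and g are illustrative.

```r
# Example data with Gene as a plain character column (assumed construction)
df <- data.frame(
  Gene    = c("A", "A", "A", "B", "C", "C"),
  Sample1 = c(1, 0, 1, 1, 0, 0),
  Sample2 = c(0, 1, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Character column: the logical table is looked up by gene name
by_name <- df[(table(df$Gene) > 1)[df$Gene], ]

# tabulate() rejects character input outright
tab_fails <- inherits(try(tabulate(df$Gene), silent = TRUE), "try-error")

# Converting to a factor lets its integer codes index the per-level counts
g <- factor(df$Gene)
by_code <- df[(tabulate(g) > 1)[g], ]
```

If the real gene names include missing values or unused factor levels, the two variants may behave differently, so it is worth checking the result against the duplicated() version.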

@PierreLafortune - nice one. Could also do `df[with(df, (tabulate(Gene) > 1)[Gene]), ]` to save from making names and a bit faster than `table()` — Rich Scriven, Sep 06 '15 at 23:24

score 3 · Answer 2 · answered Sep 06 '15 at 23:31

You can count the occurrences of each gene by applying ave with length as the function:

ave(as.numeric(x$Gene), x$Gene, FUN=length)
## [1] 3 3 3 1 2 2

In this expression, the first argument to ave need only be a numeric vector whose length equals the number of rows in the data frame.

Use this to select rows:

count <- ave(as.numeric(x$Gene), x$Gene, FUN=length)
x[count>1,]
##   Gene Sample1 Sample2
## 1    A       1       0
## 2    A       0       1
## 3    A       1       1
## 5    C       0       1
## 6    C       0       0
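
Since only the length of the first argument matters, one possible variation is to pass seq_along(x$Gene) instead of as.numeric(x$Gene). This is a sketch, not part of the original answer; it avoids the NA values and coercion warning that as.numeric() produces when Gene is a character column rather than a factor. The data frame x is rebuilt by hand from the question and is an assumption.

```r
# Example data with Gene as a character column (assumed construction)
x <- data.frame(
  Gene    = c("A", "A", "A", "B", "C", "C"),
  Sample1 = c(1, 0, 1, 1, 0, 0),
  Sample2 = c(0, 1, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# seq_along() supplies a vector of the right length; ave() replaces each
# entry with the size of the group (gene) that the row belongs to
count <- ave(seq_along(x$Gene), x$Gene, FUN = length)

# Keep only rows whose gene occurs more than once
result <- x[count > 1, ]
```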

score 2 · Answer 3 · answered Sep 06 '15 at 22:20

From command line using Perl

cat counts.txt 
Gene    Sample1 Sample2
A   1   0
A   0   1
A   1   1
B   1   1
C   0   1
C   0   0

perl -ne '$cg{ (split /\t/,$_)[0] }++; push (@lines, $_); END { print shift @lines; foreach (@lines) { print if ($cg{ (split /\t/,$_)[0] } >= 2) }}' counts.txt 
Gene    Sample1 Sample2
A   1   0
A   0   1
A   1   1
C   0   1
C   0   0

The %cg hash counts the occurrences of each gene. Genes are extracted by taking only the first element [0] of the tab split on each line. @lines holds the entire contents of the file in memory, one element per line. The END block first prints the header line (shift @lines) and then outputs only those lines whose gene appeared >= 2 times.

not sure why the downvotes...OP asked for R or linux shell. — pcantalupo, Sep 06 '15 at 23:51
I'm with you. Nice solution IMO. — David Arenburg, Sep 06 '15 at 23:59
