I have a very large data frame that looks something like this:

Gene Sample1 Sample2 
   A       1       0
   A       0       1
   A       1       1
   B       1       1
   C       0       1 
   C       0       0 

I want to keep only the rows whose Gene value appears more than once in the Gene column.

So the table would become:

Gene Sample1 Sample2 
   A       1       0
   A       0       1
   A       1       1    
   C       0       1 
   C       0       0  

I've tried using subset(df, duplicated(df$Genes)) in R, but I think it left some non-duplicates behind, since the real gene names are more involved than A/B/C (for example WASH11, KANSL-1, etc.). Can this be done in R or in the Linux shell?

Braiam

3 Answers

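In R, a single call to duplicated() flags only the second and later occurrences of each value, so the first row of every group is missed. You could call duplicated() twice, once from each direction, and combine the results: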

df[with(df, duplicated(Gene) | duplicated(Gene, fromLast = TRUE)), ]
#   Gene Sample1 Sample2
# 1    A       1       0
# 2    A       0       1
# 3    A       1       1
# 5    C       0       1
# 6    C       0       0

You could also build a table of counts from the Gene column and keep the rows whose gene occurs more than once.

tbl <- table(df$Gene)
df[df$Gene %in% names(tbl)[tbl > 1], ]
#   Gene Sample1 Sample2
# 1    A       1       0
# 2    A       0       1
# 3    A       1       1
# 5    C       0       1
# 6    C       0       0

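Other options, which may or may not work depending on the real data (for example, tabulate() requires Gene to be a factor or an integer vector and raises an error on a character column), are: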

df[(table(df$Gene) > 1)[df$Gene],]  ## credit to Pierre LaFortune
## or
df[with(df, (tabulate(Gene) > 1)[Gene]), ]
Rich Scriven
    @PierreLafortune - nice one. Could also do `df[with(df, (tabulate(Gene) > 1)[Gene]), ]` to save from making names and a bit faster than `table()` – Rich Scriven Sep 06 '15 at 23:24

You can count the occurrences of each gene by applying ave() with FUN=length, grouped by Gene:

ave(as.numeric(x$Gene), x$Gene, FUN=length)
## [1] 3 3 3 1 2 2

In this expression, the first argument to ave() need only be a numeric vector whose length equals the number of rows in the data frame; its values are ignored because FUN=length only counts the entries in each group.

Use this to select rows:

count <- ave(as.numeric(x$Gene), x$Gene, FUN=length)
x[count>1,]
##   Gene Sample1 Sample2
## 1    A       1       0
## 2    A       0       1
## 3    A       1       1
## 5    C       0       1
## 6    C       0       0
Matthew Lundberg

From the command line, using Perl on a tab-delimited file:

cat counts.txt 
Gene    Sample1 Sample2
A   1   0
A   0   1
A   1   1
B   1   1
C   0   1
C   0   0

perl -ne '$cg{ (split /\t/,$_)[0] }++; push (@lines, $_); END { print shift @lines; foreach (@lines) { print if ($cg{ (split /\t/,$_)[0] } >= 2) }}' counts.txt 
Gene    Sample1 Sample2
A   1   0
A   0   1
A   1   1
C   0   1
C   0   0

The %cg hash counts the occurrences of each gene. The gene is extracted by splitting each line on tabs and taking only the first element [0]. @lines holds the entire contents of the file in memory, one element per line. The END block first prints the header line (shift @lines) and then outputs only those lines whose gene appeared >= 2 times.
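Because the question mentions a very large data frame, the same count-then-filter idea can also be sketched in awk by reading the file twice instead of storing every line in memory. This is only a sketch under assumptions: keep_dup_genes is a hypothetical helper name, the input is assumed to be a regular file (not a pipe) with one header line, and the gene name is assumed to be the first whitespace-delimited field.

```shell
# Hypothetical helper (not from the answers above): keep_dup_genes FILE
# Assumes FILE is a regular file with one header line and the gene in column 1.
keep_dup_genes() {
    # The same file is passed twice. While NR == FNR awk is still in the
    # first pass, so it only counts genes (skipping the header). In the
    # second pass it prints the header plus rows whose gene was seen > 1 time.
    awk 'NR == FNR { if (FNR > 1) count[$1]++; next }
         FNR == 1 || count[$1] > 1' "$1" "$1"
}

# Demo on the example data from the question, written to a temporary file.
demo=$(mktemp)
printf 'Gene\tSample1\tSample2\nA\t1\t0\nA\t0\t1\nA\t1\t1\nB\t1\t1\nC\t0\t1\nC\t0\t0\n' > "$demo"
result=$(keep_dup_genes "$demo")
printf '%s\n' "$result"
rm -f "$demo"
```

Only the per-gene counts are kept in memory, at the cost of reading the file twice. Fields are compared as exact strings, so names such as WASH11 or KANSL-1 need no special handling.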

pcantalupo