I have a very large data frame, data (more than 200,000 rows), containing genomic positions for different genes. I want to extract all rows for a given set of genes and combine them into a new data frame. For example, I want all rows for SSR1 and STK38.

chrom  txStart ExonCount geneSymbol
chr6   7281287         8       SSR1
chr6   7295624         8       SSR1
chr6   7298155         8       SSR1
chr6  31938951         8      STK19
chr6  31939645         8      STK19
chr6  31940397         8      STK19
chr6  36461668        14      STK38
chr6  36464487        14      STK38
chr6  36465556        14      STK38
chr6 125229391         7        STL
chr6 125241333         7        STL
chr6 125252841         7        STL

Of course, I could do this with which, as below, and then combine the results with rbind, but that is too time-consuming because I will have a lot of genes.

Gene1 <- data[which(data$geneSymbol=="SSR1"), ]
Gene2 <- data[which(data$geneSymbol=="STK38"), ]

I've tried a for loop, but I'm not getting the right output.

genes1 <- 0
genes <- c("SSR1", "STK38")
for (i in genes) {
  genes1 <- print(data[which(data$geneSymbol==i), ])
}

I want it to look like this:

chrom  txStart ExonCount geneSymbol
chr6   7281287         8       SSR1
chr6   7295624         8       SSR1
chr6   7298155         8       SSR1
chr6  36461668        14      STK38
chr6  36464487        14      STK38
chr6  36465556        14      STK38

I'm sure that the solution is very easy, but I've looked all over the web for the past few days without finding a solution.

jbaums
Claudia

1 Answer

We can use %in% instead of == when matching against more than one value.

subset(data, geneSymbol %in% c("SSR1", "STK38"))
#   chrom  txStart ExonCount geneSymbol
#1  chr6  7281287         8       SSR1
#2  chr6  7295624         8       SSR1
#3  chr6  7298155         8       SSR1
#7  chr6 36461668        14      STK38
#8  chr6 36464487        14      STK38
#9  chr6 36465556        14      STK38
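
For anyone wanting to try this without the full dataset, here is a minimal sketch. The small data frame is a stand-in built from the example rows in the question, and the names by_in, pieces and by_loop are only illustrative. It shows the same %in% filter written with bracket indexing, plus one way the loop from the question could be repaired: in the original loop, genes1 is overwritten on every pass, so only the last gene's rows survive.

```r
# Small stand-in for the question's `data` (values copied from the example rows)
data <- data.frame(
  chrom = "chr6",
  txStart = c(7281287, 7295624, 7298155, 31938951, 31939645, 31940397,
              36461668, 36464487, 36465556, 125229391, 125241333, 125252841),
  ExonCount = rep(c(8, 8, 14, 7), each = 3),
  geneSymbol = rep(c("SSR1", "STK19", "STK38", "STL"), each = 3),
  stringsAsFactors = FALSE
)

genes <- c("SSR1", "STK38")

# Vectorised: keep rows whose geneSymbol is any of the wanted genes
by_in <- data[data$geneSymbol %in% genes, ]

# Loop-style variant: collect one piece per gene, then bind them once at the end
pieces <- lapply(genes, function(g) data[data$geneSymbol == g, ])
by_loop <- do.call(rbind, pieces)
```

Both approaches select the same rows here, but the %in% version avoids growing or binding data frames gene by gene, which tends to matter as the gene list gets long.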

As the dataset is big, we can also use data.table for the subsetting. Convert the 'data.frame' to a 'data.table', set the 'key' column, and subset the rows where 'geneSymbol' is either 'SSR1' or 'STK38'.

library(data.table)
setDT(data, key = "geneSymbol")[.(c("SSR1", "STK38"))]
#   chrom  txStart ExonCount geneSymbol
#1:  chr6  7281287         8       SSR1
#2:  chr6  7295624         8       SSR1
#3:  chr6  7298155         8       SSR1
#4:  chr6 36461668        14      STK38
#5:  chr6 36464487        14      STK38
#6:  chr6 36465556        14      STK38
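
Note that setDT converts data by reference, and setting a key sorts the rows by that column. If the original data frame should stay untouched, a possible alternative (a sketch only, using a small stand-in data frame and the illustrative names dt and res) is to work on a copy made with as.data.table and filter with %in% without setting a key:

```r
library(data.table)

# Small stand-in for the question's `data`; one row per gene for brevity
data <- data.frame(
  chrom = "chr6",
  txStart = c(7281287, 31938951, 36461668, 125229391),
  ExonCount = c(8, 8, 14, 7),
  geneSymbol = c("SSR1", "STK19", "STK38", "STL"),
  stringsAsFactors = FALSE
)

genes <- c("SSR1", "STK38")

# as.data.table() returns a copy, so `data` itself remains a plain data.frame
dt <- as.data.table(data)

# Filter rows in the `i` argument; columns can be referred to by name
res <- dt[geneSymbol %in% genes]
```

The keyed lookup shown above and this unkeyed filter should return the same rows for this example; which one is faster on a given dataset is something to measure rather than assume.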
akrun
  • 2
    Please study the FAQ [How should duplicate questions be handled?](http://meta.stackexchange.com/questions/10841/how-should-duplicate-questions-be-handled): "In a nutshell: If a question is a duplicate of another question, flag or vote to close."; "Should I answer it? No, not if you think it's a duplicate". – Henrik Aug 28 '16 at 12:29
  • 2
    Please also read [The Wikipedia of Long Tail Programming Questions](https://blog.stackoverflow.com/2011/01/the-wikipedia-of-long-tail-programming-questions/): "**Don't answer questions that have already been answered elsewhere**. Yeah, you might earn a couple of points of reputation, but, because you are duplicating content, _you are actually making the internet worse._". – Henrik Aug 28 '16 at 12:30
  • @Henrik The OP's dataset is really big and your dupe link doesn't provide that option, whereas my solution also has data.table – akrun Aug 28 '16 at 12:33
  • 1
    @ akrun - thanks a million. I had tried the subset but without the %in% i didn't get far!!! – Claudia Aug 30 '16 at 08:29
  • @Henrik, I had looked at various discussions and also went through the suggested questions when formulating my question, but whatever i was looking for, i couldn't find! – Claudia Aug 30 '16 at 08:32