I have a large dataframe and a vector of terms of interest that I want to pull out of it. For a previous project I was using:

a=data[data$rn %in% y, "Gene"]

to pull the matching information out into a new vector. Now I have another job I'd like to do. I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the contents of the vector and return the matching rows as a new dataframe.

To make this extra annoying, the hit could be in column 3 and not in column 9, and vice versa.

Working example

I have stripped the dataframe down to 3 columns and a few rows.

data <- structure(list(
  Gene = structure(c(1L, 5L, 3L, 2L, 4L),
    .Label = c("ibp", "leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"),
  LocusTag = structure(c(1L, 2L, 5L, 3L, 4L),
    .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4", "pleuBTgp5", "pLeuDn_02"), class = "factor"),
  hit = structure(c(2L, 4L, 3L, 1L, 5L),
    .Label = c("2-isopropylmalate synthase", "Ibp protein", "ORF1", "repA1 protein", "replication-associated protein"), class = "factor")),
  .Names = c("Gene", "LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")

y <- c("ibp", "orf1")

AudileF
  • Please show a small reproducible example. It may not work because the column names could be different and `rbind` requires the same column names – akrun Apr 27 '17 at 11:03
  • Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap Apr 27 '17 at 11:07
  • Well, it gives me the coordinates. I'll be back with an example. – AudileF Apr 27 '17 at 11:08
  • If you want to get a data.frame in return, why not use `data[data$rn %in% y, c("Gene", "hit")]`? – talat Apr 27 '17 at 11:10
  • @docendodiscimus it throws back an error saying undefined columns selected. – AudileF Apr 27 '17 at 11:15
  • In this case we really need an example of your data, there seems to be something going on with your data structure – Sarina Apr 27 '17 at 11:18
  • Btw, I think you are misinterpreting your code. It doesn't check anything in columns Gene and hit. It only returns those columns for the rows where the rn column is in y. – talat Apr 27 '17 at 11:18

1 Answer

First of all, R is case sensitive, so with your example y the third row (where hit is "ORF1") will not be matched, although I guess you want it extracted. You would therefore have to change your y to

y <- c("ibp", "ORF1")
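If editing y by hand is not practical, one alternative (not part of the original answer) is to lower-case both sides before comparing. This is only a sketch: the `hit` vector below is copied from the question's sample data as plain character strings, and `as.character()` is used so the same line would also work on a factor column.

```r
# 'hit' values from the question's sample data, as plain strings
hit <- c("Ibp protein", "repA1 protein", "ORF1",
         "2-isopropylmalate synthase", "replication-associated protein")

y <- c("ibp", "orf1")

# Exact match that ignores case: compare lower-cased copies of both sides
match_ci <- tolower(as.character(hit)) %in% tolower(y)
match_ci
# FALSE FALSE  TRUE FALSE FALSE
```

Only the comparison is lower-cased here; the original values in the data are left untouched.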

From your example I have tried to work out what you want to achieve. I am not sure this is really what you want, but R has the operator | for an element-wise "or", so you could try something like:

new.data<-data[data$Gene %in% y|data$hit %in% y,]

If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:

new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
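As a quick check, here is the same filter run end to end on the question's sample data. This is a minimal sketch: the data frame is rebuilt with plain character columns instead of factors (an assumption made for readability), and the intermediate name `keep` is introduced only for illustration.

```r
# Sample data from the question, rebuilt with character columns
data <- data.frame(
  Gene     = c("ibp", "repA1", "pLeuDn_02", "leuA", "repA"),
  LocusTag = c("pBPS1_01", "pBPS1_02", "pLeuDn_02", "pleuBTgp4", "pleuBTgp5"),
  hit      = c("Ibp protein", "repA1 protein", "ORF1",
               "2-isopropylmalate synthase", "replication-associated protein"),
  stringsAsFactors = FALSE
)

y <- c("ibp", "ORF1")

# TRUE for every row where either column holds one of the terms in y
keep <- data$Gene %in% y | data$hit %in% y

new.data <- data[keep, ]
new.data
# Rows 1 (Gene == "ibp") and 3 (hit == "ORF1") are returned
```

On the full 15-column data frame the columns could also be addressed by position, e.g. `data[[3]] %in% y | data[[9]] %in% y`, assuming columns 3 and 9 are the ones to search.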
Sarina
  • Hi Sarina, this works great. Thanks a mill. Just a quick question: is there a way to get it to return partial matches? Some of the names in my vector may look like e.g. hokB_2. If hokB was in the table, I assume it would not match. – AudileF Apr 27 '17 at 12:56
  • As far as I know, the `%in%` operator is not meant to be used for partial matching. I am not familiar with this kind of extraction, so I can't help you with that. Is the number of different subcases so large that you can't add them to your y? You can also use `unique(data$Gene)` to get a list of all the distinct values in that column of your dataset; maybe that helps. – Sarina Apr 27 '17 at 13:20
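For the partial-match follow-up, one possible approach (not from the answer above) is to strip a trailing `_<number>` suffix from the search terms before using `%in%`, and to use base R's `grepl()` for substring matches in a free-text column. The sketch below runs on made-up toy values; the name `y_base` and the assumption that suffixes always look like `_2` are illustrative only.

```r
# Toy data, invented for illustration
data <- data.frame(
  Gene = c("hokB", "ibp", "repA1"),
  hit  = c("HokB toxin", "Ibp protein", "repA1 protein"),
  stringsAsFactors = FALSE
)

y <- c("hokB_2", "ibp")

# Drop a trailing "_<digits>" suffix so "hokB_2" becomes "hokB"
y_base <- unique(sub("_[0-9]+$", "", y))

# Exact match on the cleaned terms
exact <- data$Gene %in% y_base

# Substring match, ignoring case: the terms are joined into one regex
# ("hokB|ibp"), so terms containing regex metacharacters would need escaping
partial <- grepl(paste(y_base, collapse = "|"), data$hit, ignore.case = TRUE)

new.data <- data[exact | partial, ]
new.data
# Rows 1 and 2 are returned
```

Substring matching is looser than `%in%`: a short term such as "repA" would also match "repA1", so the results are worth checking by eye.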