
I need to extract a subset of a large dataset using a list of keywords. The large dataset (gene_infoNCBI), shown here, contains the keywords:

> head(gene_infoNCBI)
  X.tax_id  GeneID   Symbol  LocusTag Synonyms dbXrefs chromosome map_location
1        7 5692769 NEWENTRY         -        -       -          -            -
2        9 1246500    At1g00930 pLeuDn_01        -       -          -            -
3        9 1246501    repA2 At1g13580        -       -          -            -
4        9 1246502     leuA pLeuDn_04        -       -          -            -
5        9 1246503     leuB pLeuDn_05        -       -          -            -
6        9 1246504     leuC pLeuDn_06        -       -          -            -
                                                                                                                                                                                                 description
1 Record to support submission of GeneRIFs for a gene not in Gene (Azotirhizobium caulinodans.  Use when strain, subtype, isolate, etc. is unspecified, or when different from all specified ones in Gene.).
2                                                                                                                                                                    putative replication-associated protein
3                                                                                                                                                                    putative replication-associated protein
4                                                                                                                                                                                 2-isopropylmalate synthase
5                                                                                                                                                                            3-isopropylmalate dehydrogenase
6                                                                                                                                                                    isopropylmalate isomerase large subunit
    type_of_gene Symbol_from_nomenclature_authority Full_name_from_nomenclature_authority
1          other                                  -                                     -
2 protein-coding                                  -                                     -
3 protein-coding                                  -                                     -
4 protein-coding                                  -                                     -
5 protein-coding                                  -                                     -
6 protein-coding                                  -                                     -
  Nomenclature_status Other_designations Modification_date Feature_type
1                   -                  -          20190202            -
2                   -                  -          20180129            -
3                   -                  -          20180129            -
4                   -                  -          20180129            -
5                   -                  -          20180129            -
6                   -                  -          20180129            -

keyword.txt holds the keywords, which match values in the "Symbol" and "LocusTag" columns of gene_infoNCBI.

1              At1g00930          NA NA
2              At1g00930          NA NA
3              At1g00930          NA NA
4              At1g00930          NA NA
5              At1g00930          NA NA
6              At1g13580          NA NA
shahzad
  • Please do not post an image of code/data/errors: it cannot be copied or searched (SEO), it breaks screen-readers, and it may not fit well on some mobile devices. Ref: https://meta.stackoverflow.com/a/285557/3358272 (and https://xkcd.com/2116/). Please just include the code or data (e.g., `dput(head(x))` or `data.frame(...)`) directly. – r2evans Oct 27 '19 at 17:59
  • Additionally, it is not clear how your `Keyword.txt` values are supposed to fit in with the image of your data. Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`), and expected output. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. – r2evans Oct 27 '19 at 18:05

1 Answer


Not much to go on here but you could do something like this:

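library(tidyverse)

set.seed(10)  # make the random sample reproducible

# Keywords to search for, combined into one regex alternation: "a|c|d|e|f"
keywords <- c("a", "c", "d", "e", "f")
key_vec <- str_c(keywords, collapse = "|")

# Toy data: x holds the values that are matched against the keywords
dat <- tibble(z = seq(1, 100, 1),
              y = runif(100, 0, 50),
              x = sample(letters, 100, replace = TRUE))

# Keep only the rows where x matches any of the keywords
dat %>%
  filter(str_detect(x, key_vec))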
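
Since the regex approach above matches substrings, a keyword such as "At1g0093" would also hit "At1g00930". If whole-value matches against the "Symbol" and "LocusTag" columns are what is wanted, a base R sketch with %in% might look like the following. The small data frame is a hypothetical stand-in built from values in the question, and the keyword vector is assumed to have been read already, since the layout of keyword.txt is not clear from the post.

```r
# Hypothetical stand-in for gene_infoNCBI, using a few values from the question.
gene_infoNCBI <- data.frame(
  GeneID   = c(5692769, 1246500, 1246501, 1246502),
  Symbol   = c("NEWENTRY", "At1g00930", "repA2", "leuA"),
  LocusTag = c("-", "pLeuDn_01", "At1g13580", "pLeuDn_04"),
  stringsAsFactors = FALSE
)

# Assumed keyword vector; in practice it would be read from keyword.txt
# (for example with readLines() or read.table(), depending on the file).
# unique() drops the repeated entries shown in the question.
keywords <- unique(c("At1g00930", "At1g00930", "At1g13580"))

# Keep rows whose Symbol OR LocusTag is exactly one of the keywords.
hits <- gene_infoNCBI[
  gene_infoNCBI$Symbol %in% keywords | gene_infoNCBI$LocusTag %in% keywords,
]

print(hits)
```

%in% compares whole values, so it avoids accidental partial hits and does not require escaping regex metacharacters in the keywords.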

billash