Extracting data from dataframe by logic in R

Question

I have a big dataframe (60,000+ rows). I want to create a new dataframe by extracting the 10 rows whose GeneID exactly matches one of the strings in another dataframe I have. How can I do this in an 'R' way?

The first 5 rows of the big dataframe (Saponaria_mean_TPM_gene):

> Saponaria_mean_TPM_gene
# A tibble: 445,547 x 7
   GeneID               Flower Flower_bud Old_leaf     Root     Stem Young_leaf
   <chr>                 <dbl>      <dbl>    <dbl>    <dbl>    <dbl>      <dbl>
 1 TRINITY_DN0_c0_g1  612.       1202.    2282.    5645.    3645.      1740.   
 2 TRINITY_DN1_c0_g1   11.2        10.0     63.6     56.8     18.5       26.7  
 3 TRINITY_DN1_c1_g1    0.0306      0.161    0.719    0.984    5.44       0.174
 4 TRINITY_DN1_c2_g1    0.462       0.641    0.799    0.640    1.23       0.595
 5 TRINITY_DN1_c4_g1    0.327       0.140    1.13     2.43     1.80       1.54

The strings I want to match to (dataframe coex_genes):

1                                                 TRINITY_DN10031_c1_g1
2                                                 TRINITY_DN10042_c0_g1
3                                                 TRINITY_DN10042_c0_g3
4                                                 TRINITY_DN10048_c0_g1
5                                                 TRINITY_DN10058_c0_g1
6                                                 TRINITY_DN10067_c5_g1
7                                                TRINITY_DN100732_c0_g1
8                                                TRINITY_DN100752_c0_g1
9                                                 TRINITY_DN10093_c1_g5
10                                               TRINITY_DN100979_c0_g1

So for example: the row for TRINITY_DN10031_c1_g1 should be

GeneID                Flower Flower_bud Old_leaf  Root  Stem Young_leaf
  <chr>                  <dbl>      <dbl>    <dbl> <dbl> <dbl>      <dbl>
1 TRINITY_DN10031_c1_g1   1.78       2.08        0 0.226 0.544          0

I can get this manually using the code

gene1 <- filter(Saponaria_mean_TPM_gene, (GeneID == "TRINITY_DN10031_c1_g1"))

How can I write a loop (if that's sensible) or something else to find and create a dataframe of the 10 genes in coex_genes?

Do you need `merge`? `merge(Saponaria_mean_TPM_gene, coex_genes, by = 'GeneID')`? — Ronak Shah, Sep 20 '20 at 10:26
Or subset using `%in%`, like `Saponaria_mean_TPM_gene[Saponaria_mean_TPM_gene$GeneID %in% coex_genes[[1]], ]` — Allan Cameron, Sep 20 '20 at 10:28
You can also use left_join, with left dataframe being coex_genes. — Karthik S, Sep 20 '20 at 10:33
@AllanCameron Thanks, this worked once I added `slice`: `slice(Saponaria_mean_TPM_gene[Saponaria_mean_TPM_gene$GeneID %in% coex_genes[[1]], ], 1:10)` — glitterbox, Sep 20 '20 at 10:54
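
For reference, here is a minimal sketch of the `%in%` and join approaches suggested in the comments, run on toy data. Several things are assumptions: the toy tibble keeps only three of the seven columns, most of its values are made up, the ID `TRINITY_DN_not_present` is hypothetical, and the name of the ID column in `coex_genes` is not shown in the question, so it is called `GeneID` here. The `%in%` versions use `coex_genes[[1]]` and therefore do not depend on that name, but the `semi_join` line does.

```r
library(dplyr)

# Toy stand-in for the big table (values mostly invented for illustration)
Saponaria_mean_TPM_gene <- tibble(
  GeneID = c("TRINITY_DN0_c0_g1", "TRINITY_DN10031_c1_g1",
             "TRINITY_DN1_c0_g1", "TRINITY_DN10042_c0_g1"),
  Flower = c(612, 1.78, 11.2, 3.5),
  Root   = c(5645, 0.226, 56.8, 0.9)
)

# Toy stand-in for coex_genes; the column name GeneID is an assumption
coex_genes <- data.frame(
  GeneID = c("TRINITY_DN10031_c1_g1", "TRINITY_DN10042_c0_g1",
             "TRINITY_DN_not_present"),
  stringsAsFactors = FALSE
)

# Take at most the first 10 IDs from the first column of coex_genes
wanted <- head(coex_genes[[1]], 10)

# dplyr: keep rows whose GeneID exactly equals one of the wanted strings
coex_tpm <- filter(Saponaria_mean_TPM_gene, GeneID %in% wanted)

# Base R equivalent of the same subset
coex_tpm_base <- Saponaria_mean_TPM_gene[
  Saponaria_mean_TPM_gene$GeneID %in% wanted, ]

# Join-based alternative: keeps rows of the big table that have a match
# in coex_genes, without adding any columns from coex_genes
coex_tpm_join <- semi_join(Saponaria_mean_TPM_gene, coex_genes, by = "GeneID")

print(coex_tpm)
```

No loop is needed because `%in%` is vectorised over the whole `GeneID` column. IDs in `coex_genes` that are absent from the big table are dropped silently, so comparing `nrow(coex_tpm)` with `length(wanted)` is a quick way to check whether every gene was found.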
