How to subset a data frame in R based on another data frame

Question

I have a data frame of RNA-seq counts (sample names as column names, genes as row names) and a metadata file with sex, tissue type, disease status, etc. (sample names as row names; sex etc. as column names). I would like to create a subset of the RNA-seq counts that contains just two of the tissue types, so that I can look at differential gene expression (DGE). Could someone suggest the best way to do this? I'm very new to working with RNA-seq data, so this may be obvious!

Thank you!

Edit: There are >1000 samples, so picking out the columns by typing their names or positions by hand would likely be error-prone.

Hopefully this gives some insight into the counts data:

dput(head(tpm.df[1:2])) 
structure(list(Description = c("DDX11L1", "WASH7P", "MIR6859-1", 
"MIR1302-2HG", "FAM138A", "OR4G4P"), `GTEX-1117F-0226-SM-5GZZ7` = c(0L, 
187L, 0L, 1L, 0L, 0L)), row.names = c("ENSG00000223972.5", 
"ENSG00000227232.5", 
"ENSG00000278267.1", "ENSG00000243485.5", "ENSG00000237613.2", 
"ENSG00000268020.3"), class = "data.frame")

and this is the metadata:

structure(list(SMATSSCR = c(NA, NA, NA, NA, NA, 0L), SMCENTER = c("B1", 
"B1", "B1", "B1, A1", "B1, A1", "B1"), SMPTHNTS = c("", "", "", 
"", "", "2 pieces, ~15% vessel stroma, rep delineated")), row.names = c("GTEX-1117F-0003-SM-58Q7G", 
"GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7", "GTEX-1117F-0011-R10a-SM-AHZ7F", 
"GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"), class = "data.frame")

If you were following [this tutorial](https://www.reneshbedre.com/blog/edger-tutorial.html), could you perhaps select the columns of interest using the same method? I.e. `subset <- count_matrix[, c(1,2,3,7,8,9)]` — jared_mamrot, Jul 19 '22 at 10:16
Thanks for this! There are a lot of samples so I'm not sure if this would be very accurate? I haven't seen this tutorial before so will have a look through at how they work with the data — Ella, Jul 19 '22 at 10:35
It's a lot easier to understand your problem and help you troubleshoot if you can provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). I understand it's difficult with relatively large bioinformatics files, but adding e.g. the output of `dput(head(count_matrix))` to your question would make it easier to work out what's going on and why you're having problems — jared_mamrot, Jul 19 '22 at 10:39
Thank you, I'm new at this and couldn't work out how to show data — Ella, Jul 19 '22 at 11:01
You're welcome; for further advice on how best to ask questions in this forum see [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) or check out https://bioinformatics.stackexchange.com/. Also, thanks for adding sample data to your question. I've added an answer below, but if it's not applicable to your actual data, please let me know and I will edit it — jared_mamrot, Jul 19 '22 at 11:53

score 0 · Accepted Answer · answered Jul 19 '22 at 11:51

Do you have a "Tissue" column in your "metadata" dataframe? If so, you can use this to subset your "metadata" dataframe and then use that to subset your tpm values, e.g.

tpm.df <-
  structure(
    list(
      Description = c(
        "DDX11L1",
        "WASH7P",
        "MIR6859-1",
        "MIR1302-2HG",
        "FAM138A",
        "OR4G4P"
      ),
      `GTEX-1117F-0226-SM-5GZZ7` = c(0L, 187L, 0L, 1L, 0L, 0L)
    ),
    row.names = c(
      "ENSG00000223972.5",
      "ENSG00000227232.5",
      "ENSG00000278267.1",
      "ENSG00000243485.5",
      "ENSG00000237613.2",
      "ENSG00000268020.3"
    ),
    class = "data.frame"
  )

metadata <-
  structure(
    list(
      SMATSSCR = c(NA, NA, NA, NA, NA, 0L),
      SMCENTER = c("B1", "B1", "B1", "B1, A1", "B1, A1", "B1"),
      SMPTHNTS = c("", "", "", "", "", "2 pieces, ~15% vessel stroma, rep delineated"),
      TISSUE = c("Adipose", "Skin", "Adipose", "Muscle", "Skin", "Nerve")
    ),
    row.names = c(
      "GTEX-1117F-0003-SM-58Q7G", "GTEX-1117F-0003-SM-5DWSB", "GTEX-1117F-0003-SM-6WBT7",
      "GTEX-1117F-0011-R10a-SM-AHZ7F", "GTEX-1117F-0011-R10b-SM-CYKQ8", "GTEX-1117F-0226-SM-5GZZ7"
    ),
    class = "data.frame"
  )

tpm.df
#>                   Description GTEX-1117F-0226-SM-5GZZ7
#> ENSG00000223972.5     DDX11L1                        0
#> ENSG00000227232.5      WASH7P                      187
#> ENSG00000278267.1   MIR6859-1                        0
#> ENSG00000243485.5 MIR1302-2HG                        1
#> ENSG00000237613.2     FAM138A                        0
#> ENSG00000268020.3      OR4G4P                        0
metadata
#>                               SMATSSCR SMCENTER
#> GTEX-1117F-0003-SM-58Q7G            NA       B1
#> GTEX-1117F-0003-SM-5DWSB            NA       B1
#> GTEX-1117F-0003-SM-6WBT7            NA       B1
#> GTEX-1117F-0011-R10a-SM-AHZ7F       NA   B1, A1
#> GTEX-1117F-0011-R10b-SM-CYKQ8       NA   B1, A1
#> GTEX-1117F-0226-SM-5GZZ7             0       B1
#>                                                                   SMPTHNTS
#> GTEX-1117F-0003-SM-58Q7G                                                  
#> GTEX-1117F-0003-SM-5DWSB                                                  
#> GTEX-1117F-0003-SM-6WBT7                                                  
#> GTEX-1117F-0011-R10a-SM-AHZ7F                                             
#> GTEX-1117F-0011-R10b-SM-CYKQ8                                             
#> GTEX-1117F-0226-SM-5GZZ7      2 pieces, ~15% vessel stroma, rep delineated
#>                                TISSUE
#> GTEX-1117F-0003-SM-58Q7G      Adipose
#> GTEX-1117F-0003-SM-5DWSB         Skin
#> GTEX-1117F-0003-SM-6WBT7      Adipose
#> GTEX-1117F-0011-R10a-SM-AHZ7F  Muscle
#> GTEX-1117F-0011-R10b-SM-CYKQ8    Skin
#> GTEX-1117F-0226-SM-5GZZ7        Nerve

# One way to find samples of interest
subset_adipose_samples <- metadata[metadata$TISSUE %in% c("Adipose"),]
subset_adipose_samples
#>                          SMATSSCR SMCENTER SMPTHNTS  TISSUE
#> GTEX-1117F-0003-SM-58Q7G       NA       B1          Adipose
#> GTEX-1117F-0003-SM-6WBT7       NA       B1          Adipose
adipose_samples <- rownames(subset_adipose_samples)
adipose_samples
#> [1] "GTEX-1117F-0003-SM-58Q7G" "GTEX-1117F-0003-SM-6WBT7"

subset_skin_samples <- metadata[metadata$TISSUE %in% c("Skin"),]
subset_skin_samples
#>                               SMATSSCR SMCENTER SMPTHNTS TISSUE
#> GTEX-1117F-0003-SM-5DWSB            NA       B1            Skin
#> GTEX-1117F-0011-R10b-SM-CYKQ8       NA   B1, A1            Skin
skin_samples <- rownames(subset_skin_samples)
skin_samples
#> [1] "GTEX-1117F-0003-SM-5DWSB"      "GTEX-1117F-0011-R10b-SM-CYKQ8"

subset_tpm.df <- tpm.df[c(adipose_samples, skin_samples)]
#> Error in `[.data.frame`(tpm.df, c(adipose_samples, skin_samples)): undefined columns selected

Created on 2022-07-19 by the reprex package (v2.0.1)

NB. This example returns an error with your sample dataset because "tpm.df" only contains one sample column (GTEX-1117F-0226-SM-5GZZ7), which is neither an adipose nor a skin sample, but I'm relatively sure it would work with your actual data.
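
One way to avoid the "undefined columns selected" error is to keep only the sample names that are actually present in the counts data before indexing. The sketch below is a minimal illustration rather than a drop-in solution: the `counts` and `meta` objects, the gene IDs, and the sample names (`S1` to `S5`) are all made up for the example, and `TISSUE` is assumed to be the name of the tissue column, as in the code above.

```r
# Toy counts data: genes as rows, samples as columns (hypothetical names)
counts <- data.frame(
  Description = c("geneA", "geneB"),
  S1 = c(0L, 187L),
  S2 = c(5L, 3L),
  S3 = c(1L, 0L),
  row.names = c("ENSG_A", "ENSG_B")
)

# Toy metadata: samples as rows; S4 and S5 have no column in `counts`
meta <- data.frame(
  TISSUE = c("Adipose", "Skin", "Muscle", "Adipose", "Skin"),
  row.names = c("S1", "S2", "S3", "S4", "S5")
)

# Samples from the two tissues of interest, selected in a single %in% call
wanted <- rownames(meta)[meta$TISSUE %in% c("Adipose", "Skin")]

# Keep only the samples that exist as columns in the counts data
keep <- intersect(wanted, colnames(counts))

# drop = FALSE returns a data frame even if only one sample matches
subset_counts <- counts[, keep, drop = FALSE]

# Matching metadata rows, in the same order as the counts columns
subset_meta <- meta[keep, , drop = FALSE]
```

Indexing both objects with the same `keep` vector means the metadata rows line up with the counts columns, which DGE tools generally expect.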

Hi I have tried your suggestion with full data but am getting the same "undefined columns selected" error - do you have any suggestions? Thanks very much for your help! — Ella, Jul 21 '22 at 14:44
Are you sure your metadata matches your tpm.df counts data? I.e. do you have the same 'GTEX' samples in each? — jared_mamrot, Jul 21 '22 at 22:47
I have checked and they do match - I think this may be irrelevant but to check does it make a difference that when I print the GTEX names from metadata they are in "" and they are not in the dataframe — Ella, Jul 22 '22 at 07:40
Edit: I have just realised - I had samples in the metadata that weren't in the counts which was causing issues, I just had to do one previous subset and it worked! Thanks so much for your help — Ella, Jul 22 '22 at 08:37
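
For anyone hitting the same error, a quick way to see which samples are mismatched is to compare the two sets of names with `setdiff()`. This is a small sketch with made-up sample names (`S1` to `S4`); with the data in the question, the names would come from `rownames(metadata)` and `colnames(tpm.df)`. The quotation marks shown when printing names are just how R displays character vectors; they are not part of the names themselves.

```r
# Hypothetical sample names for illustration
meta_samples   <- c("S1", "S2", "S3", "S4")     # e.g. rownames(metadata)
counts_samples <- c("Description", "S1", "S2")  # e.g. colnames(tpm.df)

# Samples described in the metadata but absent from the counts
missing_from_counts <- setdiff(meta_samples, counts_samples)
missing_from_counts

# Columns in the counts with no metadata row (here, the annotation column)
missing_from_meta <- setdiff(counts_samples, meta_samples)
missing_from_meta

# TRUE only if every metadata sample has a column in the counts
all_present <- all(meta_samples %in% counts_samples)
all_present
```

If `missing_from_counts` is not empty, indexing the counts with those names fails with "undefined columns selected", so they need to be dropped from the selection first.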
