0

I was following this post R - sum value and merge column based on condition, but it only works if you have one column of values that you want to merge.

In my case, I have the following dataframe and I am unable to figure out how to merge all the values in each row for EACH sample (without having to write out each sample name, I have about 80 samples) given a condition of a column.

Here is a sample of my dataframe:

df <- data.frame(OTU_ID = c(123,456,789,101,102,232,221),
                 Kingdom = rep(c("Viridiplantae"),7),
                 Phylum = rep(c("Streptophyta"),7),
                 Class = rep(c("Magnoliopsida"),7),
                 Order = c("Asterales","Asterales","Fabales","Fabales","Dipsacales","Asterales","Fabales"),
                 Family = c("Asteraceae","Asteraceae","Fabaceae","Fabaceae","Caprifoliaceae","Asteraceae","Fabaceae"),
                 Genus = c("Gymnanthemum","Gymnanthemum","Trifolium","Trifolium","Hypericum","Felicia","Trifolium"),
                 Species = c("amygdalinum","amygdalinum","pannonicum","pratense","perforatum","heterophylla","pannonicum"),
                 Sample1 = c(436,0,0,1167,37704,0,0),
                 Sample2 = c(1146,0,0,285,38489,0,0),
                 Sample3 = c(19547,0,0,87,13732,0,0),
                 Sample4 = c(564,0,0,0,34821,0,0),
                 Sample5 = c(579,0,0,0,0,17632,0),
                 Sample6 = c(0,366,50,0,0,30457,100))

This is what the dataframe looks like:

OTU_ID       Kingdom       Phylum         Class      Order         Family        Genus      Species Sample1 Sample2 Sample3
1    123 Viridiplantae Streptophyta Magnoliopsida  Asterales     Asteraceae Gymnanthemum  amygdalinum     436    1146   19547
2    456 Viridiplantae Streptophyta Magnoliopsida  Asterales     Asteraceae Gymnanthemum  amygdalinum       0       0       0
3    789 Viridiplantae Streptophyta Magnoliopsida    Fabales       Fabaceae    Trifolium   pannonicum       0       0       0
4    101 Viridiplantae Streptophyta Magnoliopsida    Fabales       Fabaceae    Trifolium     pratense    1167     285      87
5    102 Viridiplantae Streptophyta Magnoliopsida Dipsacales Caprifoliaceae    Hypericum   perforatum   37704   38489   13732
6    232 Viridiplantae Streptophyta Magnoliopsida  Asterales     Asteraceae      Felicia heterophylla       0       0       0
7    221 Viridiplantae Streptophyta Magnoliopsida    Fabales       Fabaceae    Trifolium   pannonicum       0       0       0
  Sample4 Sample5 Sample6
1     564     579       0
2       0       0     366
3       0       0      50
4       0       0       0
5   34821       0       0
6       0   17632   30457
7       0       0     100

I want to merge all unique genera or family so that I end up with a new dataframe with merged raw read counts for each unique genus or family.

My expected outcome would be a dataframe like the following:

Kingdom         Phylum          Class           Order       Family          Genus           Sample1    Sample2      Sample3     Sample4      Sample5       Sample6
Viridiplantae   Streptophyta    Magnoliopsida   Asterales   Asteraceae      Gymnanthemum    436         1146        19547       564          579           366
Viridiplantae   Streptophyta    Magnoliopsida   Fabales     Fabaceae        Trifolium       1167        285          87         0             0             150
Viridiplantae   Streptophyta    Magnoliopsida   Dipsacales  Caprifoliaceae  Hypericum       37704       38489       13732       34821         0             0
Viridiplantae   Streptophyta    Magnoliopsida   Asterales   Asteraceae      Felicia     0           0           0           0             17632         30457

The final dataframe, if I chose to combine at genus level, would just end up with 4 rows where all Trifolium sp. and Gymnanthemum sp. read counts were summed up for each sample. Even though Trifolium has 2 unique species, it shouldn't matter because the condition is at genus level. If I had chosen Family level, Gymnanthemum and Felicia read counts should be summed up instead across all samples.

And then, if I wanted to, I could make a separate one at family level.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • 1
    Using dplyr, this is simply `group_by` and `summarize` across all but two columns: `library(dplyr); df %>% group_by(Kingdom, Phylum, Class, Order, Family, Genus) %>% summarize(across(-c(OTU_ID, Species), sum))`. This can be done in base R with a little more effort. – r2evans May 03 '23 at 12:32
  • I get an error saying 'no applicable method for 'group_by' applied to an object of class "function"' – Katherine Chau May 03 '23 at 12:35
  • The code above and the dupe-links should be sufficient to resolve your issue. If you look at the dupe-links and neither my code and the links seem to be enough Katherine Chau, please @-ping me and we can work through the differences, reopening if it makes sense. – r2evans May 03 '23 at 12:35
  • That error suggests that your real data is not named `df`. – r2evans May 03 '23 at 12:36
  • I am using the correct df name, in my case it's just called data instead of df. Still giving the error. But I will try the other posted answers above I must have missed those. – Katherine Chau May 03 '23 at 12:37
  • 1
    I understand. That error suggests that whatever object name you are using, it is referring to a function instead of actual object-data. It's often considered bad-practice (though not "evil") to use common base-R function names as names for your data objects (e.g., `df`, `data`, `dt`), since errors on reproduction due to missing data are obscure. If your _real_ frame is named `data`, then the code would be `data %>% group_by(...) %>% ...`, but I'm assuming for that that `inherits(data, "data.frame")` is true (and that you have loaded dplyr with `library(dplyr)`). – r2evans May 03 '23 at 12:38
  • 1
    Oh I see, yes that was the error then. I changed it to df and that did the trick. Thanks so much! – Katherine Chau May 03 '23 at 12:40

0 Answers0