
I have a (biological) data frame of gene abundances and the metabolic processes they contribute to.

> head(as.data.frame(df))
  Total_abundance                     process1 process10 process11 process12 process13
1        53132920 Glycolysis / Gluconeogenesis         0         0         0         0
2        35708645        Pyrimidine metabolism         0         0         0         0
3        33620967        Arginine biosynthesis         0         0         0         0
4        26119946       Fatty acid degradation         0         0         0         0
5        26119946       Fatty acid degradation         0         0         0         0
6        20600274       Fatty acid degradation         0         0         0         0
                                     process2                        process3           process4              process5
1                         Pyruvate metabolism           Propanoate metabolism Metabolic pathways     Carbon metabolism
2                   Selenocompound metabolism                               0                  0                     0
3 Alanine, aspartate and glutamate metabolism             Nitrogen metabolism Metabolic pathways                     0
4                        Butanoate metabolism              Metabolic pathways  Carbon metabolism Fatty acid metabolism
5                        Butanoate metabolism              Metabolic pathways  Carbon metabolism Fatty acid metabolism
6  Valine, leucine and isoleucine degradation alpha-Linolenic acid metabolism Metabolic pathways Fatty acid metabolism
  process6 process7 process8 process9
1        0        0        0        0
2        0        0        0        0
3        0        0        0        0
4        0        0        0        0
5        0        0        0        0
6        0        0        0        0

Unfortunately, some of the genes in this data frame contribute to more than one metabolic process (if a gene contributes to only one process, the other processX columns contain the number 0).

Currently, I am plotting only the first column, but I would like to integrate the other processes as well. This is how I am currently plotting the data:

df %>%
  ggplot(aes(x = process1, y = Total_abundance, fill = process1)) +
  geom_bar(stat = "identity")

But this covers only process1; all the other columns are ignored. How can I integrate the other columns (where they are not 0)? I thought of reshaping the data frame, but I am not sure how to do this.

Thank you. :-)

Revan
  • Yes, reshape your data using `reshape2::melt` or `tidyr::gather`. Here's the FAQ [on reshaping data from wide to long](https://stackoverflow.com/q/2185252/903061). You want a `process_number` column with values 1 to 11 and a `process_name` column with whatever values are in all your process columns currently. – Gregor Thomas Apr 09 '18 at 17:19
  • #1 rule of ggplot2 is to always, always, always convert your dataframe to long format :) – user5359531 Apr 09 '18 at 17:39
  • Or reshape with `base::reshape` – Parfait Apr 09 '18 at 17:44
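
The wide-to-long reshape suggested in the comments could look roughly like the sketch below. It is only an illustration under some assumptions: the toy `df` is a hypothetical stand-in with three process columns instead of the full set, the names `df_long` and `p` are made up for this example, and `tidyr::pivot_longer` (the newer counterpart of `tidyr::gather`) is used with dplyr >= 1.0 for `across()`. The column names `process_number` and `process_name` follow the first comment.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data mimicking the layout in the question
# (only three process columns; "0" marks an unused slot).
df <- data.frame(
  Total_abundance = c(53132920, 35708645, 33620967),
  process1 = c("Glycolysis / Gluconeogenesis",
               "Pyrimidine metabolism",
               "Arginine biosynthesis"),
  process2 = c("Pyruvate metabolism",
               "Selenocompound metabolism",
               "Alanine, aspartate and glutamate metabolism"),
  process3 = c("Propanoate metabolism", "0", "Nitrogen metabolism"),
  stringsAsFactors = FALSE
)

df_long <- df %>%
  # Convert every process column to character first, so that columns
  # holding only numeric 0 can be stacked with the text columns.
  mutate(across(starts_with("process"), as.character)) %>%
  # One row per (gene, process slot) instead of one column per slot.
  pivot_longer(
    cols = starts_with("process"),
    names_to = "process_number",
    values_to = "process_name"
  ) %>%
  # Drop the placeholder entries.
  filter(process_name != "0")

# Same bar plot as before, but over all process columns at once.
p <- ggplot(df_long,
            aes(x = process_name, y = Total_abundance, fill = process_name)) +
  geom_col() +   # equivalent to geom_bar(stat = "identity")
  coord_flip()   # long pathway names are easier to read on the y axis
```

Printing `p` draws the plot. Note that with this layout a gene's `Total_abundance` is counted once for every process it contributes to, so the bars no longer add up to the overall abundance; whether that is acceptable depends on the question being asked of the data.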
