
I have a data frame, GTs.df, of genotypes for 8 genes across 8 different genetic lines. "NA" (stored as a string) represents an ambiguous sequencing call. There are few heterozygotes ("Aa") because these are inbred lines.

library(tidyverse)  # loads dplyr, forcats and ggplot2, which the code below uses

GTs.df <- data.frame(Gene = rep(c("Zm1","Zm2","Zm3","Zm4","Zm5","Zm6","Zm7","Zm8"), each=8),
  Line = rep(c("L1", "L2", "L3", "L4", "L5", "L6", "L7", "L8"), times = 8),
  Genotype = c(rep(c("aa", "Aa", "AA", "NA"), times = c(2, 1, 5, 0)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(4, 0, 1, 3)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(4, 1, 3, 0)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(3, 0, 4, 1)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(4, 0, 3, 1)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(5, 1, 2, 0)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(1, 0, 3, 4)),
               rep(c("aa", "Aa", "AA", "NA"), times = c(1, 1, 6, 0))
               )
  )

I want to compare the distribution of genotypes across the lines for each gene, so I make this stacked bar plot initially:

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa"))) %>%
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")

[Figure: stacked bar plot produced by the code above]

The problem is that I want the genes (columns) ordered by their number of "aa" calls so that the plot is more readable. I can reorder the genes with fct_reorder, as suggested by tbradley / 48748250/FilipW and demonstrated below:

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa")),
         Gene = fct_reorder(Gene,
                            as.numeric(Genotype),
                            .fun = mean)
         ) %>%
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")

[Figure: stacked bar plot produced by the code above]

As you can see, this orders the genes fairly well by sorting on the mean genotype score, which is effectively a proportion. It is imperfect here, though, because of the missing data points and because there are more than two genotype levels. The last gene (Zm2) has fewer "aa" lines than the gene before it, but a higher mean because it has fewer non-missing calls.

I also tried a variation of this using sum instead of mean.

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa")),
         Gene = fct_reorder(Gene,
                            as.numeric(Genotype),
                            .fun = sum)
         ) %>%
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")

[Figure: stacked bar plot produced by the code above]

This also almost works, but it is still imperfect. Gene Zm4 has fewer "aa" calls than the column before it; I guess that is because Zm4 has more total data points contributing to the sum.
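To make the two mismatches concrete, here is a small sketch that tabulates, per gene, the count of "aa" next to the mean and sum of the numeric genotype score that fct_reorder was sorting on. The names gt_counts, score and gene_summary are made up for illustration, and the data frame is rebuilt compactly only so the snippet runs on its own.

```r
library(dplyr)
library(forcats)

# Same data as GTs.df above, rebuilt compactly so this runs on its own.
gt_counts <- list(c(2, 1, 5, 0), c(4, 0, 1, 3), c(4, 1, 3, 0), c(3, 0, 4, 1),
                  c(4, 0, 3, 1), c(5, 1, 2, 0), c(1, 0, 3, 4), c(1, 1, 6, 0))
GTs.df <- data.frame(
  Gene = rep(paste0("Zm", 1:8), each = 8),
  Line = rep(paste0("L", 1:8), times = 8),
  Genotype = unlist(lapply(gt_counts, function(n) {
    rep(c("aa", "Aa", "AA", "NA"), times = n)
  }))
)

gene_summary <- GTs.df %>%
  filter(Genotype != "NA") %>%
  # Numeric score used by the fct_reorder attempts: AA = 1, Aa = 2, aa = 3
  mutate(score = as.numeric(fct_relevel(Genotype, c("AA", "Aa", "aa")))) %>%
  group_by(Gene) %>%
  summarise(n_aa       = sum(Genotype == "aa"),  # what I actually want to sort by
            n_lines    = n(),                    # non-missing calls per gene
            mean_score = mean(score),            # what .fun = mean sorts by
            sum_score  = sum(score)) %>%         # what .fun = sum sorts by
  arrange(n_aa)

print(gene_summary)
```

In this table, Zm2 has a higher mean_score than Zm6 despite having fewer "aa" calls, and Zm2 and Zm4 tie on sum_score even though their "aa" counts differ, which matches what the two plots show.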

Ideally, I would use some sort of count function instead, but neither n nor count works for me, no matter what class I change Genotype to. (I tried many combinations, so I will spare you the long, depressing list of error messages.)

I did find a non-tidy solution from 48748250/talat that arranges the columns by count/absolute frequency of "aa" as desired:

gene_lvls <- names(sort(table(GTs.df[GTs.df$Genotype == "aa", "Gene"])))

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                  c("AA", "Aa", "aa"))) %>%
  ggplot() +
  aes(x = factor(Gene, 
                 levels = gene_lvls),
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")

[Figure: stacked bar plot produced by the code above]

But I am hoping there's a tidy/dplyr/forcats-friendly way to achieve this, partly for learning/understanding and partly for pickiness/aesthetic pleasure. Based on the number of similar forum questions, I have a feeling other people would be pleased by such a solution too. Bonus points if the solution has a secondary sort key (tie-breaker) for when multiple columns have an equal number of "aa", as Zm2, Zm3 and Zm5 do in the above plot.

Thank you in advance for your time and effort!

Here are some other forum pages that are somewhat related:

R ggplot2 Reorder stacked plot ?

How to control ordering of stacked bar chart using identity on ggplot2

sort columns with categorical variables by numerical varables in stacked barplot

1 Answer


Here's one approach using fct_inorder after calculating some gene-wise metrics, such as the number of "aa" calls and the total number of lines. This provides a pretty flexible way of creating whatever sorting metric you want, which could involve multiple tie-breakers.

GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, 
                                c("AA", "Aa", "aa"))) %>%
  group_by(Gene) %>%
  mutate(num_aa = sum(Genotype == "aa"),
         ttl_lines = n()) %>%
  ungroup() %>%
  arrange(num_aa, ttl_lines) %>%        # Define your tie-breakers here
  mutate(Gene = fct_inorder(Gene)) %>%  # Assign factor in order of appearance 
  
  ggplot() +
  aes(x = Gene,
      fill = Genotype) +
  geom_bar(position = "stack",
           stat = "count") + 
  ylab("Number of Lines")

[Figure: stacked bar plot produced by the code above]
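As a possible alternative that stays inside fct_reorder, here is a hedged sketch. It relies on the fact that fct_reorder applies .fun to the values of .x within each factor level, so passing the logical vector Genotype == "aa" with .fun = sum counts the "aa" calls per gene. The tie-breaker is handled by reordering twice, least important key first, on the assumption that ties in the second reorder keep the level order left by the first. The name plot_df is made up for illustration, and the data frame is rebuilt compactly only so the snippet runs on its own.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Same data as GTs.df in the question, rebuilt compactly so this runs on its own.
gt_counts <- list(c(2, 1, 5, 0), c(4, 0, 1, 3), c(4, 1, 3, 0), c(3, 0, 4, 1),
                  c(4, 0, 3, 1), c(5, 1, 2, 0), c(1, 0, 3, 4), c(1, 1, 6, 0))
GTs.df <- data.frame(
  Gene = rep(paste0("Zm", 1:8), each = 8),
  Line = rep(paste0("L", 1:8), times = 8),
  Genotype = unlist(lapply(gt_counts, function(n) {
    rep(c("aa", "Aa", "AA", "NA"), times = n)
  }))
)

plot_df <- GTs.df %>%
  filter(Genotype != "NA") %>%
  mutate(Genotype = fct_relevel(Genotype, c("AA", "Aa", "aa")),
         # Tie-breaker first: order genes by number of non-missing lines
         Gene = fct_reorder(Gene, Line, .fun = length),
         # Primary key second: order genes by their count of "aa" calls
         Gene = fct_reorder(Gene, Genotype == "aa", .fun = sum))

p <- ggplot(plot_df) +
  aes(x = Gene, fill = Genotype) +
  geom_bar(position = "stack", stat = "count") +
  ylab("Number of Lines")

print(p)
```

With this data, the two-step reorder should give the same gene order as arrange(num_aa, ttl_lines) above. The arrange plus fct_inorder version is probably still easier to read once there are more than two sort keys.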

Jon Spring