I'm making the following bar plot with ggplot:

df  %>% ggplot( aes(x= group,y= cases,fill=color   ) ) + 
  geom_bar(stat="identity") + 
  theme_minimal()

Which gives the following result:

enter image description here

The issue is that the smaller colored segments are not visible, so I tried to use a log scale:

df  %>% ggplot( aes(x= group,y= cases,fill=color   ) ) + 
  geom_bar(stat="identity") + 
  scale_y_log10(labels = comma) +  
  theme_minimal()

enter image description here

But this completely broke the scales: now I'm getting a value of 10 million from nowhere, and the bar sizes are wrong.

The data I'm using for this is the following:

index,group,color,cases
1,4,4,9
2,4,3,61
3,1,1,5000
4,4,2,138
5,4,1,246
6,3,1,359
7,2,1,2000
8,3,2,57
9,1,2,153
10,2,2,130
11,2,3,15
12,1,3,23
13,3,3,11
14,2,4,1
Luis Ramon Ramirez Rodriguez
    When you stack a bar chart on a logarithmic scale, you are solving this math problem: log(a) + log(b) + log(c) = log(a*b*c), hence your 10 million value. – Dave2e Apr 16 '20 at 18:29

1 Answer

TL;DR: You cannot and should not use a log scale with a stacked barplot. If you want to use a log scale, use a "dodged" barplot instead. You'll also have better luck using geom_col instead of geom_bar here and setting your fill= variable as a factor.

Geom_col vs. geom_bar

Try using geom_col in place of geom_bar. You can use coord_flip() if the direction is not to your liking. See here for reference, but the gist is that geom_bar is meant for plotting counts, while geom_col is meant for plotting values that are already in your data (geom_col() behaves like geom_bar(stat="identity"), it just says so explicitly). Here, your y-axis is "cases" (a value), so use geom_col.
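To make the count-versus-value distinction concrete, here is a minimal sketch. The `toy` data frame is hypothetical and is not part of the question's data; `layer_data()` is used only to inspect the heights ggplot2 computes for each layer.

```r
library(ggplot2)

# Hypothetical toy data: one row per group, with a precomputed value.
toy <- data.frame(group = c("a", "b"), cases = c(3, 7))

# geom_bar() defaults to stat = "count": bar height is the number of rows.
p_bar <- ggplot(toy, aes(x = group)) + geom_bar()

# geom_col() uses the y values exactly as given in the data.
p_col <- ggplot(toy, aes(x = group, y = cases)) + geom_col()

# Inspect the bar heights each layer ends up with.
layer_data(p_bar)$y   # row counts per group
layer_data(p_col)$y   # values taken from the cases column
```

With one row per group, geom_bar() draws every bar with height 1, while geom_col() draws the stored values.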

The Problem with Log Scales and Stacked Barplots

With that being said, u/Dave2e is absolutely correct. The plot you are getting makes sense, because the underlying math being done to calculate the y-axis values is: log10(x) + log10(y) + log10(z) instead of what you expected, which was log10(x + y + z).

Let's use the numbers in your actual data frame for comparison here. In "group 1", you have the following:

index group color cases
    3     1     1  5000
    9     1     2   153
   12     1     3    23

On the y-axis of a stacked barplot without a log scale, the total height of the bar is the sum of all the individual values. In other words:

> 5000 + 153 + 23
[1] 5176

This means that each of the bars represents the correct relative size, and when you add them up (or stack them up), the total size of the bar is equivalent to the total sum. Makes sense.

Now consider the same case, but for a log10 scale:

> log10(5000) + log10(153) + log10(23)
[1] 7.245389

Back on the original scale that is 10^7.245389, or about 17.6 million, which is the product 5000 * 153 * 23 rather than the sum. The total height of the bar is still the sum of all individual bars (because that's what a stacked barplot is), and you can still compare the relative sizes, but the sum of the individual logs does not equal the log of the sum:

> log10(5000 + 153 + 23)
[1] 3.713994
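The same comparison can be run for every group at once. This is a rough sketch in base R that rebuilds the question's data with read.csv(text = ...); the names `sum_of_logs`, `log_of_sum` and `comparison` are made up for illustration.

```r
# Data from the question, rebuilt as a data frame.
df <- read.csv(text = "index,group,color,cases
1,4,4,9
2,4,3,61
3,1,1,5000
4,4,2,138
5,4,1,246
6,3,1,359
7,2,1,2000
8,3,2,57
9,1,2,153
10,2,2,130
11,2,3,15
12,1,3,23
13,3,3,11
14,2,4,1")

# What stacking on a log10 axis adds up: the sum of the logs in each group.
sum_of_logs <- tapply(log10(df$cases), df$group, sum)

# What was expected instead: the log of each group's total.
group_total <- tapply(df$cases, df$group, sum)
log_of_sum <- log10(group_total)

comparison <- data.frame(
  group         = names(sum_of_logs),
  sum_of_logs   = as.vector(sum_of_logs),
  implied_total = 10^as.vector(sum_of_logs),  # the product of the cases
  log_of_sum    = as.vector(log_of_sum),
  actual_total  = as.vector(group_total)
)
print(comparison)
```

For group 1 this reproduces the two numbers above: 7.245389 for the sum of the logs and 3.713994 for the log of the sum.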

Suggested Way to Change your Plot

Moral of the story: you can still use a log scale to "stretch out" the small bars, but don't stack them. Use position='dodge':

df  %>% ggplot( aes(x= group,y= log10(cases),fill=as.factor(color)   ) ) + 
    geom_col(position='dodge') + 
    theme_minimal()

enter image description here

Finally, position='dodge' (or position=position_dodge(width=...)) does not work with fill=color, since df$color is not a factor (it's numeric). This is also why your legend is showing a gradient for a categorical variable. That's why I used as.factor(color) in the ggplot call here, although you can also just apply that to the original dataset with df$color <- as.factor(df$color) and do the same thing.
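If you would rather keep the axis labelled in the original units, one possible variation (a sketch, not the code used for the plot above) is to convert color once in the data frame and combine the dodged geom_col with scale_y_log10(). The data frame is rebuilt here from the question so the snippet runs on its own.

```r
library(ggplot2)

# Data from the question, rebuilt as a data frame.
df <- read.csv(text = "index,group,color,cases
1,4,4,9
2,4,3,61
3,1,1,5000
4,4,2,138
5,4,1,246
6,3,1,359
7,2,1,2000
8,3,2,57
9,1,2,153
10,2,2,130
11,2,3,15
12,1,3,23
13,3,3,11
14,2,4,1")

# Treat color as categorical so fill is discrete and dodging works.
df$color <- as.factor(df$color)

# Dodged bars on a log10 axis; nothing is stacked, so no logs are summed.
p <- ggplot(df, aes(x = group, y = cases, fill = color)) +
  geom_col(position = "dodge") +
  scale_y_log10() +
  theme_minimal()

p
```

On a log10 axis the bars start at 1, so the single row with cases = 1 has zero height in either version.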

chemdork123
  • I'm getting the same result. The dataset I'm using is at the end of the question. – Luis Ramon Ramirez Rodriguez Apr 16 '20 at 19:32
  • You're right - take u/Dave2e's comment to heart here, because the moral of the story is to not use a log scale with STACKED barplots: use a dodged plot instead. You should switch to `geom_col` and also set your `fill=` aesthetic as a factor for proper formatting. See the edits made to the answer, which should now be quite complete. – chemdork123 Apr 16 '20 at 20:47