3

I have a dataframe of ~108m rows of data, in 7 columns. I use this R script to make a boxplot of it:

ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
  geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
  scale_y_log10() +
  stat_summary(fun.y=mean, geom="line", aes(group=1, colour="red")) +
  ylab(expression(Exposure~to~NO[x])) + 
  xlab(expression(Hour~of~the~day)) +
  ggtitle("Hourly exposure to NOx") +
  theme(axis.text=element_text(size=12, colour="black"),
        axis.title=element_text(size=12, colour="black"),
        plot.title=element_text(size=12, colour="black"),
        legend.position="none")

The graph looks like this. It's mostly fine, but it would be better to have a labelled value towards the top of the Y axis. I guess it should be something like 1000, given that the Y axis is a log10 scale, but I'm not sure how to do this.

(image: boxplot produced by the code above)

Any ideas please?

EDIT: In response to DrDom, who suggested adding scale_y_log10(breaks=c(0,10,100,1000)). The output of doing that is this:

(image: resulting plot)

Adding limits as well, with scale_y_log10(breaks=c(0,10,100,1000), limits=c(0,1000)), produces this error:

Error in seq.default(dots[[1L]][[1L]], dots[[2L]][[1L]], length = dots[[3L]][[1L]]:
'from' cannot be NA, NaN or infinite

In response to Jaap, who suggested the following code:

library(ggplot2)
library(scales)

ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
  geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
  stat_summary(fun.y=mean, geom="line", aes(group=1, colour="red")) +
  scale_y_continuous(breaks=c(0,10,100,1000,3000), trans="log1p") +
  labs(title="Hourly exposure to NOx", x=expression(Hour~of~the~day), y=expression(Exposure~to~NO[x])) +
  theme(axis.text=element_text(size=12, colour="black"), axis.title=element_text(size=12, colour="black"),
        plot.title=element_text(size=12, colour="black"), legend.position="none")

It produces this graph. Have I done something wrong? I'm still missing a '1000' tick label. A tick in between the 10 and the 100 would also be good, given that is where most of the data is.

(image: resulting plot)

TheRealJimShady
  • Try to add `scale_y_log10(breaks=c(0,10,100,1000))` or `scale_y_log10(breaks=c(0,10,100,1000), limits=c(0,1000))` – DrDom Jul 09 '14 at 15:49
  • Hi DrDom. Thanks for your suggestions. The results are added to my post above. Note that the data looks a bit different because running the graph creation again takes about 20 minutes, so I just used a subset of the data for demonstration purposes. Not quite there yet. :-( – TheRealJimShady Jul 09 '14 at 16:13
  • I have written a function which does that automatically: https://stackoverflow.com/a/54325289/3082472 – akraf Jan 23 '19 at 10:42

2 Answers

4

You can customise the log scale by passing a breaks= argument to scale_y_log10(). The breaks must not include 0, because the log transformation is applied to the break values as well, and log10(0) is -Inf.

library(ggplot2)

# Dummy data spanning several orders of magnitude
df <- data.frame(x = 1:10000, y = 1:10000)
ggplot(df, aes(x, y)) + geom_line() +
  scale_y_log10(breaks = c(1, 5, 10, 85, 300, 5000))
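
To see why a 0 break is a problem, it can help to apply the transformation by hand. The following is a minimal base-R sketch of the arithmetic only, not of ggplot2's internals; the variable names are illustrative.

```r
breaks <- c(0, 10, 100, 1000)

# A log10 scale positions each break at its log10 value
transformed <- log10(breaks)
print(transformed)  # the first element is -Inf

# Only breaks with a finite transformed value can be placed on the axis
usable_breaks <- breaks[is.finite(transformed)]
print(usable_breaks)
```

The same reasoning probably explains the error reported in the question for limits=c(0,1000): the lower limit becomes -Inf on the log scale, which matches the "'from' cannot be NA, NaN or infinite" message.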
Didzis Elferts
3

Instead of using scale_y_log10 you can also use scale_y_continuous together with a log transformation from the scales package. When you use the log1p transformation, you are also able to include a 0 in your breaks: scale_y_continuous(breaks=c(0,1,3,10,30,100,300,1000,3000), trans="log1p")
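
As a rough illustration of why 0 is allowed here: log1p(x) computes log(1 + x), so 0 maps to 0 rather than to -Inf. The base-R sketch below only shows that arithmetic; the variable names are illustrative.

```r
breaks <- c(0, 1, 3, 10, 30, 100, 300, 1000, 3000)

# Position of each break on the transformed axis
positions <- log1p(breaks)
print(round(positions, 2))

# log1p(0) is 0, so a break at 0 is finite and can be drawn
stopifnot(log1p(0) == 0, all(is.finite(positions)))

# expm1() is the inverse of log1p(), mapping axis positions back to data values
recovered <- expm1(positions)
print(recovered)
```

For values much larger than 1, log1p(x) is close to log(x), so the upper part of the axis looks much like an ordinary log scale.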

Your complete code will then look like this (notice that I also combined the title arguments in labs):

library(ggplot2)
library(scales)

ggplot(expanded_results, aes(factor(hour), dynamic_nox)) +
  geom_boxplot(fill="#6699FF", outlier.size = 0.5, lwd=.1) +
  stat_summary(fun.y=mean, geom="line", aes(group=1, colour="red")) +
  scale_y_continuous(breaks=c(0,1,3,10,30,100,300,1000,3000), trans="log1p") +
  labs(title="Hourly exposure to NOx", x=expression(Hour~of~the~day), y=expression(Exposure~to~NO[x])) +
  theme(axis.text=element_text(size=12, colour="black"), axis.title=element_text(size=12, colour="black"),
        plot.title=element_text(size=12, colour="black"), legend.position="none")
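
The 1-3-10 pattern in the breaks above could also be generated rather than typed out. This is only a sketch: breaks_1_3 is a hypothetical helper written for illustration, not a function from ggplot2 or scales.

```r
# Hypothetical helper (not part of ggplot2 or scales): returns 0, 1, 3, 10, 30, ...
# keeping only values that do not exceed `upper`.
breaks_1_3 <- function(upper) {
  stopifnot(is.numeric(upper), length(upper) == 1, upper >= 1)
  max_power <- floor(log10(upper))
  # outer() gives a 2-row matrix (1 and 3 times each power of ten);
  # as.vector() reads it column by column, so the result is already sorted
  candidates <- as.vector(outer(c(1, 3), 10^(0:max_power)))
  c(0, candidates[candidates <= upper])
}

print(breaks_1_3(3000))
```

The result could then be passed as breaks=breaks_1_3(3000) to scale_y_continuous() together with trans="log1p".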
Jaap
  • I thought that it should read trans="log10" instead of trans="log1p". Can you clarify? – TheRealJimShady Jul 10 '14 at 10:30
  • @TheRealJimShady You can use both. However when your `dynamic_nox` has values of `0`, these observations will be excluded. `log1p` will add `1` to the value and then calculate the log-value. In this way also values of `0` will be included. See [this answer for an example](http://stackoverflow.com/questions/24646594/how-to-improve-the-aspect-of-ggplot-histograms-with-log-scales-and-discrete-valu/24649522#246495220) – Jaap Jul 10 '14 at 10:54
  • Hi Jaap. Thanks again for your guidance. I've put the output of your suggestions in my question above. Would you take a glance please? I'm missing some tick marks still. Not sure what to do about that? Thank you. – TheRealJimShady Jul 10 '14 at 11:25
  • @TheRealJimShady Can you include a `dput` of (part of) your data in your question ([see here for an explanation how to do that](http://stackoverflow.com/a/5963610/2204410))? That would make it much easier for me to diagnose the problem. – Jaap Jul 10 '14 at 12:01
  • I think I've figured it out now, actually. I've done this and it seems to work well: `scale_y_continuous(breaks=c(0,10,100,300,1000,3000), trans="log1p", labels=c(0,10,100,300,1000,3000))` – TheRealJimShady Jul 10 '14 at 13:03