8

I have a dataset with some outliers, such as the following

x <- rnorm(1000,0,20)
x <- c(x, 500, -500)

If we plot this on a linear x axis scale at this we see

histogram(x)

non log x-axis

I worked out a nice way to put it on a log scale using this useful thread: how to use a log scale for y-axis of histogram in R? :

mat <- data.frame(x)
ggplot(ee, aes(x = xx)) + geom_histogram(colour="darkblue", size=1, fill="blue") + scale_x_log10()

log x-axis

However, I would like the x axis labels from this 2nd example to match that of the first example, except with a kind of "negative log" - i.e. first tick (moving from the centre to the left) could be -1, then the next could be -10, the next -100, but all equidistant. Does that make sense?

Community
  • 1
  • 1
Jim Bo
  • 657
  • 3
  • 9
  • 16
  • 2
    I don't understand what you're asking. Is it that you want the negative data from the plot (which is currently ignored) to be include somewhere? – David Robinson Jan 24 '13 at 15:33
  • I don't think the negative data is being ignored, I believe it is being plotted? but the value on the x axis is logged? – Jim Bo Jan 24 '13 at 15:52
  • 1
    What exactly do you think you get when you take the log of a negative number...? – joran Jan 24 '13 at 15:58
  • @JimBo: What you are seeing there is only the positive data. While when you look at it on a coarse histogram this data might look like it's normal (or close to it), what you actually have here is reflected log-normal data- that is, two separate distributions, one positive and one negative, that are each (very roughly) log-normally distributed. – David Robinson Jan 24 '13 at 16:00
  • NA - so I guess I'm suggesting you would log the absolute value of the negative numbers, and then plot them on a log scale on the histogram. Does that make sense? As a way of "squashing" the histogram. – Jim Bo Jan 24 '13 at 16:03
  • If that's what you want, then maybe all you need to do is `aes(x = abs(xx))`, although I'm not at all convinced it's a good way to visualize your data. – joran Jan 24 '13 at 16:05
  • @joran, Hmm, I've explained this very badly, what I mean is I want something like the first histogram, but with the xaxis brought in - so it's plotting the same data, but instead of the xaxis marks being -600 -400 -200 0 200 400 600, i would like the marks to be something like -600 -100 -10 -1 0 - 10 100 600... – Jim Bo Jan 24 '13 at 16:08
  • @DavidRobinson oof, you are correct, I repeated with x <-x[x>0] and got the same results. – Jim Bo Jan 24 '13 at 16:14
  • @joran I mean - with the xaxis brought in, and thus the data being more interpretable... – Jim Bo Jan 24 '13 at 16:15

4 Answers4

20

I am not sure I understand your goal, but when you want a log-like transformation yet have zeroes or negative values, the inverse hyperbolic sine transformation asinh() is often a good option. It is log-like for large values and is defined for all real values. See Rob Hyndman's blog and this question on stats.stackexchange.com for discussion, details, and other options.

If this is an acceptable approach, you can create a custom scale for ggplot. The code below demonstrates how to create and use a custom scale (with custom breaks), along with a visualization of the asinh() transformation.

library(ggplot2)
library(scales)

limits <- 100
step <- 0.005
demo <- data.frame(x=seq(from=-1*limits,to=limits,by=step))

asinh_trans <- function(){
  trans_new(name = 'asinh', transform = function(x) asinh(x), 
            inverse = function(x) sinh(x))
}

ggplot(demo,aes(x,x))+geom_point(size=2)+
     scale_y_continuous(trans = 'asinh',breaks=c(-100,-50,-10,-1,0,1,10,50,100))+
     theme_bw()

enter image description here

ggplot(demo,aes(x,x))+geom_point(size=2)+
     scale_x_continuous(trans = 'asinh',breaks=c(0,1,10,50,100))+
     scale_y_log10(breaks=c(0,1,10,50,100))+ # zero won't plot
     xlab("asinh() scale")+ylab("log10 scale")+
     theme_bw()

enter image description here

Michael Barton
  • 8,868
  • 9
  • 35
  • 43
MattBagg
  • 10,268
  • 3
  • 40
  • 47
  • This looks good, I just need to get my head around it and make sure I have the xaxis in the right places! – Jim Bo Jan 24 '13 at 17:04
  • Hmm - it seems to cause a "dip" in the histogram around 0.. If it helps clarify my question, I am looking for: something like the first histogram, but with the xaxis brought in - so it's plotting the same data, but instead of the xaxis marks being -600 -400 -200 0 200 400 600, i would like the marks to be something like -1000 -100 -10 -1 0 - 10 100 1000, with the numbers equal distance apart. As I understand it, this should lead to a more stretched out histogram, and bring in the outliers, but the peak would be around 0? Though I may be wrong. – Jim Bo Jan 24 '13 at 17:23
  • Hmm - what if I perform 10^x and then plot on the log scale - then I just need to rejig the axes.. does that make sense? – Jim Bo Jan 24 '13 at 17:32
  • 1
    No the peak will not be around 0. You will get a dip because you are making unequal width bins and there are fewer items in smaller bins. Even if you make a custom mirrored log scale like I demo'd using ifelse() or another logical test to do different things for positive, zero, and negative values (i.e. -log(-x) for negative and 0 for 0), you will still have a histogram that has two peaks. – MattBagg Jan 24 '13 at 20:55
  • You could calculate the bins and their contents outside of ggplot, maybe using cut() or Hmisc's cut2() and make sure that the bins around zero are wider so you get one peak. I think you would need to use geom_bar() and explicitly pass the x,y, and width rather than having ggplot calculate. – MattBagg Jan 24 '13 at 20:57
  • OK great. this has worked out harder than I expected! I will have a play around and hopefully figure it out, in which case I will post back here! – Jim Bo Jan 25 '13 at 09:05
2

Realizing that the question is fairly old, I decided to answer it anyway since I ran into exactly the same problem.

I see that some answers above misunderstood your original question. I think it is a valid visualization question and I outline below my solution that will hopefully be useful for others as well.

My approach was to use ggplot and create custom log transform for x and y axis (as well as custom break generators)

library(ggplot2)
library(scales)

# Create custom log-style x axis transformer (...,-10,-3,-1,0,1,3,10,...)
custom_log_x_trans <- function()
  trans_new("custom_log_x",
            transform = function (x) ( sign(x)*log(abs(x)+1) ),
            inverse = function (y) ( sign(y)*( exp(abs(y))-1) ),
            domain = c(-Inf,Inf))

# Custom log x breaker (...,-10,-3,-1,0,1,3,10,...)
custom_x_breaks <- function(x)
{ 
  range <- max(abs(x), na.rm=TRUE)

  return (sort( c(0,
                  sapply(0:log10(range), function(z) (10^z) ),
                  sapply(0:log10(range/3), function(z) (3*10^z) ),
                  sapply(0:log10(range), function(z) (-10^z) ),
                  sapply(0:log10(range/3), function(z) (-3*10^z) )
  )))
}

# Create custom log-style y axis transformer (0,1,3,10,...)
custom_log_y_trans <- function()
  trans_new("custom_log_y",
            transform = function (x) ( log(abs(x)+1) ),
            inverse = function (y) ( exp(abs(y))-1 ),
            domain = c(0,Inf))

# Custom log y breaker (0,1,3,10,...)
custom_y_breaks <- function(x)
{ 
  max_y <- length(x)

  range <- max(abs(max_y), na.rm=TRUE)

  return (sort( c(0,
                  sapply(0:log10(range), function(z) (10^z) ),
                  sapply(0:log10(range/3), function(z) (3*10^z) )
  )))
}

ggplot(data=mat) +
  geom_histogram(aes(x=x,fill=..count..), 
                 binwidth = 1, color="black", size=0.1) +
  scale_fill_gradient("Count", low = "steelblue", high = "red") +
  coord_trans(x="custom_log_x",y="custom_log_y") +
  scale_x_continuous(breaks = custom_x_breaks(mat$x)) +
  scale_y_continuous(breaks = custom_y_breaks(mat$x)) +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))  + 
  theme_bw()

which gives me the following plot.

enter image description here

Note that:

  • the plot also includes coloring scheme to show visually the absolute value of each bar.
  • the bins become increasingly thinner as x increases (side effect of log-transform)

In either case, the two outliers are clearly visible.

akhmed
  • 3,536
  • 2
  • 25
  • 35
1

I found a way to cheat on it. I say "cheat", because it actually plot negative and positive parts of the data separately. Thus you can not compare the negative and positive data. But only can show the distribution of negative and positive parts separately.

And one of the problem is if there is zero values in your data, it will not be shown in the plot.

reverselog_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  trans_new(paste0("reverselog-", format(base)), trans, inv, 
            log_breaks(base = base), 
            domain = c(1e-100, Inf))
}

quartz();


dist1 <- ggplot(data=df.meltFUAC) +
  geom_point(alpha=1,aes(x=deltaU.deltaUltrasensitivity,y=deltaF.deltaFitness, 
                         colour=deltaF.w_c)) + 
  scale_x_continuous(name = expression(Delta * S[ult]), 
                     limits=c(1e-7,1),trans = "log10",breaks=c(1e-01,1e-03,1e-05),
                     labels=c("1e-01","1e-03","1e-05")) + 
  scale_y_continuous(name = expression(paste(Delta, " Fitness")),trans = "log10", 
                     limits = c(1e-7,1), breaks=c(1e-01,1e-03,1e-05),
                     labels=c("1e-01","1e-03","1e-05")) +
  theme_bw() +
  theme(legend.position = "none", axis.title.x=element_blank(),strip.background=element_blank(),
        panel.border=element_rect(colour = "black"),panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),plot.background=element_blank(),
        plot.margin=unit(c(0,0,0,-11),"mm"))

dist2 <- ggplot(data=df.meltFUAC, aes(x=-deltaU.deltaUltrasensitivity,y=deltaF.deltaFitness, 
                                      colour=deltaF.w_c)) +
  geom_point(alpha=1) + 
  scale_x_continuous(name = expression(Delta * sqrt(S[ult] %.% S[amp])),limits=c(1,1e-7),
                     trans = reverselog_trans(10),breaks=c(1e-01,1e-03,1e-05),
                     labels=c("-1e-01","-1e-03","-1e-05")) +
  scale_y_continuous(name = expression(paste(Delta, " Fitness")),trans = "log10", 
                     limits = c(1e-7,1), breaks=c(1e-01,1e-03,1e-05),
                     labels=c("1e-01","1e-03","1e-05")) +
  theme_bw() +
  theme(legend.position = "none",strip.background=element_blank(),panel.border=element_rect(colour = "black"),
        axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.y=element_blank(),
        axis.line.y=element_line(colour="black",size=1,linetype="solid"),axis.title.x=element_blank(),
        panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),
        plot.margin=unit(c(0,-8,0,2.5),"mm"))

hist0 <- ggplot(data=df.meltFUAC, aes(deltaF.deltaFitness,fill=deltaF.w_c)) +
  #geom_histogram(alpha = 0.5, aes(y=..density..),position = 'identity') +
  geom_density(alpha = 0.5, aes(colour=deltaF.w_c)) +
  scale_x_continuous(name = expression(paste(Delta, " Fitness")), 
                     limits=c(1e-7,1),trans = "log10",breaks=c(1e-01,1e-03,1e-05),
                     labels=c("1e-01","1e-03","1e-05")) + 
  scale_y_continuous(name = "Density", limits=c(0,0.6)) + 
  theme_bw() +
  theme(legend.position = "none", axis.title.x=element_blank(),strip.background=element_blank(),
        axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_blank(), axis.ticks.x=element_blank(), axis.title.x=element_blank(),
        panel.border=element_rect(colour = "black"),panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),plot.background=element_blank(),
        plot.margin=unit(c(0,5,2.5,-2.5),"mm")) +
  coord_flip()

hist1 <- ggplot(data=df.meltFUAC, aes(deltaU.deltaUltrasensitivity,fill=deltaF.w_c)) +
  #geom_histogram(alpha = 0.5, aes(y=..density..),position = 'identity') +
  geom_density(alpha = 0.5, aes(colour=deltaF.w_c)) +
  scale_x_continuous(name = expression(Delta * S[ult]), 
                     limits=c(1e-7,1),trans = "log10",breaks=c(1e-01,1e-03,1e-05),
                     labels=c("1e-01","1e-03","1e-05")) + 
  scale_y_continuous(name = "Density", limits=c(0,0.6)) + 
  theme_bw() +
  theme(legend.position = "none", axis.title.x=element_blank(),strip.background=element_blank(),
        axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_blank(), axis.ticks.x=element_blank(), axis.title.x=element_blank(),
        axis.line.x=element_line(colour="black",size=1,linetype="solid"),
        panel.border=element_rect(colour = "black"),panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),plot.background=element_blank(),
        plot.margin=unit(c(5,0,-2.5,2),"mm"))

hist2 <- ggplot(data=df.meltFUAC, aes(-deltaU.deltaUltrasensitivity,fill=deltaF.w_c)) +
  #geom_histogram(alpha = 0.5, aes(y=..density..),position = 'identity') +
  geom_density(alpha = 0.5, aes(colour=deltaF.w_c)) +
  scale_x_continuous(name = expression(Delta * S[ult]),limits=c(1,1e-7),
                     trans = reverselog_trans(10),breaks=c(1e-01,1e-03,1e-05),
                     labels=c("-1e-01","-1e-03","-1e-05")) +
  scale_y_continuous(name = "Density", limits=c(0,0.6)) + 
  theme_bw() +
  theme(legend.position = "none",strip.background=element_blank(),panel.border=element_rect(colour = "black"),
        axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.y=element_blank(),
        axis.text.x=element_blank(), axis.ticks.x=element_blank(), axis.title.x=element_blank(),
        axis.line.y=element_line(colour="black",size=1,linetype="solid"),
        axis.line.x=element_line(colour="black",size=1,linetype="solid"),
        panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),
        plot.margin=unit(c(5,-8,-2.5,2.5),"mm"))



grid.newpage();
pushViewport(viewport(layout = grid.layout(3, 3, widths = unit(c(4,4,2),"null"),
                                           heights=unit(c(2,7.5,0.5),"null"))));
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y);

print(dist2, vp = vplayout(2, 1));
print(dist1, vp = vplayout(2, 2));
print(hist2, vp = vplayout(1, 1));
print(hist1, vp = vplayout(1, 2));
print(hist0, vp = vplayout(2, 3));
grid.text(expression(Delta * Ultrasensitivity),vp = vplayout(3,1:2),x = unit(0.55, "npc"), 
          y = unit(0.9, "npc"),gp=gpar(fontsize=12, col="black"));

dev.copy2pdf(file=sprintf("%s/_dist/dist_hist_deltaF_deltaU_wc_01vs10.pdf", resultDir));
dev.off();

Here is the graph it got (but you need to manually to put the legend on):

enter image description here

Or a simpler one:

reverselog_trans <- function(base = exp(1)) {
  trans <- function(x) -log(x, base)
  inv <- function(x) base^(-x)
  trans_new(paste0("reverselog-", format(base)), trans, inv, 
            log_breaks(base = base), 
            domain = c(1e-100, Inf))
}

quartz();

hist1 <- ggplot(deltaF, aes(deltaFitness,fill=w_c)) + guides(fill=guide_legend(title=expression(omega[c]))) + geom_histogram(alpha = 0.5, aes(y=..density..),position = 'identity') + geom_density(alpha = 0.05, aes(colour=w_c)) + scale_x_continuous(name = expression(paste(Delta, " Fitness")),trans = "log10");
hist1 <- hist1 + scale_y_continuous(name = "Density", limits=c(0,1));
#hist1 <- hist1 + theme(panel.background=element_blank(),panel.border=element_blank(),axis.line.x=element_blank(),axis.line.y=element_line(colour="black",linetype="solid",size=1),axis.title.x=element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),plot.margin=unit(c(5,5,0,5),"mm"));
hist1 <- hist1 + theme_bw();
hist1 <- hist1 + theme(strip.background=element_blank(),panel.border=element_rect(colour = "black"),axis.title.x=element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),plot.margin=unit(c(5,5,0,5),"mm"));
hist1 <- hist1 + scale_color_discrete(name=expression(omega[c]));# + geom_vline(xintercept=0, colour="grey", size = 1);# + geom_hline(yintercept=0, colour="grey", size = 0.5);

hist2 <- ggplot(deltaU, aes(deltaUltrasensitivity,fill=w_c)) + geom_histogram(alpha = 0.5, aes(y=..density..),position = 'identity') + geom_density(alpha = 0.05, aes(colour=w_c))  + scale_x_continuous(name = expression(paste(Delta, " Ultrasensitivity")), limits=c(1e-7,1),trans = "log10",breaks=c(1e-01,1e-03,1e-05),labels=c("1e-01","1e-03","1e-05"));
hist2 <- hist2 + scale_y_continuous(name = "Density",limits=c(0,1)) ;#+ geom_vline(xintercept=0, colour="grey", size = 1);# + geom_hline(yintercept=0, colour="grey", size = 0.5);
#hist2 <- hist2 + theme(legend.position = "none", axis.title.x=element_blank(),panel.background=element_blank(),panel.border=element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),plot.margin=unit(c(0,5,0,-7.5),"mm"));
hist2 <- hist2 + theme_bw();
hist2 <- hist2 + theme(legend.position = "none", axis.title.x=element_blank(),strip.background=element_blank(),panel.border=element_rect(colour = "black"),panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),plot.margin=unit(c(0,5,0,-7.5),"mm"));
# + ggtitle("Positive part")

hist3 <- ggplot(deltaU, aes(-deltaUltrasensitivity,fill=w_c)) + geom_histogram(alpha = 0.5, aes(y=..density..),position = 'identity') + geom_density(alpha = 0.05, aes(colour=w_c)) + scale_x_continuous(name = expression(paste(Delta, " Ultrasensitivity")), limits=c(1,1e-7),trans = reverselog_trans(10),breaks=c(1e-01,1e-03,1e-05),labels=c("-1e-01","-1e-03","-1e-05"));
hist3 <- hist3 + scale_y_continuous(name = "Density", limits=c(0,1));# + geom_hline(yintercept=0, colour="black", size = 0.5);
#hist3 <- hist3 + theme(legend.position = "none",panel.background=element_blank(),axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.y=element_blank(),axis.line.y=element_line(colour="black",size=1,linetype="solid"),axis.title.x=element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),plot.margin=unit(c(0,-7.5,0,5),"mm"));
hist3 <- hist3 + theme_bw();
hist3 <- hist3 + theme(legend.position = "none",strip.background=element_blank(),panel.border=element_rect(colour = "black"),axis.text.y=element_blank(), axis.ticks.y=element_blank(), axis.title.y=element_blank(),axis.line.y=element_line(colour="black",size=1,linetype="solid"),axis.title.x=element_blank(),panel.grid.major=element_blank(),panel.grid.minor=element_blank(),plot.background=element_blank(),plot.margin=unit(c(0,-7.5,0,5),"mm"));
# + ggtitle("Negative part")

grid.newpage();
pushViewport(viewport(layout = grid.layout(4, 2, widths = unit(c(5,5),"null"),heights=unit(c(4.6,0.4,4.6,0.4),"null"))));
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y);
print(hist1, vp = vplayout(1, 1:2));  # key is to define vplayout
grid.text(expression(paste(Delta, " Fitness")),vp = vplayout(2,1:2),x = unit(0.5, "npc"), y = unit(0.9, "npc"),gp=gpar(fontsize=12, col="black"));
print(hist3, vp = vplayout(3, 1));
print(hist2, vp = vplayout(3, 2));
grid.text(expression(paste(Delta, " Ultrasensitivity")),vp = vplayout(4,1:2),x = unit(0.5, "npc"), y = unit(0.9, "npc"),gp=gpar(fontsize=12, col="black"));


dev.copy2pdf(file=sprintf("%s/deltaF_deltaU_wc_01vs10.pdf", resultDir));
dev.off();

Here is the graph I got:

enter image description here

LifeWorks
  • 396
  • 1
  • 4
  • 12
  • w_c which color is that i want to use your colors please? – aliocee Jan 22 '15 at 01:54
  • @aliocee I just used the ggplot. when you choose colour for two data, it will automatic generate such different colours. You can emulate this colour palette by referring: http://stackoverflow.com/questions/8197559/emulate-ggplot2-default-color-palette – LifeWorks Jan 23 '15 at 15:13
-1

Why suffer with ggplot2 solution? Your first plot was done with lattice histogram function, and this is where you should stay. Just apply logarithmic transformation directly within histogram function, use nint argument to specify the number of histogram bins, and type argument to choose between "count", or "density". I think that you got everything you need there, but maybe I'm missing some crucial detail of your question...

library(lattice)
histogram(log10(x), nint=50, type="count")

enter image description here

Geek On Acid
  • 6,330
  • 4
  • 44
  • 64
  • The problem is, the data should be peaking around 0, and by doing log10 you lose all the negative data. What I want is something like the first histogram, but with the axis on a log scale, both negative and positive (I am aware this may not make sense...) basically I want the axis on such a scale so you can see the outliers (the +500/-500) but also you can see the distribution around 0 a bit better. This makes sense in my head but it seems to be causing confusion so it's quite possible that what I'm saying doesn't make sense! – Jim Bo Jan 28 '13 at 10:59