I am looking for a way to plot the mean of one variable across bins of the log2 of another variable (which takes both positive and negative values), using the more advanced functions in ggplot2. I suspect I am overcomplicating this and that ggplot2 already has options for it, but I cannot get it right, so before going back to basics I thought I would try to learn how to apply these functions here.

library(ggplot2)

value <- rnorm(1000, 0, 20)
dist <- c(rep(0, 15), sample(1:490), sample(-1:-495))
data <- data.frame(value = value, dist = dist)

data$log=log2(abs(data$dist)+1)
# re-label the x-axis:
data$Labels=2^(abs(data$log))-1

data$bins=cut(data$log, breaks=10)
# Try to recover the negative log after transformation
data$sign=ifelse(data$dist==0, 0, ifelse(data$dist>0, "+", "-"))

# find the average expression of value per each bin
data=with(data, aggregate(data$value, by = list(bins, sign), FUN = function(x) c(mn = mean(x), n = length(x))))
data= as.data.frame(as.list(data))
names(data)=c("bins", "sign", "mean", "length")

# I am doing this in a very contorted way to try to achieve what I would like which is something like this:

bin_num = do.call("rbind", lapply(strsplit(sapply(as.character(data$bins), function(x) substr(x, 2, nchar(x)-1)), ","), as.numeric))
data$bin_num=bin_num[,1]
data$bin_num=ifelse(data$sign==0, 0, ifelse(data$sign=="-", -data$bin_num, data$bin_num))
data = data[order(data$bin_num),]

data <- transform(data, x2 = factor(paste(sign, bins)))
data <- transform(data, x2 = reorder(x2, rank(bin_num)))

# Line plot to show the distribution of the means across the bins of log2 of x:
ggplot(data, aes(y = mean, x = bin_num, group=1)) +  geom_point() + geom_line()

# Then I am trying to re-label the logarithmic transformations here by adding labels, but of course it is not working:

ggplot(data, aes(y = mean, x = bin_num, group=1)) +  geom_point() + geom_line() + scale_x_discrete(labels=data$dist, breaks=data$bin_num)

I see that ggplot2 has functionalities to directly compute the mean so I maybe would not even need the previous commands. I tried:

ggplot(data, aes(x = bins, y = mean)) + stat_summary(fun.y = "mean") +     geom_line() + scale_x_continuous(breaks = labels)

But of course it doesn't work. I also saw that ggplot2 has functions that help with logarithmic labelling automatically, instead of what I did here, but I don't see how to use them when the values to be logged are negative. There is a very nice function from another question here that converts between the two values, but I don't see how it is useful at this stage. Thanks very much for any suggestions on how to go about this, really appreciated!

user971102
  • Can you share a picture/drawing of what you are trying to achieve? – David Nov 10 '15 at 10:33
  • Hi David, I added a try which roughly does what I am trying to achieve, but it is very contorted... I hope there is a better way... – user971102 Nov 10 '15 at 13:30
  • TBH, I am absolutely lost and have no idea of what you are trying to achieve... So on the x-axis you want to have the bin number, the y-axis depicts the mean value of that bin. Say we see a point at (-10, 3) that means that in bin -10, the mean value of the variable is 3?! What about the bin size? And what with the log-transformation? – David Nov 10 '15 at 14:20
  • Thank you for your help and sorry for not being clear. Yes, point (-10, 3) should show as (2^(-10), 3). I would like to 1) transform my x variable to log scale, 2) split my log x variable into a number of bins, 3) find the mean value of y for each bin, 4) plot with y = (the mean value of y for each bin of log x) and x = (the bins, but labelled with the original non-logarithmic values)... I thought these were fairly normal steps for binning when the x variable is very large and needs to be transformed to a logarithmic scale to better fit the data, but maybe this is not a usual procedure? – user971102 Nov 10 '15 at 14:48
  • Ok, I am trying to go through your stuff now. In the meantime, can you correct your example, neither `x`, nor `y` is found, I assumed it to be `value` and `dist` – David Nov 10 '15 at 15:10
  • Thank you David...just corrected that... – user971102 Nov 10 '15 at 15:32
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/94732/discussion-between-david-and-user2183097). – David Nov 10 '15 at 16:14

1 Answer

First version of an answer, using data.table for speed and readability.

The code below reproduces the result from the question with shorter, faster code:

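library(data.table)
library(ggplot2)

# Return the lower bound of the cut() interval that each element of x falls in
lower.bound <- function(x, n) {
  intervals <- as.character(cut(x, n))
  # interval labels look like "(a,b]"; keep the text between "(" and ","
  tmp <- substr(x = intervals, start = 2, stop = regexpr(",", intervals) - 1)
  return(as.numeric(tmp))
}

nbin <- 10
set.seed(123)
dat <- data.table(value = rnorm(1000, 0, 20),
                  dist = c(rep(0, 15), sample(1:490), sample(-1:-495)))

dat[, log := log2(abs(dist) + 1)]
# back-transform as in the question; this equals abs(dist)
dat[, labels := 2^(abs(log)) - 1]
dat[, sign := ifelse(dist == 0, 
                     0,
                     ifelse(dist > 0, "+", "-"))]

# signed lower bound of the log2 bin: 0 for dist == 0, negated for negative dist
dat[, bin := ifelse(sign == 0, 
                    0,
                    ifelse(sign == "+", 
                           lower.bound(log, nbin),
                           -lower.bound(log, nbin)))]

# per bin: mean of value, number of observations, and mean original dist
sumdat <- dat[, .(mvalue = mean(value),
                  nvalue = .N,
                  ylab = mean(dist)), 
                 by = .(bin, sign)][order(bin)]

# ylab holds the mean dist of each bin, so the x-axis is in the original units
ggplot(sumdat, aes(x = ylab, y = mvalue)) + geom_line()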
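
As a possible variation on the code above, here is a minimal sketch that keeps the x-axis on a log2 scale but labels the ticks with the original distances. It is not part of the original answer. The helpers `signed_log2` and `inv_signed_log2`, the choice of 20 bins, and the tick positions are all assumptions made for illustration. It uses only base R for the binning, and the plot is drawn only if ggplot2 is installed.

```r
# Hypothetical helpers (not from the answer above): a sign-preserving log2
# transform and its inverse, so negative distances keep their sign.
signed_log2 <- function(x) sign(x) * log2(abs(x) + 1)
inv_signed_log2 <- function(z) sign(z) * (2^abs(z) - 1)

set.seed(123)
dat <- data.frame(value = rnorm(1000, 0, 20),
                  dist  = c(rep(0, 15), sample(1:490), sample(-1:-495)))

# Bin on the signed log2 scale; 20 equal-width bins is an arbitrary choice.
dat$slog <- signed_log2(dat$dist)
dat$bin  <- cut(dat$slog, breaks = 20)

# Mean of value and mean signed-log position within each non-empty bin.
sumdat <- aggregate(cbind(value, slog) ~ bin, data = dat, FUN = mean)

# Tick positions on the signed log2 scale, labelled in original dist units.
tick_pos <- seq(-8, 8, by = 2)
tick_lab <- inv_signed_log2(tick_pos)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(sumdat, aes(x = slog, y = value)) +
    geom_point() + geom_line() +
    scale_x_continuous(breaks = tick_pos, labels = tick_lab) +
    labs(x = "dist (signed log2 scale)", y = "mean value")
  print(p)
}
```

Binning on the signed scale avoids tracking the sign in a separate column, and because the transform is invertible, any tick position can be converted back to the original units for labelling.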
David
  • See here for further discussion: http://chat.stackoverflow.com/rooms/94732/discussion-between-david-and-user2183097 I will update the final answer afterwards – David Nov 10 '15 at 16:22
  • Thank you David... I was looking for a way to label the x axis with the original non-logarithmic values so that the x-axis range is range(dat$dist) [1] -495 490, is this possible? – user971102 Nov 10 '15 at 16:29
  • Thanks to David, this works perfectly: sumdat <- dat[, .(mvalue = mean(value), nvalue = .N, ylabel = mean(dist)), by = .(bin, sign)][order(bin)] ggplot(sumdat, aes(x = ylabel, y = mvalue)) + geom_line() – user971102 Nov 10 '15 at 17:02