I asked people how many years they have been smoking and then calculated the risk of death for groups of smoking duration. Let's assume this data:

df <- data.frame(years_smoke= c(1,2,2,3,3,3,4,5,6, 6,7, 10, 11, 12, 12, 14, 15),
risk_death= c(rep(.1, 8), rep(.3, 4), rep(.7, 5)))

Here the continuous variable years_smoke is split into three groups (1 to 5 years, 6 to 10 years and 11 to 15 years), and each group has a death risk value (.1 for those smoking 1 to 5 years, .3 for those smoking 6 to 10 years and .7 for those smoking 11 to 15 years).

I want to plot the continuous variable years_smoke as a histogram and colour the columns by the risk of each group, like a heatmap, where, for example, low risk of death is green and high risk of death is red. So far, the comments (and two deleted answers) have suggested something like this:

library(ggplot2)
ggplot(df, aes(years_smoke, fill= factor(risk_death))) + geom_histogram()

But this does not work as expected. If we change the data to

df <- data.frame(years_smoke= c(1,2,2,3,3,3,4,5,6, 6,7, 10, 11, 12, 12, 14, 15),
risk_death= c(rep(.1, 8), rep(.3, 4), rep(999, 5)))

we get exactly the same plot as before. In a heatmap, however, this should produce very different colours: all columns with risk .1 and .3 should have nearly the same green colour, and the risk group 999 should be very red. This question was marked as a duplicate, but the linked question does not give heatmap-like colours either, because it fills by a factor, so the colours do not depend on the actual value of a continuous variable.
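The reason both data sets give the same plot can be seen without ggplot2. A minimal sketch in base R (the names `risk_a`, `risk_b`, `codes_a` and `codes_b` are only illustrative): `factor()` keeps the order of the values but discards the distances between them, so a discrete fill scale only ever sees the level codes 1, 2 and 3.

```r
# Two risk vectors that differ only in the size of the largest value
risk_a <- c(rep(.1, 8), rep(.3, 4), rep(.7, 5))
risk_b <- c(rep(.1, 8), rep(.3, 4), rep(999, 5))

# A discrete fill scale assigns colours per factor level,
# i.e. by these integer codes, not by the numeric values
codes_a <- as.integer(factor(risk_a))
codes_b <- as.integer(factor(risk_b))

identical(codes_a, codes_b)  # TRUE: eight 1s, four 2s, five 3s in both cases
```

So with `fill = factor(risk_death)` the value 999 is just "level 3", exactly like .7.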

(data is made up)

  • I don't understand how you can plot a meaningful heat map from a single scalar variable. Your last plot doesn't make sense to me. – ziggystar Aug 29 '19 at 13:26
  • @ziggystar The last plot shows that the risk is increasing for higher x values. I added a legend for that. Of course a histogram usually does not look like that. I generated a uniform distribution because that is easy to do with rep(...). But the distribution is not relevant to the question. –  Aug 29 '19 at 13:29
  • Do you want to have a bar chart, where the height of the columns shows the frequency of the different risk levels? This would be a histogram with coloring of bars proportional to x-scale. – ziggystar Aug 29 '19 at 13:31
  • @ziggystar I need a histogram where the height of the columns reflects the count of a continuous variable x, and the histogram is supposed to be coloured like a heatmap depending on another continuous variable, risk. I added a legend. –  Aug 29 '19 at 13:37
  • Just add `y= ..count..` to your code and it will work. – pogibas Aug 29 '19 at 13:52
  • @PoGibas `ggplot(data = df, aes(x = x, y= ..count.., fill= risk)) + geom_histogram(binwidth= 1)` produces the first plot of my question and so does `ggplot(data = df, aes(x = x, fill= risk)) + geom_histogram(binwidth= 1, aes(y= ..count..))` –  Aug 29 '19 at 13:55
  • Possible duplicate of [Coloring a geom\_histogram by gradient](https://stackoverflow.com/questions/43795211/coloring-a-geom-histogram-by-gradient) – Mojoesque Aug 29 '19 at 14:26
  • @machine, what if similar values of `x` (i.e., values in the same bin) have different values of `risk`? How would you fill the columns of the histogram in this case? If there is no such case, then @ziggystar's comment should work. In this case, `risk` is just a function of `x`, and you can map the `fill` color to `x` (as in the similar answer linked in @Mojoesque's comment above). – kikoralston Aug 29 '19 at 17:03
  • @machine, you can get the result you want using `ggplot(data = df, aes(x = x, y= ..count.., fill=factor(risk))) + geom_histogram(binwidth= 1)` – kikoralston Aug 29 '19 at 17:11
  • @kikoralston no, there is no such case. Anyway, the ...factor(risk)... solution does not work as a heatmap, since changing the risk from c(1,2,3,4) to c(1,2,3,1000), for example, results in the same plot. But in a heatmap one would expect the bins with risk 1000 to be very dark blue and all the others a pretty similar light blue. This does not work with factor(risk), because using a factor does not give colors that gradually depend on the continuous variable. –  Aug 29 '19 at 20:04
  • @mojoesque unfortunately, the link does not answer my question. See the edit for why this is the case. –  Aug 29 '19 at 20:19

1 Answer

In this case it might be easiest to just build your own histogram. You mentioned that there will be no cases where the same number of years of smoking leads to different risks, so something like this should do the trick:

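library(tidyverse)

# Count the observations per value of years_smoke and keep one risk value for each
df <- data.frame(years_smoke = c(1,2,2,3,3,3,4,5,6, 6,7, 10, 11, 12, 12, 14, 15),
                 risk_death = c(rep(.1, 8), rep(.3, 4), rep(.7, 5))) %>%
  group_by(years_smoke) %>%
  summarize(n = n(), risk_death = mean(risk_death))

# Bar height is the count; fill is mapped to the numeric risk (continuous colour scale)
df %>%
  ggplot(aes(x = years_smoke, y = n, fill = risk_death)) +
    geom_col()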

(Depending on what your risk value actually is, a summary function other than mean might be more appropriate, but the mean works for your example data.)
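Whether the choice of summary function matters at all can be checked directly. A small sketch in base R, assuming the example data from the question (`df_raw` and `n_risks` are names made up for this illustration): it counts how many distinct risk values occur for each value of years_smoke.

```r
df_raw <- data.frame(
  years_smoke = c(1, 2, 2, 3, 3, 3, 4, 5, 6, 6, 7, 10, 11, 12, 12, 14, 15),
  risk_death  = c(rep(.1, 8), rep(.3, 4), rep(.7, 5))
)

# Number of distinct risk values for each value of years_smoke
n_risks <- tapply(df_raw$risk_death, df_raw$years_smoke,
                  function(r) length(unique(r)))

# TRUE means every years_smoke value has exactly one risk, so mean(),
# unique(), min() and max() agree (up to floating-point rounding)
all(n_risks == 1)
```

If this returned FALSE for real data, you would have to decide which summary (mean, max, ...) the fill colour should represent.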

If you now change, for example, the risk of the last 5 cases from 0.7 to 10, you get your desired behaviour: the columns with risk .1 and .3 get nearly the same colour, while the columns with risk 10 clearly stand out.
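To get the green-to-red colouring asked for in the question, one option is to add `scale_fill_gradient()` from ggplot2. The following is only a sketch: it uses base R `aggregate()` instead of the dplyr pipeline above, the names `raw`, `counts` and `p` are made up for this example, and the colour choice is just one possibility.

```r
library(ggplot2)

# Example data with the risk of the last 5 cases changed from 0.7 to 10
raw <- data.frame(
  years_smoke = c(1, 2, 2, 3, 3, 3, 4, 5, 6, 6, 7, 10, 11, 12, 12, 14, 15),
  risk_death  = c(rep(.1, 8), rep(.3, 4), rep(10, 5))
)

# Count rows per years_smoke; risk_death is constant within each years_smoke
# value, so grouping by both columns keeps it without a summary function
counts <- aggregate(n ~ years_smoke + risk_death,
                    data = transform(raw, n = 1), FUN = sum)

p <- ggplot(counts, aes(x = years_smoke, y = n, fill = risk_death)) +
  geom_col() +
  # Continuous scale: colours are interpolated from the numeric risk value
  scale_fill_gradient(low = "green", high = "red")
p
```

Because the scale is continuous, the distance between risk values now drives the colour, which is what the factor-based fill could not do.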

Mojoesque
  • This is great! Thanks so much! Regarding "another summary function than mean might be appropriate": yes, it is. In fact, since there is just one risk value for every group of years_smoke, using mean() or unique() gives the same output for my data. Thanks! –  Aug 31 '19 at 07:08