How to create a customized heatmap in R?

Question

I saw this photo of a heatmap on the internet, but no code was given.

I'd like to make the same heatmap for my data. My data looks like this.

I want the variable mean_temp on the x-axis and total_count on the y-axis, with the boxes in the heatmap filled according to the values of lwd_duration. Here is a reproducible example of my data:

df <- structure(
  list(
    total_count = c(
      10L,
      0L,
      15L,
      0L,
      20L,
      0L,
      0L,
      50L,
      0L,
      6L,
      1L,
      10L,
      7L,
      0L,
      0L,
      29L,
      0L,
      2L,
      11L,
      3L,
      0L,
      12L,
      0L,
      30L,
      0L,
      0L,
      29L,
      44L,
      10L,
      5L,
      2L,
      145L,
      0L,
      70L
    ),
    mean_temp = c(
      18.87,
      18.87,
      18.87,
      18.87,
      18.87,
      18.87,
      18.87,
      18.87,
      18.87,
      21.85,
      21.85,
      21.85,
      21.85,
      21.85,
      21.85,
      21.85,
      21.85,
      21.85,
      17.11,
      17.11,
      17.11,
      17.11,
      17.11,
      17.11,
      17.11,
      17.11,
      18.82,
      18.82,
      18.82,
      18.82,
      18.82,
      18.82,
      18.82,
      18.82
    ),
    lwd_duration = c(
      64.32,
      64.32,
      64.32,
      64.32,
      64.32,
      64.32,
      64.32,
      64.32,
      64.32,
      104.2,
      104.2,
      104.2,
      104.2,
      104.2,
      104.2,
      104.2,
      104.2,
      104.2,
      53.53,
      53.53,
      53.53,
      53.53,
      53.53,
      53.53,
      53.53,
      53.53,
      60.43,
      60.43,
      60.43,
      60.43,
      60.43,
      60.43,
      60.43,
      60.43
    )
  ),
  row.names = c(NA,-34L),
  class = c("tbl_df", "tbl", "data.frame"),
  na.action = structure(
    c(
      `4` = 4L,
      `5` = 5L,
      `6` = 6L,
      `7` = 7L,
      `8` = 8L,
      `9` = 9L,
      `78` = 78L,
      `87` = 87L,
      `96` = 96L,
      `105` = 105L,
      `114` = 114L,
      `123` = 123L,
      `132` = 132L,
      `141` = 141L,
      `150` = 150L,
      `159` = 159L,
      `168` = 168L,
      `177` = 177L,
      `186` = 186L,
      `849` = 849L,
      `850` = 850L,
      `851` = 851L,
      `852` = 852L,
      `891` = 891L,
      `892` = 892L,
      `893` = 893L,
      `894` = 894L,
      `921` = 921L,
      `922` = 922L,
      `923` = 923L,
      `924` = 924L,
      `937` = 937L,
      `938` = 938L,
      `939` = 939L,
      `940` = 940L,
      `969` = 969L,
      `970` = 970L,
      `971` = 971L,
      `972` = 972L,
      `985` = 985L,
      `986` = 986L,
      `987` = 987L,
      `988` = 988L,
      `1017` = 1017L,
      `1018` = 1018L,
      `1019` = 1019L,
      `1020` = 1020L,
      `1033` = 1033L,
      `1034` = 1034L,
      `1035` = 1035L,
      `1036` = 1036L
    ),
    class = "omit"
  )
)

Could someone help with the code to create the above heatmap for my data? Thank you!

I don't think your data is suitable for a heatmap. For example, you say you want "total count" on the y-axis, but in total count you have 20 observations with 0 and just 1 or 2 observations for every other value. – Karolis Koncevičius Apr 08 '23 at 11:44
@KarolisKoncevičius I have edited the question to reduce the number of zeros in the data. I hope the data set is fine now. My actual data set is huge. – Ahsk Apr 08 '23 at 12:07

score 1 · Accepted Answer · answered Apr 08 '23 at 12:41

This is very similar to "Create heatmap with values from matrix in ggplot2", but the original data is different enough that I don't think it's a duplicate. Here's a modification of that answer to fit your dataset:

library(tidyverse)  # attaches ggplot2 and dplyr

## `df` is the data frame created by the structure() call in the question

## convert to tibble and bin both axis variables into factors
dat2 <-
  df %>%
  as_tibble() %>%
  mutate(
    mean_temp = cut_interval(mean_temp, n = 10),
    total_count = cut_interval(total_count, n = 10)
  ) %>%
  ## one row per cell: average lwd_duration within each pair of bins
  group_by(mean_temp, total_count) %>%
  summarize(lwd_duration = mean(lwd_duration), .groups = "drop")

ggplot(dat2, aes(mean_temp, total_count)) +
  geom_tile(aes(fill = lwd_duration)) +
  geom_text(aes(label = round(lwd_duration, 1))) +
  scale_fill_gradient(low = "white", high = "red")

Created on 2023-04-08 with reprex v2.0.2
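As a rough illustration of the binning step in the code above: `cut_interval()` from ggplot2 splits the range of a numeric vector into `n` equal-width intervals and returns a factor. The sketch below applies it to the four distinct `mean_temp` values from the question's sample data. The names `temps` and `bins` are only for illustration.

```r
library(ggplot2)

# The four distinct mean_temp values in the question's sample data
temps <- c(17.11, 18.82, 18.87, 21.85)

# Split the range [17.11, 21.85] into 10 equal-width intervals
bins <- cut_interval(temps, n = 10)

# Which interval each value falls into; empty intervals remain as factor levels
table(bins)

# Values that are close together can share a bin, and therefore a heatmap column
bins[2] == bins[3]
```

With `n = 10`, 18.82 and 18.87 should fall into the same interval, so the sample data would occupy fewer columns than it has distinct temperatures.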

user2554330

Thanks very much. I ran your code on my data and this is exactly what I wanted. I was wondering whether we can get more values of `lwd_duration`, though? Because the mean of `lwd_duration` is taken, far fewer distinct `lwd_duration` values appear in the cells. The range of `lwd_duration` in my data is 1.62 to 184.48, but the values in the graph only go up to 125. The other two variables, `mean_temp` and `total_count`, are fine and fully cover the range of my data. I am editing the question to include the figure I got from running your code on my data. – Ahsk Apr 08 '23 at 13:18
@Ahsk: you probably have multiple observations in each cell. Suppose for one particular range of `mean_temp` combined with one range of `total_count` you have values 5, 15, 100. What do you want to show in that cell? I showed the mean (i.e. 40 for these numbers), but you could show the min, or the max, or use any other single number to set the color, and then show more than one number in the string that's being displayed. – user2554330 Apr 08 '23 at 15:30
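A minimal sketch of the idea in the comment above, assuming toy data in which one cell holds the values 5, 15 and 100: the mean sets the fill color, while the label also shows the minimum and maximum. The data frame `toy`, the column names `lwd_min`, `lwd_mean`, `lwd_max` and the label layout are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical): the "low"/"few" cell holds 5, 15 and 100
toy <- data.frame(
  mean_temp    = c("low", "low", "low", "high"),
  total_count  = c("few", "few", "few", "many"),
  lwd_duration = c(5, 15, 100, 60)
)

# One row per cell, keeping several summaries instead of only the mean
cells <- toy %>%
  group_by(mean_temp, total_count) %>%
  summarize(
    lwd_min  = min(lwd_duration),
    lwd_mean = mean(lwd_duration),
    lwd_max  = max(lwd_duration),
    .groups  = "drop"
  ) %>%
  # Label: the mean on the first line, the min-max range underneath
  mutate(label = sprintf("%.1f\n(%.1f to %.1f)", lwd_mean, lwd_min, lwd_max))

# The fill uses a single number per cell; the text can show more than one
p <- ggplot(cells, aes(mean_temp, total_count)) +
  geom_tile(aes(fill = lwd_mean)) +
  geom_text(aes(label = label)) +
  scale_fill_gradient(low = "white", high = "red")
```

Swapping `lwd_mean` for `lwd_max` in `aes(fill = ...)` would color each cell by its largest value instead, which may be closer to what is wanted when the full range of the data should be visible.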
Do I really need this step `group_by(mean_temp, total_count)` ? If I group_by location and week, then I seem to get more values in cells. Thanks – Ahsk Apr 08 '23 at 15:43
You don't have `location` and `week` as variables in the sample dataset, so I have no idea. – user2554330 Apr 08 '23 at 15:53
Yes, I just posted a simple example. In my experiment, plants were put out for a week and then taken back to a glasshouse to count disease lesions on each plant. In the above example, my response variable is lesions per plant. But if I change the response variable to the number of infected leaves per week or per treatment, then of course I have a wide range of values. I just wanted to confirm whether `group_by(mean_temp, total_count)` is absolutely required or not. – Ahsk Apr 08 '23 at 15:57
You need to group by whatever determines the cells. You can use any variable you like to determine them. You need to use the same variables to replace `aes(mean_temp, total_count)` in the `ggplot` call, or the labels on the axes won't make any sense. – user2554330 Apr 08 '23 at 16:00
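A small sketch of the point made in the comment above, under the assumption that the real data has `location` and `week` columns. They are not in the sample dataset, so the data frame `obs` and all of its values are made up. The same two variables appear in both `group_by()` and `aes()`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data: `location` and `week` are not in the sample dataset
obs <- data.frame(
  location     = c("A", "A", "B", "B", "A", "B"),
  week         = factor(c(1, 1, 1, 2, 2, 2)),
  lwd_duration = c(10, 20, 30, 40, 50, 60)
)

# The grouping variables define the cells ...
cells <- obs %>%
  group_by(location, week) %>%
  summarize(lwd_duration = mean(lwd_duration), .groups = "drop")

# ... so the same variables go on the axes
p <- ggplot(cells, aes(location, week)) +
  geom_tile(aes(fill = lwd_duration)) +
  geom_text(aes(label = round(lwd_duration, 1))) +
  scale_fill_gradient(low = "white", high = "red")
```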
Apologies. I know the question is closed, but I am just confirming. I managed to get more cells by changing `n` from `10` to `60` for `total_count` in your mutate step: `mutate(mean_temp = cut_interval(mean_temp, n = 10), total_count = cut_interval(total_count, n = 60))`. This is not going to introduce any bias or lead to misleading results, is that right? – Ahsk Apr 09 '23 at 13:59
@Ahsk, it won't lead to bias, but it will increase the variance. Presumably having 6 times as many cells will mean they have about 6 times fewer values in each; and the variance will be about 6 times higher. – user2554330 Apr 09 '23 at 19:49
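The variance claim in the comment above can be checked with a quick simulation. This is only a sketch under simplifying assumptions: observations within a cell are independent and normally distributed, and the cell sizes (60 versus 10) and the mean and standard deviation are arbitrary choices, not values from the question. The helper `cell_mean_var()` is hypothetical.

```r
set.seed(1)

# Variance of a cell mean, estimated by simulating many cells of a given size
cell_mean_var <- function(n_per_cell, n_sim = 5000) {
  cell_means <- replicate(n_sim, mean(rnorm(n_per_cell, mean = 60, sd = 20)))
  var(cell_means)
}

v_coarse <- cell_mean_var(60)  # few, well-populated cells
v_fine   <- cell_mean_var(10)  # 6 times as many cells, 6 times fewer values each

# Theory: var(mean) = sd^2 / n, so the ratio should be close to 6
v_fine / v_coarse
```

In both cases the cell means are centered on the same value, which is the sense in which finer binning adds noise without adding bias.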
That's fine, thanks. As long as it doesn't introduce any bias, I will have more values on y-axis + more cells to compensate for fewer values in each cell. – Ahsk Apr 10 '23 at 10:47
