
I have a data frame with one column per attribute and a column for the class label. I am trying to create a heatmap/matrix plot of all the attributes, with the data points grouped by their class label.

If I turn my dataframe into a numeric matrix, I can use the heatmap function to create a heatmap:

q3 <- read.arff("diabetes.arff")  

q3_m <- as.matrix(q3[,1:8])  

heatmap(q3_m, Colv=NA, Rowv=NA) 

However, I can't figure out how to order these by the class variable, as I had to remove it from the matrix because it isn't numeric.

If I transform the data into the long format, I can also make the following heatmap using ggplot:

q3_long <- pivot_longer(q3, preg:age, names_to = "Attribute", 
    values_to = "Value")

ggplot(data = q3_long, mapping = aes(x = Attribute, y=class, fill = Value)) +
    geom_raster() +
    xlab(label = "Attribute")

However, this collapses all the cases of a given class into a single row of tiles rather than showing every case as a separate row with its own fill.

How can I combine these approaches to get a heatmap that clusters the cases by class?

(Apologies in advance: I attempted to include images here, but I just joined Stack Overflow and therefore don't have the 10 reputation points needed to include images.)

Thanks for your help.

Edit: here is a sample of the data. It is also publicly available - the diabetes.arff dataset is automatically downloaded with Weka installation (https://waikato.github.io/weka-wiki/downloading_weka/).

structure(list(preg = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 
10, 1, 5, 7, 0, 7, 1, 1), plas = c(148, 85, 183, 89, 137, 116, 
78, 115, 197, 125, 110, 168, 139, 189, 166, 100, 118, 107, 103, 
115), pres = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 
80, 60, 72, 0, 84, 74, 30, 70), skin = c(35, 29, 0, 23, 35, 0, 
32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30), insu = c(0, 
0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 
83, 96), mass = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 
30.5, 0, 37.6, 38, 27.1, 30.1, 25.8, 30, 45.8, 29.6, 43.3, 34.6
), pedi = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 
0.134, 0.158, 0.232, 0.191, 0.537, 1.441, 0.398, 0.587, 0.484, 
0.551, 0.254, 0.183, 0.529), age = c(50, 31, 32, 21, 33, 30, 
26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33, 32), class = structure(c(2L, 
1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 
2L, 1L, 2L), .Label = c("tested_negative", "tested_positive"), class = "factor")), row.names = c(NA, 
20L), class = "data.frame")
SB_neuro
  • Welcome to SO! To help us to help you could you please make your issue reproducible by sharing a sample of your **data**? See [how to make a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Simply type `dput(NAME_OF_DATASET)` into the console and copy & paste the output starting with `structure(....` into your post. If your dataset has a lot of observations you could do `dput(head(NAME_OF_DATASET, 20))` for the first twenty rows of your data. – stefan Mar 08 '21 at 16:31
  • Done! Thanks for your help. – SB_neuro Mar 08 '21 at 20:12

1 Answer


Maybe this is what you are looking for. To get a heatmap with one row per case, you could add an id variable to your dataset, map it on y, and use faceting to group the cases by class:


library(tidyr)
library(ggplot2)
library(dplyr)

# Give each case its own id so that every row of q3 becomes one row of tiles
q3_long <- q3 %>% 
  mutate(id = factor(row_number())) %>% 
  pivot_longer(-c(class, id), names_to = "Attribute", values_to = "Value")

# One panel per class; scales = "free_y" keeps only the ids present in each panel
ggplot(data = q3_long, mapping = aes(x = Attribute, y = id, fill = Value)) +
  geom_raster() +
  xlab(label = "Attribute") +
  facet_wrap(~class, scales = "free_y")
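
If you would rather stay with the base `heatmap()` function, a rough sketch of the same idea is to sort the rows by class before building the matrix and to mark the groups with `RowSideColors`. The small `q3` below is only a stand-in built from a few rows and columns of the sample data in the question, and the colour names and the choice of `scale = "column"` (scaling per attribute rather than per case) are my assumptions, not something taken from your original call:

```r
# Stand-in for q3: six rows and three attributes taken from the sample data
q3 <- data.frame(
  preg  = c(6, 1, 8, 1, 0, 5),
  plas  = c(148, 85, 183, 89, 137, 116),
  age   = c(50, 31, 32, 21, 33, 30),
  class = factor(c("tested_positive", "tested_negative", "tested_positive",
                   "tested_negative", "tested_positive", "tested_negative"))
)

# Sort the cases so that rows of the same class sit next to each other
q3_sorted <- q3[order(q3$class), ]

# Keep only the numeric attributes for the matrix
num_cols <- setdiff(names(q3_sorted), "class")
q3_m <- as.matrix(q3_sorted[, num_cols])

# One colour per class (arbitrary choice), used as a side bar to mark the groups
class_cols <- c(tested_negative = "steelblue", tested_positive = "firebrick")
side_cols <- unname(class_cols[as.character(q3_sorted$class)])

# Rowv = NA and Colv = NA keep the row and column order as given
heatmap(q3_m, Rowv = NA, Colv = NA,
        scale = "column",          # assumption: scale each attribute, not each case
        RowSideColors = side_cols)
```

With the full dataset the same steps should apply to `q3[, 1:8]`; only the stand-in data frame would be replaced by the result of `read.arff()`.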

stefan