13

I am trying to create a heatmap using heatmap.2 (from the gplots package). My data has lots of NaN values in it, and what I would like to do is the following: every time there is a NaN value, color the cell light grey (or some other neutral color, maybe white), and give all of the other values (which are log2 expression) a standard green/yellow/red coloring scheme. Here is the code that I have been using:

heatmap.2(as.matrix(foo2[rowSums (abs(foo2)) != 0,]),
          col = redgreen,
          margins = c(12, 22),
          trace = "none", 
          xlab = "Comparison",
          lhei = c(2, 8),
          scale = c("none"),
          symbreaks = min(foo2 = 0, na.rm = TRUE),
          na.color = "blue",
          cexRow = 0.5,
          cexCol = .7,
          main = "DE geness",
          Colv = F)

This works well when there are no NaN values, but when the data has NaN values, I get an error which says:

Error in hclustfun(distfun(x)) : 
  NA/NaN/Inf in foreign function call (arg 11)

Essentially, I would like to have it ignore the NaNs in the data. I am not sure how to handle this. Any help would be greatly appreciated.

zx8754
user1352084
  • 1
    Just convert the NA's to a number outside the range of the others and specify breaks and a palette that match your needs. – IRTFM Dec 31 '13 at 15:55
  • Given the nonreproducibility answer below, first make sure your "NaN" are truly `NaN` and not strings or some other dreck. Then verify that each function you've called inside your `heatmap.2` call returns the class of data you expect. For example, `symbreaks=min(foo2 = 0, na.rm=TRUE)` is a strange way to check whether there are any `0` values in `foo2` . – Carl Witthoft Dec 31 '13 at 16:46
  • 3
    @BondedDust, converting NA into values (even in the dist matrix) will affect clustering. – Zhilong Jia Apr 17 '15 at 23:18

5 Answers

13

TL;DR: The issue is most likely caused by the distance function passed as distfun, not by heatmap.2 itself. The default, dist, computes the distances between the rows of your data, and if any of those distances is NA, the clustering function cannot handle it.


The longer version:

I have recently experienced the same issue as the OP, and had to dig in quite a bit to understand why the problem wasn't reproducible for others.

The essential issue is as follows: by default, heatmap.2 uses hclust as its hclustfun argument and dist as its distfun argument. The error message clearly states that it is hclustfun (which in this case defaults to hclust) that does not like the NAs.
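The failure can be checked without gplots at all, since the error comes from applying the clustering function to the distance matrix. A minimal sketch, assuming a small made-up matrix `x` (not the OP's data) in which rows 1 and 2 have no column where both are observed:

```r
# Made-up data: rows 1 and 2 share no column in which both values are present.
x <- rbind(c(1, 2, NA),
           c(NA, NA, 3),
           c(4, 5, 6))

d <- dist(x)                 # same default distance function heatmap.2 uses
has_na <- anyNA(d)           # TRUE: the distance between rows 1 and 2 is NA

# hclust() refuses a distance matrix containing NA; try() captures the error
# instead of stopping the script.
res <- try(hclust(d), silent = TRUE)
failed <- inherits(res, "try-error")
```

If `has_na` is TRUE for your own matrix, the heatmap.2 call can be expected to fail in the same way, whatever na.color is set to.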

The next bit of information is this: even though the data matrix includes NAs the results of dist (which are passed in to hclust) might be free from NAs, which is the case for @kdauria's answer. See below:

> library(gplots)
> mat = matrix( rnorm(25), 5, 5)
> mat[c(1,6,8,11,15,20,22,24)] = NaN
> 
> heatmap.2( mat,
+            col = colorpanel(100,"red","yellow","green"),
+            margins = c(12, 22),
+            trace = "none", 
+            xlab = "Comparison",
+            lhei = c(2, 8),
+            scale = c("none"),
+            symbreaks = min(mat, na.rm=TRUE),
+            na.color="blue",
+            cexRow = 0.5, cexCol = 0.7,
+            main = "DE genes", 
+            dendrogram = "row", 
+            Colv = FALSE )
> ?dist
> mat
           [,1]       [,2]        [,3]        [,4]       [,5]
[1,]        NaN        NaN         NaN -1.10103187 -1.4396185
[2,] -0.8821449  1.4891180  0.41956063 -0.06442867        NaN
[3,] -2.5912928        NaN -0.56603029 -0.55177559 -2.0313602
[4,]  0.8348197  0.2199583  0.06318663  1.59697764        NaN
[5,] -0.2632078 -1.2193110         NaN         NaN  0.8618543
> dist(mat)
         1        2        3        4
2 2.317915                           
3 1.276559 2.623637                  
4 6.032933 3.050821 5.283828         
5 5.146250 4.392798 5.871684 2.862324

The random valued matrix does not reproduce the problem because it avoids the issue at hand. Which brings me to the question: what does it take to get NAs from dist?


My data had some outlying large values which I thought to be the reason, however I only managed to reproduce the problem by adding a row of NAs:

> mat = matrix(rnorm(49), 7, 7)
> mat[c(3,17,28, 41)] = mat[c(3,17,28, 41)] * 100000
> mat
              [,1]        [,2]          [,3]          [,4]        [,5]          [,6]       [,7]
[1,] -6.175928e-01  1.68691561 -1.233250e+00 -7.355322e-01 -0.37392178  3.559804e-01  1.7536137
[2,]  6.680429e-01  0.90590237 -1.375424e+00  5.842512e-01 -0.09376548 -3.556098e-01 -1.2926535
[3,] -3.739372e+04 -1.74534887 -2.241643e+05 -2.209226e-01 -0.86769435 -4.590908e-01  1.6306854
[4,] -1.283405e+00  0.20698245  3.635557e-01  3.673208e-01 -0.12339047  1.119922e+00  0.4301094
[5,] -5.430687e-02 -0.75219479  2.609126e+00 -1.340564e-01  0.54016622  2.885021e-01  0.9237946
[6,] -8.395116e-01  0.03675002  2.455545e+00  4.432025e-02 -0.86194910  1.302758e+05  0.6062505
[7,]  1.817036e-01 -1.46137388 -1.853179e+00 -2.177306e+03  2.36763806 -2.273134e+00  1.2440088
> dist(mat)
             1            2            3            4            5            6
2 3.726858e+00                                                                 
3 2.272605e+05 2.272606e+05                                                    
4 2.966078e+00 3.537475e+00 2.272620e+05                                       
5 4.787577e+00 5.039154e+00 2.272644e+05 3.016614e+00                          
6 1.302754e+05 1.302762e+05 2.619559e+05 1.302747e+05 1.302755e+05             
7 2.176576e+03 2.177895e+03 2.272705e+05 2.177679e+03 2.177179e+03 1.302963e+05
> mat = rbind(mat[1:4, ], rep(NA,7), mat[5:6, ])
> mat
              [,1]        [,2]          [,3]        [,4]        [,5]          [,6]       [,7]
[1,] -6.175928e-01  1.68691561 -1.233250e+00 -0.73553223 -0.37392178  3.559804e-01  1.7536137
[2,]  6.680429e-01  0.90590237 -1.375424e+00  0.58425125 -0.09376548 -3.556098e-01 -1.2926535
[3,] -3.739372e+04 -1.74534887 -2.241643e+05 -0.22092261 -0.86769435 -4.590908e-01  1.6306854
[4,] -1.283405e+00  0.20698245  3.635557e-01  0.36732078 -0.12339047  1.119922e+00  0.4301094
[5,]            NA          NA            NA          NA          NA            NA         NA
[6,] -5.430687e-02 -0.75219479  2.609126e+00 -0.13405635  0.54016622  2.885021e-01  0.9237946
[7,] -8.395116e-01  0.03675002  2.455545e+00  0.04432025 -0.86194910  1.302758e+05  0.6062505
> dist(mat)
             1            2            3            4            5            6
2 3.726858e+00                                                                 
3 2.272605e+05 2.272606e+05                                                    
4 2.966078e+00 3.537475e+00 2.272620e+05                                       
5           NA           NA           NA           NA                          
6 4.787577e+00 5.039154e+00 2.272644e+05 3.016614e+00           NA             
7 1.302754e+05 1.302762e+05 2.619559e+05 1.302747e+05           NA 1.302755e+05
> heatmap.2( mat,
+            col = colorpanel(100,"red","yellow","green"),
+            margins = c(12, 22),
+            trace = "none", 
+            xlab = "Comparison",
+            lhei = c(2, 8),
+            scale = c("none"),
+            symbreaks = min(mat, na.rm=TRUE),
+            na.color="blue",
+            cexRow = 0.5, cexCol = 0.7,
+            main = "DE genes", 
+            dendrogram = "row", 
+            Colv = FALSE )
Error in hclustfun(distfun(x)) : 
  NA/NaN/Inf in foreign function call (arg 11)

However the situation does not appear to be specific to the case where there is a row entirely composed of NAs. For example:

> mat
              [,1]        [,2]          [,3]       [,4]       [,5]          [,6]       [,7]
[1,]           NaN         NaN           NaN        NaN         NA -7.531027e-01  0.2238252
[2,]  3.210084e-01 -1.55702840  2.777516e-01  0.2176875  1.3310334 -9.621561e-01        NaN
[3,]  1.159837e+05  0.04480172 -1.649482e+04        NaN  2.4748122  8.446133e-01 -0.4240776
[4,] -8.584051e-01         NaN           NaN  1.0557713 -1.0855826 -5.638023e-02 -0.3789979
[5,]            NA          NA -2.539003e-01 -0.4552776  0.3856384            NA         NA
[6,]           NaN  1.31986556           NaN -1.0393147 -1.9197183 -1.434064e+00  0.6334569
[7,]           NaN -0.42180912           NaN -0.8023476 -0.8264077  4.471358e+04  0.5046408
> dist(mat)
             1            2            3            4            5            6
2 5.531033e-01                                                                 
3 3.225471e+00 1.386143e+05                                                    
4 1.723619e+00 3.913983e+00 1.534332e+05                                       
5           NA 1.949799e+00 3.085851e+04 3.945524e+00                          
6 1.486699e+00 6.010961e+00 6.905415e+00 3.743585e+00 4.449179e+00             
7 8.365286e+04 5.915178e+04 5.914939e+04 5.915058e+04 2.358664e+00 5.290752e+04
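The NA in this last example has a simple explanation: according to ?dist, a distance is NA when every column is excluded for that pair, i.e. when the two rows have no column in which both values are present. Rows 1 and 5 above are such a pair. A rough sketch for finding these pairs follows; the helper name `rows_without_overlap` and the small matrix `m` are made up for illustration.

```r
# Hypothetical helper (not part of gplots or stats): list the row pairs that
# share no column in which both rows are observed. dist() gives NA for these.
rows_without_overlap <- function(mat) {
  observed <- (!is.na(mat)) * 1        # 1 where a value is present; NaN counts as missing
  shared <- tcrossprod(observed)       # shared[i, j] = columns observed in both row i and row j
  pairs <- which(shared == 0 & upper.tri(shared), arr.ind = TRUE)
  colnames(pairs) <- c("row_a", "row_b")
  pairs
}

# Made-up data: rows 1 and 2 have no observed column in common.
m <- rbind(c(NA, NA, 1, 2),
           c(3, 4, NA, NA),
           c(5, 6, 7, 8))
p <- rows_without_overlap(m)           # one pair: row_a = 1, row_b = 2
```

Running this on your own matrix before calling heatmap.2 shows which rows are responsible for the error.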
posdef
1

Just a suggestion for a practical solution in addition to posdef's very instructive answer:

Since distfun is only used to determine the structure of the dendrogram, you can simply replace the NA's in the dist matrix with values that are a bit higher than the maximum of the non-NA values.

For this, we need a new distance function (one that wraps the normal dist function and just replaces NAs):

dist_no_na <- function(mat) {
    edist <- dist(mat)
    edist[which(is.na(edist))] <- max(edist, na.rm=TRUE) * 1.1 
    return(edist)
}

and make use of this function in the heatmap.2 call:

heatmap.2(mat, ..., dendrogram="row", Colv="NA", na.color="black", distfun=dist_no_na)

Properties

This is of course not a perfect solution. It assigns numerical distance values to pairs of vectors for which there is no basis on which a (euclidean?) distance can be computed. However, it does have some desirable properties.

  1. The heatmap.2 function works :-)

  2. Rows that only contain NA's for instance are then split from the main branch first (which reflects the issue at hand nicely).

  3. I am not entirely certain what effect replacing NA values has when they are caused by other properties of the matrix. posdef pointed out that there may be such NA values. In posdef's example, there are two rows for which there is no pair of non-NA entries in the same column, i.e. it is impossible to determine a Euclidean distance. In this case, it is probably still appropriate to represent this as a particularly large distance, larger than all those that can be computed numerically.

I would not choose a replacement value much larger than the non-NA maximum. (The chosen value in the code above is 10% larger.) This would increase the distance of the split-off point of all-NA rows to the following split-off points (the relevant part of the dendrogram) and may make the relevant part of the dendrogram difficult to see.
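Property 2 can be checked without drawing anything. The sketch below repeats dist_no_na so that it runs on its own, and uses a made-up random matrix with one all-NA row; it relies on hclust's default complete linkage.

```r
# Wrapper from above, repeated so this snippet is self-contained.
dist_no_na <- function(mat) {
  edist <- dist(mat)
  edist[is.na(edist)] <- max(edist, na.rm = TRUE) * 1.1
  edist
}

set.seed(1)                                   # made-up example data
mat <- matrix(rnorm(28), nrow = 7, ncol = 4)
mat[5, ] <- NA                                # row 5 has no usable values

tree <- hclust(dist_no_na(mat))               # no error this time

# In tree$merge, negative entries are single observations. Row 5 is farther
# from everything than any other pair, so it should only join in the final merge.
last_merge <- tree$merge[nrow(tree$merge), ]
row5_joins_last <- -5 %in% last_merge
```

In the dendrogram this appears as row 5 splitting off at the root, which is the behaviour described in property 2.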

0range
0

So I am not an expert in coding at all, but I have been learning to make heatmaps in R and I kept getting the same error message for my NA data. It turns out the reason I was getting the error message was that there were NA terms in the first column of my data and R did not like that at all. So I added an extra column and filled it with 1's and it worked!! I hope maybe someone will find this useful!
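A likely reason this works: a constant column gives every pair of rows at least one column in which both are observed, so dist no longer returns NA. A small sketch with made-up data (the matrix `m` is an assumption, not the data described above):

```r
# Made-up data: rows 1 and 2 have no observed column in common.
m <- rbind(c(NA, NA, 1, 2),
           c(3, 4, NA, NA),
           c(5, 6, 7, 8))
na_before <- anyNA(dist(m))          # TRUE

padded <- cbind(m, 1)                # extra column filled with 1's
d <- dist(padded)
na_after <- anyNA(d)                 # FALSE

# The constant column contributes a difference of 0, so rows that overlap
# nowhere else end up at distance 0 from each other.
d12 <- as.matrix(d)[1, 2]            # 0
```

Note the side effect: rows that could not be compared before are now treated as identical for clustering, and the extra column shows up in the figure unless the padding is done only inside a custom distfun.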

Kahina

  • You should add more detail to your answer. How many 1 for example. – Xantium Dec 07 '17 at 00:32
  • 1
    @Simon Kahina's solution implies adding one column of identical values to the matrix. Consequently, as many 1 as the matrix has rows. This solution enables computation of an euclidean distance between all row vectors, but adds a column for which there is no interpretation to the resulting figure. – 0range May 10 '18 at 18:21
0

I apologise if this seems like I am oversimplifying it, but I know I would appreciate a simplified post like this (since I am no expert in R). I found this the easiest method so far and I'll show it with my data:

My data ranges from 0 to 114 in a data matrix with a lot of NA values, so what I did was first replace all NA values with -1 (below the range of my dataset).

library(magrittr)  # provides the %>% pipe
x <- mymatrix %>% replace(is.na(.), -1)

Then I set the breaks in heatmap.2(). If you want your NA values to be, let's say, "black" and the rest of the values to use a colour palette with a range of colours, then set your breaks using seq(). Since my data ranges from 0 to 114, I set my seq to go from 0 to 114 in increments of 1. Then in heatmap.2() I set the breaks to -1 followed by my sequence (so the breaks look like -1, 0, 1, 2, 3, etc.). I set the colours to be "black" for the -1 values (the NAs) and use 114 colours from the bluered palette for the remaining values.

seq <- seq(from = 0, to = 114, by = 1)
heatmap.2(x, col = c("black", bluered(114)), 
      trace = "none", density.info = "none", breaks=c(-1,seq))

I hope this is helpful!
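For data with a different range, the sentinel and the breaks can be derived from the matrix instead of being hard-coded. This is only a sketch: the helper name `na_sentinel_breaks` is hypothetical, and it places the sentinel strictly inside its own bin so that the smallest real value cannot share a break point with it. As noted in the comments on the question, replacing NAs with a number also changes the clustering.

```r
# Hypothetical helper: replace NA/NaN with a sentinel below the data range and
# build breaks that give the sentinel a colour bin of its own.
na_sentinel_breaks <- function(mat, n_colors = 100) {
  rng <- range(mat, na.rm = TRUE)
  step <- diff(rng) / n_colors
  stopifnot(step > 0)                        # needs at least two distinct values
  sentinel <- rng[1] - step                  # sits in the middle of the first bin
  filled <- mat
  filled[is.na(filled)] <- sentinel
  data_breaks <- seq(rng[1], rng[2], length.out = n_colors + 1)
  breaks <- c(rng[1] - 1.5 * step,           # lower edge of the sentinel bin
              rng[1] - 0.5 * step,           # boundary between sentinel and data
              data_breaks[-1])               # upper edges of the n_colors data bins
  list(filled = filled, breaks = breaks)
}

x <- matrix(c(0, 5, NA, 114, NaN, 20), nrow = 2)   # made-up data
res <- na_sentinel_breaks(x, n_colors = 114)

# Assumed usage with gplots (not run here):
# heatmap.2(res$filled, breaks = res$breaks,
#           col = c("black", bluered(114)), trace = "none")
```

The breaks vector has one more element than the colour vector (one NA colour plus n_colors data colours), which is what heatmap.2 expects.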

-1

I can't reproduce the problem. The code below works just fine. All of the NaN values are colored blue.

library(gplots)
mat = matrix( rnorm(25), 5, 5)
mat[c(1,6,8,11,15,20,22,24)] = NaN

heatmap.2( mat,
           col = colorpanel(100,"red","yellow","green"),
           margins = c(12, 22),
           trace = "none", 
           xlab = "Comparison",
           lhei = c(2, 8),
           scale = c("none"),
           symbreaks = min(mat, na.rm=TRUE),
           na.color="blue",
           cexRow = 0.5, cexCol = 0.7,
           main = "DE genes", 
           dendrogram = "row", 
           Colv = FALSE )


kdauria
  • This seems a little misleading to use as a colour palette though, as blue is just an extension of the palette you've chosen... Surely black or white would be a better choice? – will Mar 05 '15 at 13:11
  • 2
    This example does not reflect the case most likely experienced by the OP as it does not replicate the situation where `dist(mat)` includes `NA`s. See my answer below – posdef Mar 05 '15 at 13:12