I am trying to make a heatmap of a dissimilarity matrix that has a lot of NAs. However, I ran into problems when trying to perform clustering. Without clustering, the heatmap works fine. I do not want to impute or remove the NAs. Is there any way to perform clustering? I understand that calculating distances is a problem with NAs, but there should be a way around it, right?

I get the following error message:

Error in hclust(get_dist(submat, distance), method = method) : NA/NaN/Inf in foreign function call (arg 10)

In addition: Warning message: NA exists in the matrix, calculating distance by removing NA values.

Edit:

The data I am using is an unusual matrix with a lot of NAs. Perhaps this is the problem? But I would like to visualize these NAs in the heatmap as well, so I only want to cluster the rows, not the columns.

[Image: dissimilar matrix example]

Wolfje

2 Answers

I'm not sure why you are getting that error. The dist function handles NAs by default: for each pair of rows it excludes the missing coordinates and rescales the remaining sum. Below is an example. You might also simply compute your distance matrix first and then pass it to hclust. The vegan package can calculate many distance metrics, and lets you specify whether NAs should be removed:

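library(vegan)  # provides vegdist()

# data
set.seed(1)
x <- matrix(rnorm(100), nrow = 5)  # unused below; kept so the random stream is unchanged
df <- matrix(rnorm(100), nrow = 5)

# make missing values (10% of the cells)
nvals <- length(c(df))
set.seed(1)
df[sample(x = nvals, size = nvals*0.1)] <- NaN

# Euclidean distance; dist() skips NA/NaN coordinates pairwise
hc <- hclust(dist(df), method = "average")
plot(hc)

# other distance metrics via vegan; na.rm = TRUE ignores missing values pairwise
D <- vegdist(df, method = "manhattan", na.rm = TRUE)
hc <- hclust(D, method = "average")
plot(hc)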
Marc in the box
  • Thanks Marc. I see that in your example the data has only a few NAs, which might help with clustering anyway. Please take a look at the example I have now added to my question. The table has a lot of NAs in many rows. I do not want to delete them; I want to visualize them in the heatmap. – Wolfje Jun 09 '21 at 08:50
  • The issue is probably that some pairs of samples have no non-missing values in common, in which case a distance cannot be calculated for them. You will likely have to filter out samples that contain too many NAs – Marc in the box Jun 09 '21 at 09:27
  • Yes, I am aware of that solution: when you remove them, it is possible to cluster. But there should be a way to visualize them without filtering out any sample. I'll keep looking. – Wolfje Jun 09 '21 at 09:41
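
To make the point from the comments concrete, here is a minimal sketch of how the error can arise. The toy matrix `m` and its row names are made up for illustration; the idea is only that when two rows have no observed columns in common, dist() returns NA for that pair, and hclust() then refuses the distance object.

```r
# Hypothetical toy data: rows "a" and "b" share no observed columns
m <- rbind(
  a = c(1,  2,  NA, NA),
  b = c(NA, NA, 3,  4),
  c = c(1,  2,  3,  4)
)

d <- dist(m)   # pairwise deletion of NAs; nothing is left for the a-b pair
print(d)       # the a-b entry is NA

has_na_dist <- anyNA(d)

# hclust() stops with an error when the distances contain NA
res <- tryCatch(hclust(d), error = function(e) conditionMessage(e))
print(res)

# Which pairs are responsible
bad_pairs <- which(is.na(as.matrix(d)), arr.ind = TRUE)
```

Checking `anyNA(dist(your_matrix))` before clustering should tell you whether this is what is happening with your data, and `bad_pairs` shows which rows are involved.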
Okay, I managed to solve this problem with a simple imputation: I replaced all NAs with a single constant.

This lets me visualize the entire dataset without removing any samples or rows, and cluster both rows and columns. When I want to show where the NAs are in the dataset, I just give the constant its own colour in the plot.

In this way, all NAs are treated the same, rather than each NA being assigned a value derived from other samples in its row or column (as with mean, median, or regression imputation). This approach works best for my dataset because it does not skew the values in any direction.
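
A rough sketch of that approach in base R, in case it helps someone. The data, the constant `na_const = -1`, and the colours are assumptions for illustration only, not my real dataset; the constant just needs to lie outside the range of the real values so it can get its own colour bin.

```r
# Made-up data with many NAs (values are in (0, 1))
set.seed(1)
mat <- matrix(runif(60), nrow = 10,
              dimnames = list(paste0("r", 1:10), paste0("c", 1:6)))
mat[sample(length(mat), 25)] <- NA

# Replace every NA with one constant outside the data range (assumed: -1)
na_const <- -1
filled <- mat
filled[is.na(filled)] <- na_const

# Clustering now works because no distance is undefined
row_hc <- hclust(dist(filled), method = "average")
col_hc <- hclust(dist(t(filled)), method = "average")
ordered <- filled[row_hc$order, col_hc$order]

# First bin catches the constant (grey), the rest is a gradient for real values
breaks <- c(-1.5, seq(0, 1, length.out = 11))
cols <- c("grey70", heat.colors(10))

image(t(ordered), col = cols, breaks = breaks, axes = FALSE,
      main = "Grey cells were NA")
```

Keep in mind that the constant takes part in the distance calculation, so rows with NAs in the same positions will tend to cluster together.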

Wolfje