
I'm trying to cluster data and then plot a heatmap using heatmap.2 from gplots. The script works perfectly with matrices of up to 30,000 rows, but the problem is that I'm using matrices of up to 500,000 rows (data_sel), and when I try to cluster I get this error:

heatmap.2(as.matrix(data_sel), col = greenred(10), trace = "none",
          cexRow = 0.3, cexCol = 0.3, ColSideColors = fenot.colour,
          margins = c(20, 1), labCol = "", labRow = "",
          distfun = function(x) dist(x, method = "manhattan"))
Error in vector("double", length) : vector size specified is too large

Is there any approach in R to plot heatmaps with data this big?

Thanks in advance

– user976991
    `heatmap.2` is in `gplots`, not `ggplot2`. I know I've answered questions about this on SO before -- the problem usually comes in the distance-computation stage. Have you tried searching SO yet ... e.g. http://stackoverflow.com/questions/5667107/r-how-can-i-make-a-heatmap-with-a-large-matrix ? – Ben Bolker Jan 17 '12 at 15:06
  • Agreed that the problem comes from the distance calculation. If you have n=500,000 rows, then you need to store n*(n-1)/2 ≈ 125 billion distances. If your distances are of type `double`, then each distance takes 8 bytes. That means you need roughly 10^12 bytes (about 1 terabyte) just to store the distance matrix. Hence, you are going to need to find a way to (a) summarize your data to a more manageable size or (b) find some code that doesn't store the entire distance matrix when clustering. – kdauria Feb 16 '14 at 03:24
  • Depending on your data, you might consider just sampling it. In my experience, I've gotten very similar-looking heatmaps when using just 1,000 rows of my 50,000-row matrix (see the sketch after these comments). – kdauria Feb 16 '14 at 03:34
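
For illustration, here is a minimal sketch of that sampling approach, assuming the data_sel matrix and colour settings from the question; the 1,000-row sample size is an arbitrary choice, not something from the question:

set.seed(1)                            # make the subsample reproducible
idx <- sample(nrow(data_sel), 1000)    # sample size is a guess; adjust as needed
heatmap.2(as.matrix(data_sel[idx, ]), col = greenred(10), trace = "none",
          distfun = function(x) dist(x, method = "manhattan"))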

1 Answer


This is a very old question, but here is a solution using Python that will scale up to data of that size:

import seaborn as sns
import pandas as pd

# Read the tab-delimited file and use the 'ID' column as the row index
df = pd.read_table('file.txt', header=0)
df = df.set_index('ID')
df.index.name = None  # drop the index name so it is not drawn on the plot

sns.set(font_scale=0.1)  # tiny fonts so the many row labels stay legible

# Ward-linkage clustering on Euclidean distances; standard_scale=1 rescales
# each column to the 0-1 range, robust=True sets the colour-map limits from
# robust quantiles rather than the extremes, and yticklabels=1 draws a label
# for every row.
cm = sns.clustermap(df, metric="euclidean", standard_scale=1, method="ward",
                    cmap="viridis", robust=True, yticklabels=1)
cm.cax.set_visible(False)  # hide the colour bar
cm.savefig('heatmap.pdf')

It will probably take a couple of hours to run, but it worked for me on a ~1000 × ~36000 matrix.

– jared_mamrot
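
For completeness, a hedged R sketch of the other workaround from the comments, option (a): summarize the data to a more manageable size before plotting. Using k-means for the summary step is an assumption on my part, and the 500 centers and iter.max = 50 are arbitrary choices:

library(gplots)
# Collapse the 500,000 rows into a few hundred k-means centers, then plot
# only the centers; this avoids computing the full row-by-row distance matrix.
km <- kmeans(as.matrix(data_sel), centers = 500, iter.max = 50)
heatmap.2(km$centers, col = greenred(10), trace = "none",
          distfun = function(x) dist(x, method = "manhattan"))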