I asked this question here: How to convert occurence matrix to co-occurence matrix

I realized that my data is too big to do this in R: my computer hangs. The actual data is a text file with ~5 million rows and 600 columns. I think Python may be an alternative way to do this.

learner
  • What's your question? – Wooble Oct 08 '13 at 18:06
  • Why are you asking the same question again if you already asked it in that other question? – BrenBarn Oct 08 '13 at 18:07
  • @BrenBarn: the other question is about implementing it in R. This question is about implementing it in Python. – Tamás Oct 08 '13 at 18:08
  • @Tamás: The other question is also tagged "Python". – BrenBarn Oct 08 '13 at 18:09
  • Initially I thought R would be able to do this, but my actual data is so big that R takes forever to read it into memory. That is why I asked this question again. – learner Oct 08 '13 at 18:10
  • Assuming I understand you and the output matrix you expect is 600x600, then R can handle this too. You don't need to store the whole file in memory at once, after all. You can certainly do it easily in Python as well, but if you already have processing tools you're using in R it's probably not worth porting just for this. – DSM Oct 08 '13 at 18:14

1 Answer

This is how you would translate the R code to Python:

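>>> import numpy as np
>>> a = np.array([[0, 1, 0, 0, 1, 1],
...               [0, 0, 1, 1, 0, 1],
...               [1, 1, 1, 1, 0, 0],
...               [1, 1, 1, 0, 1, 1]])
>>> # entry (i, j) counts the rows in which columns i and j are both 1
>>> acov = np.dot(a.T, a)
>>> # a column always co-occurs with itself, so zero out the diagonal
>>> acov[np.diag_indices_from(acov)] = 0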
>>> acov
array([[0, 2, 2, 1, 1, 1],
       [2, 0, 2, 1, 2, 2],
       [2, 2, 0, 2, 1, 2],
       [1, 1, 2, 0, 0, 1],
       [1, 2, 1, 0, 0, 2],
       [1, 2, 2, 1, 2, 0]])

However, you have a very big dataset. If you don't want to assemble the co-occurrence matrix piece by piece and you store your values in int64, then with 3e+9 numbers (5 million rows × 600 columns) it will take 24 GB of RAM just to hold the data: http://www.wolframalpha.com/input/?i=3e9+*+8+bytes. So you probably want to think it over and decide which dtype to store your data in: http://docs.scipy.org/doc/numpy/user/basics.types.html. Using int16 (about 6 GB for the same array) will probably make the dot product feasible on a decent desktop PC nowadays, but note that np.dot on int16 inputs also returns int16, so any count above 32,767 would overflow.
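If you do assemble the matrix piece by piece, a minimal sketch could look like the following. It relies on the fact that `a.T · a` equals the sum of `block.T · block` over disjoint blocks of rows, so only one block and the 600x600 accumulator need to be in memory at a time. The function name `cooccurrence_in_chunks` and the `chunk_size` parameter are made up for illustration, and parsing your text file into rows of 0/1 values is left out because the file format isn't given.

```python
import numpy as np

def cooccurrence_in_chunks(rows, n_cols, chunk_size=100_000):
    """Sum block.T . block over row chunks (hypothetical helper).

    `rows` is any iterable that yields one 0/1 row of length n_cols
    at a time, e.g. parsed lines of a text file.
    """
    # int64 accumulator so counts in the millions cannot overflow
    total = np.zeros((n_cols, n_cols), dtype=np.int64)
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            block = np.asarray(chunk, dtype=np.int64)
            total += np.dot(block.T, block)
            chunk = []
    if chunk:  # leftover rows that did not fill a whole chunk
        block = np.asarray(chunk, dtype=np.int64)
        total += np.dot(block.T, block)
    np.fill_diagonal(total, 0)  # drop self co-occurrence
    return total

# Same toy matrix as above, processed three rows at a time
a = np.array([[0, 1, 0, 0, 1, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 1, 1]])
result = cooccurrence_in_chunks(a, n_cols=6, chunk_size=3)
```

The chunk size only trades memory for the number of dot products; the result should not depend on it.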

CT Zhu
  • Consider converting to sparse matrices ([`scipy.sparse`](http://docs.scipy.org/doc/scipy/reference/sparse.html)) – ali_m Oct 08 '13 at 19:08
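A rough sketch of the sparse route suggested in this comment, using the toy matrix from the answer. Whether it helps on the real data depends on how many entries are actually zero, which the question doesn't say, so treat this as an illustration only.

```python
import numpy as np
from scipy import sparse

a = np.array([[0, 1, 0, 0, 1, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 1, 1]])

# CSR stores only the nonzero entries of the occurrence matrix
a_sparse = sparse.csr_matrix(a)

# The product is only n_cols x n_cols, so converting it to dense is cheap
cooc = a_sparse.T.dot(a_sparse).toarray()
np.fill_diagonal(cooc, 0)  # drop self co-occurrence
```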