
I have a defaultdict that stores co-occurrence data for every pair of words in a dataset. I did this to get a sparse representation and save memory, since not every pair is present in the dataset.

Is there a standard function that can convert this defaultdict to a numpy matrix, if possible in some sparse representation? I know how to convert a dict to a numpy array, but I am looking for a more efficient way to convert the defaultdict to a matrix.

If that is not possible, is there a standard function that converts the defaultdict to a CSV file, so that I can load the CSV using numpy?

Edit - I have found a workaround using pandas. I convert the defaultdict to a DataFrame and then the DataFrame to a numpy matrix. Is there a better method than this?

Sadly, this does not help with saving memory.

Amrith Krishna
  • Can you give an example of your defaultdict, and also the expected output? – Colonel Beauvel May 30 '16 at 11:59
  • When it comes to accessing values that are already present, a `defaultdict` is the same as a regular `dict`. What are the keys and values of this `dict` like? Words or indexes? What kind of array layout do you want? If @Eric's answer does not fit, give us a small example - of the dictionary and desired array (possibly sparse). – hpaulj May 30 '16 at 17:07
  • On building a sparse matrix from a dictionary of dictionaries. http://stackoverflow.com/questions/27770906/why-are-lil-matrix-and-dok-matrix-so-slow-compared-to-common-dict-of-dicts – hpaulj May 30 '16 at 17:16

1 Answer


Assuming your data looks something like this:

from collections import defaultdict

data = defaultdict(int)  # keys are (row, column) index pairs
data[0,0] = 10
data[1,1] = 100
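The question does not say whether the keys are words or integer indices. If they are word pairs, they would need mapping to integers first. A minimal sketch of that step, where `word_counts`, `words` and `vocab` are hypothetical names and the sample words are made up for illustration:

```python
from collections import defaultdict

# Hypothetical co-occurrence data keyed by (word, word) pairs.
word_counts = defaultdict(int)
word_counts["cat", "dog"] = 3
word_counts["dog", "fish"] = 1

# Give each distinct word an integer index (sorted so the mapping is reproducible).
words = sorted({w for pair in word_counts for w in pair})
vocab = {w: idx for idx, w in enumerate(words)}

# Re-key the data by (i, j) integer pairs, matching the layout assumed above.
data = defaultdict(int)
for (a, b), count in word_counts.items():
    data[vocab[a], vocab[b]] = count
```

Keeping `vocab` around lets you translate matrix rows and columns back to words later.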

You want to use scipy.sparse.coo_matrix:

import scipy.sparse

items = list(data.items())  # list only needed for python3
vs = [v for (i, j), v in items]  # values
ii = [i for (i, j), v in items]  # row indices
jj = [j for (i, j), v in items]  # column indices
matrix = scipy.sparse.coo_matrix((vs, (ii, jj)))

Which gives slightly strange output:

>>> print(matrix)
  (0, 0)    10
  (1, 1)    100

But for most arithmetic you can treat this object as though it were a dense matrix, and `matrix.toarray()` converts it to a dense NumPy array if you need one.
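The conversion above can be wrapped into a small helper. This is a minimal sketch rather than a definitive implementation: the names `to_triplets` and `to_coo` are hypothetical, and it assumes every key is an `(i, j)` tuple of non-negative integers. It also accepts an explicit `shape`, which `coo_matrix` would otherwise infer from the largest indices present.

```python
from collections import defaultdict


def to_triplets(data):
    """Split a {(i, j): value} mapping into parallel row, column and value lists."""
    rows, cols, vals = [], [], []
    for (i, j), v in data.items():
        rows.append(i)
        cols.append(j)
        vals.append(v)
    return rows, cols, vals


def to_coo(data, shape=None):
    """Build a scipy.sparse.coo_matrix from the mapping (requires SciPy)."""
    # Imported lazily so that to_triplets stays usable without SciPy installed.
    from scipy.sparse import coo_matrix

    rows, cols, vals = to_triplets(data)
    # Pass shape explicitly when trailing rows or columns may be empty.
    return coo_matrix((vals, (rows, cols)), shape=shape)


data = defaultdict(int)
data[0, 0] = 10
data[1, 1] = 100
data[0, 2] = 7

rows, cols, vals = to_triplets(data)
```

From there, `to_coo(data, shape=(3, 3)).tocsr()` would give a format better suited to row slicing and repeated arithmetic than COO.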

Eric
  • There is also a `dok` sparse format, which is a `dict` subclass. The keys are `(i,j)` tuples. I found in other SO questions that the fastest way to add values to a `dok` is with an `update` from another dictionary. – hpaulj May 30 '16 at 16:58
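A rough sketch of the `dok` route mentioned in this comment, assuming the same `(i, j)`-keyed `data` as in the answer and an explicitly chosen shape. It uses plain item assignment rather than `update`, which is likely slower but does not rely on the matrix behaving like a `dict`:

```python
from collections import defaultdict

import scipy.sparse

data = defaultdict(int)
data[0, 0] = 10
data[1, 1] = 100

# The shape must be given up front; (2, 2) is an assumption for this sample data.
matrix = scipy.sparse.dok_matrix((2, 2), dtype=int)
for (i, j), v in data.items():
    matrix[i, j] = v

dense = matrix.toarray()  # dense NumPy array, for inspection only
```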