
I have a collection of documents that I want to break up into words. This is an example collection:

In [1]:
import pandas as pd
docs = pd.DataFrame({'docno' : ['doc1', 'doc2'], 'doc' : ['to be or not to be', 'that is the question']})
docs

Out[1]:
   doc                   docno
0  to be or not to be    doc1
1  that is the question  doc2

I couldn't find a direct way to "unroll" each row above into a series of rows as below, but I've seen that this can be done with tolist() and then stack():

In [2]:
terms = pd.DataFrame(docs.doc.str.split().tolist(), index=docs.docno).stack()
terms = terms.reset_index()[[0, 'docno']]
terms.columns = ['term', 'docno']
terms

Out[2]:
   term      docno
0  to        doc1
1  be        doc1
2  or        doc1
3  not       doc1
4  to        doc1
5  be        doc1
6  that      doc2
7  is        doc2
8  the       doc2
9  question  doc2

I suspect that this is not optimised so that the split and stack happen in one pass; rather, split() first materialises an intermediate table whose width is the token count of the longest document, and only then is that table stacked. The documents can be extremely large, and so can the vocabulary over the whole collection, so this seems rather inefficient. I'm currently running this transformation on a ~5GB collection, which has so far used about 2.5TB of RAM; this seems to confirm my suspicion that the split and stack operations are actually done in series.
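To illustrate the suspected blow-up, here is a minimal sketch of the intermediate table that split().tolist() materialises on the example collection: its width is the token count of the longest document, and shorter rows are padded with None:

```python
import pandas as pd

docs = pd.DataFrame({'docno': ['doc1', 'doc2'],
                     'doc': ['to be or not to be', 'that is the question']})

# the intermediate wide table: one column per token position
wide = pd.DataFrame(docs.doc.str.split().tolist(), index=docs.docno)
print(wide.shape)  # -> (2, 6): width is the longest document (6 tokens)
```

On a real collection the width would be the longest document's length, with every shorter row padded out, which is where the memory goes before stack() drops the padding again.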

My question is: is there a better way to do this?

bop
  • (1) Very jealous that you have 2.5TB of RAM to work with (2) Iterators should always be considered when you have memory issues. Have you made any attempts to develop an iterative algorithm for this? – David Z Mar 29 '17 at 06:14

1 Answer


I think you can use a numpy solution - first split the column doc, then use str.len to get the length of each list of tokens; each docno is repeated that many times with numpy.repeat, while the lists themselves are flattened with itertools.chain:

import numpy as np
import pandas as pd
from itertools import chain

# split each document into a list of tokens (in place)
docs.doc = docs.doc.str.split()
df1 = pd.DataFrame({
        # repeat each docno once per token in its document
        "docno": np.repeat(docs.docno.values, docs.doc.str.len()),
        # flatten the per-document token lists into one sequence
        "term": list(chain.from_iterable(docs.doc))})[['term', 'docno']]

print (df1)
       term docno
0        to  doc1
1        be  doc1
2        or  doc1
3       not  doc1
4        to  doc1
5        be  doc1
6      that  doc2
7        is  doc2
8       the  doc2
9  question  doc2
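For what it's worth, newer pandas versions (0.25+, released after this question) provide DataFrame.explode, which performs this unrolling directly without the wide intermediate table of the stack() approach; a minimal sketch on the example collection:

```python
import pandas as pd

docs = pd.DataFrame({'docno': ['doc1', 'doc2'],
                     'doc': ['to be or not to be', 'that is the question']})

# split each document into a list of tokens, then emit one row per token
terms = (docs.assign(term=docs.doc.str.split())
             .explode('term')[['term', 'docno']]
             .reset_index(drop=True))
print(terms)
```

explode repeats the other columns (here docno) once per list element, which matches the desired term/docno layout.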
jezrael