I have a collection of documents that I want to break up into words. This is an example collection:
In [1]:
import pandas as pd
docs = pd.DataFrame({'docno' : ['doc1', 'doc2'], 'doc' : ['to be or not to be', 'that is the question']})
docs
Out[1]:
doc docno
0 to be or not to be doc1
1 that is the question doc2
I couldn't find a direct way to "unroll" each row above into one row per word, but I have seen it done with tolist() followed by stack():
In [2]:
# split each doc into a list of tokens, build a wide frame indexed by docno,
# then stack it into one row per (docno, position) pair
terms = pd.DataFrame(docs.doc.str.split().tolist(), index=docs.docno).stack()
# keep only the token column (0) and the docno level of the index
terms = terms.reset_index()[[0, 'docno']]
terms.columns = ['term', 'docno']
terms
Out[2]:
term docno
0 to doc1
1 be doc1
2 or doc1
3 not doc1
4 to doc1
5 be doc1
6 that doc2
7 is doc2
8 the doc2
9 question doc2
I suspect this is not optimised so that the split and the stack happen together; rather, split() first materialises an intermediate table with N columns (where N is the number of tokens in the longest document), padded with None for shorter documents, and only then is that table stacked. The documents can be extremely long and the collection very large, so this seems rather inefficient. I'm currently running this transformation on a ~5GB collection and it has used about 2.5TB of RAM so far, which seems to confirm my suspicion that the split and stack operations are done in sequence, with the full intermediate table held in memory.
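To make the suspicion concrete, here is a small sketch that materialises only the split() step on the toy collection above, so the padded wide table can be inspected directly (this is what I believe is built internally before stacking):

```python
import pandas as pd

docs = pd.DataFrame({'docno': ['doc1', 'doc2'],
                     'doc': ['to be or not to be', 'that is the question']})

# Build just the intermediate wide frame: one column per token position,
# padded with None up to the length of the longest document.
wide = pd.DataFrame(docs.doc.str.split().tolist(), index=docs.docno)

print(wide.shape)   # width equals the token count of the longest document
print(wide)
```

On this toy data the frame is 2x6 (doc1 has six tokens), with two None cells padding doc2; on a large collection that padding and the full-width table are pure memory overhead.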
My question is: is there a better way to do this?