ElasticSearch newbie here. I have a set of text documents which I've indexed using ElasticSearch through the Python ElasticSearch client. Now I want to do some machine learning with the documents using Python and scikit-learn. I need to accomplish the following.
- Use the ElasticSearch analyzers to process the text (stemming, lowercase, etc.)
- Retrieve the processed documents (or analyzed tokens) from the index.
- Convert the processed documents into a Term-Document Matrix for classification (perhaps using the CountVectorizer in scikit-learn). Or alternatively, maybe there's some way to retrieve a TDM straight from ElasticSearch.
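For the last step, here's a minimal sketch of what I have in mind for the CountVectorizer route, assuming I can somehow get the analyzed token lists out of ES (the token lists below are made up for illustration). Passing a callable as `analyzer` makes CountVectorizer skip its own tokenization and just count the tokens I hand it:

```python
# Sketch: build a TDM from pre-analyzed token lists with scikit-learn.
# The token lists are hypothetical stand-ins for ES analyzer output.
from sklearn.feature_extraction.text import CountVectorizer

analyzed_docs = [
    ["quick", "brown", "fox"],
    ["quick", "fox", "fox"],
]

# analyzer=<callable> disables sklearn's tokenization/preprocessing,
# so each document is counted exactly as ES analyzed it.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
tdm = vectorizer.fit_transform(analyzed_docs)  # sparse docs x terms matrix
```

But that still leaves the question of how to get those token lists out of ES in the first place.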
I'm having trouble figuring out the right way to go about this, and there doesn't seem to be an easy, ready-made way to do it from ElasticSearch.
For example, I could just retrieve the raw, unanalyzed documents from ES and then process them in Python, but I specifically want to use ES's analyzers. I could also run ES's analyzers on each document every time I query a set of documents, but that seems redundant, since the text should already have been analyzed and stored in the index. Alternatively, I think I can ask ES for the term vectors of each document, manually extract the tokens and counts from each response, and then build the TDM myself from those counts. That's the most direct approach I've come up with so far.
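To make the term-vector idea concrete, here's a rough sketch of what I'm imagining. The index name, field name, and doc IDs are placeholders, and I've kept the matrix-building part as a pure function over termvectors-shaped response dicts so the ES fetch (commented out) is separate:

```python
# Sketch: turn termvectors API responses into a sparse term-document
# matrix. Response structure assumed from the termvectors API:
# resp["term_vectors"][field]["terms"][term]["term_freq"].
from scipy.sparse import dok_matrix

def tdm_from_termvectors(responses, field="text"):
    """Build a (docs x terms) sparse count matrix plus a vocabulary
    mapping from a list of termvectors responses, one per document."""
    vocab = {}          # term -> column index
    doc_counts = []     # per-document {column: count}
    for resp in responses:
        terms = resp["term_vectors"][field]["terms"]
        counts = {}
        for term, info in terms.items():
            col = vocab.setdefault(term, len(vocab))
            counts[col] = info["term_freq"]
        doc_counts.append(counts)
    X = dok_matrix((len(doc_counts), len(vocab)), dtype=int)
    for row, counts in enumerate(doc_counts):
        for col, count in counts.items():
            X[row, col] = count
    return X.tocsr(), vocab

# Against a live cluster I'd presumably fetch responses with the
# Python client, something like (untested, names are placeholders):
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# responses = [es.termvectors(index="my-index", id=doc_id, fields=["text"])
#              for doc_id in doc_ids]
# X, vocab = tdm_from_termvectors(responses)
```

This works in principle, but it means one termvectors request per document plus hand-rolled bookkeeping, which is why I'm asking whether there's something simpler.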
Are there any easier or more direct paths to get a TDM of the analyzed texts from an ES index into Python to work with machine learning packages?