I have a database of raw text that needs to be analysed. For example, I have collected the title tags of hundreds of millions of individual webpages and clustered them by topic. I am now interested in performing some additional tests on subsets of each topic cluster. The problem is two-fold. First, I cannot fit all of the text into memory to evaluate it. Second, I need to run several of these analyses in parallel, so even if I could fit a single subset into memory, I certainly could not fit many subsets into memory at once.
I have been working with generators, but I often need information about rows of data that have already been loaded and evaluated.
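For what it is worth, here is a minimal sketch of the pattern I have been using: stream rows from MySQL through a generator with an unbuffered server-side cursor, and keep only a small running summary (here, token counts) in memory instead of the rows themselves. The MySQLdb driver and the hypothetical `pages` table with `cluster_id`/`title` columns are just assumptions for illustration.

```python
import MySQLdb
import MySQLdb.cursors
from collections import Counter

def stream_titles(conn, cluster_id):
    """Yield title strings one at a time without buffering the whole result set client-side."""
    cur = conn.cursor()
    # Hypothetical schema: pages(cluster_id INT, title TEXT)
    cur.execute("SELECT title FROM pages WHERE cluster_id = %s", (cluster_id,))
    for (title,) in cur:
        yield title
    cur.close()

def summarize(conn, cluster_id):
    """Keep only a compact summary (token counts) in memory, not the raw rows."""
    counts = Counter()
    for title in stream_titles(conn, cluster_id):
        counts.update(title.lower().split())
    return counts

if __name__ == "__main__":
    conn = MySQLdb.connect(
        host="localhost", user="user", passwd="password", db="crawl",
        cursorclass=MySQLdb.cursors.SSCursor,  # server-side cursor: rows stream from MySQL
    )
    print(summarize(conn, cluster_id=1).most_common(10))
    conn.close()
```

The point is that only the summary object stays resident; the rows themselves are discarded as soon as they have been counted. The limitation, as noted above, is when the analysis genuinely needs to revisit earlier rows rather than a summary of them.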
My question is this: what are the best methods for handling and analysing data that cannot fit into memory? The data necessarily must be extracted from some sort of database (currently MySQL, but I will likely be switching to a more powerful solution soon).
I am building the software that handles the data in Python.
Thank you,
EDIT
I will be researching and brainstorming on this all day and plan on continuing to post my thoughts and findings. Please leave any input or advice you might have.
IDEA 1: Tokenize words and n-grams and save them to file.
- For each string pulled from the database, tokenize it using the tokens in an existing file; if a token does not exist yet, create it.
- For each word token, combine from right to left until a single reduced token represents all the words in the string.
- Search an existing in-memory list of reduced tokens for potential matches and similarities. Each reduced token carries an identifier that indicates its token categories.
- If a reduced token (one created by combining word tokens) matches a tokenized string of interest categorically but not directly, break it back down into its constituent word tokens and compare them word token by word token against the string of interest.
I have no idea whether a library or module that does this already exists, nor am I sure how much benefit I would gain from it. However, my priorities are 1) conserve memory, and only then 2) worry about runtime. Thoughts?
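To make the token-file part of the idea concrete, here is a rough sketch of a persistent word-to-integer-id mapping, so each title can be stored and compared as a tuple of small integers instead of raw text. The file name, the tab-separated format, and the plain-dict approach are all my own assumptions, and it does not yet cover the category identifiers or the right-to-left combination step.

```python
import os

TOKEN_FILE = "tokens.txt"  # assumed format: one "word<TAB>id" pair per line

def load_tokens(path=TOKEN_FILE):
    """Load the existing word -> id mapping, or start empty if the file is missing."""
    tokens = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                word, token_id = line.rstrip("\n").split("\t")
                tokens[word] = int(token_id)
    return tokens

def save_tokens(tokens, path=TOKEN_FILE):
    """Persist the mapping so later runs reuse the same ids."""
    with open(path, "w") as f:
        for word, token_id in tokens.items():
            f.write("%s\t%d\n" % (word, token_id))

def tokenize(text, tokens):
    """Map each word to its id, creating new ids for unseen words."""
    ids = []
    for word in text.lower().split():
        if word not in tokens:
            tokens[word] = len(tokens)
        ids.append(tokens[word])
    return tuple(ids)  # the reduced representation of the whole string

# Example usage
tokens = load_tokens()
reduced = tokenize("Python memory efficient text analysis", tokens)
save_tokens(tokens)
print(reduced)
```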
EDIT 2
Hadoop is definitely going to be the solution to this problem. I found some great resources on natural language processing in Python and Hadoop (see the links below, and a small Hadoop Streaming sketch after them):
- http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python
- http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
- https://github.com/klbostee/dumbo/wiki/Short-tutorial
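To make the MapReduce angle concrete, here is a minimal Hadoop Streaming pair in Python, in the spirit of the Michael Noll tutorial above: a mapper that emits one (word, 1) pair per word in a title, and a reducer that sums counts for each word from the sorted stream. It is only a word-count sketch, not the clustering analysis itself.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: one title per input line,
# emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts for the same word are consecutive and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Both scripts are passed to the hadoop-streaming jar via -mapper and -reducer (and shipped to the cluster with -file); the exact jar path depends on the Hadoop installation.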
Thanks for your help!