I have a database of raw text that needs to be analysed. For example, I have collected the title tags of hundreds of millions of individual webpages and clustered them by topic. I am now interested in performing some additional tests on subsets of each topic cluster. The problem is two-fold. First, I cannot fit all of the text into memory to evaluate it. Second, I need to run several of these analyses in parallel, so even if I could fit a single subset into memory, I certainly could not fit many subsets into memory at once.
I have been working with generators, but I often need information about rows of data that have already been loaded and evaluated.
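For what it is worth, here is a minimal sketch of the pattern I have been using: stream rows from MySQL through a generator with an unbuffered server-side cursor, and keep only a small running summary (here, token counts) in memory instead of the rows themselves. The MySQLdb driver and the hypothetical `pages` table with `cluster_id`/`title` columns are just assumptions for illustration.

```python
import MySQLdb
import MySQLdb.cursors
from collections import Counter

def stream_titles(conn, cluster_id):
    """Yield title strings one at a time without buffering the whole result set client-side."""
    cur = conn.cursor()
    # Hypothetical schema: pages(cluster_id INT, title TEXT)
    cur.execute("SELECT title FROM pages WHERE cluster_id = %s", (cluster_id,))
    for (title,) in cur:
        yield title
    cur.close()

def summarize(conn, cluster_id):
    """Keep only a compact summary (token counts) in memory, not the raw rows."""
    counts = Counter()
    for title in stream_titles(conn, cluster_id):
        counts.update(title.lower().split())
    return counts

if __name__ == "__main__":
    conn = MySQLdb.connect(
        host="localhost", user="user", passwd="password", db="crawl",
        cursorclass=MySQLdb.cursors.SSCursor,  # server-side cursor: rows stream from MySQL
    )
    print(summarize(conn, cluster_id=1).most_common(10))
    conn.close()
```

The point is that only the summary object stays resident; the rows themselves are discarded as soon as they have been counted. The limitation, as noted above, is when the analysis genuinely needs to revisit earlier rows rather than a summary of them.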
My question is this: what are the best methods for handling and analysing data that cannot fit into memory? The data necessarily must be extracted from some sort of database (currently MySQL, but I will likely be switching to a more powerful solution soon).
I am building the software that handles the data in Python.
Thank you,
EDIT
I will be researching and brainstorming on this all day and plan on continuing to post my thoughts and findings. Please leave any input or advice you might have.
IDEA 1: Tokenize words and n-grams and save them to file.
- For each string pulled from the database, tokenize it using the tokens in an existing file; if a token does not exist yet, create it.
- For each word token, combine from right to left until a single reduced token represents all the words in the string.
- Search an existing in-memory list of reduced tokens for potential matches and similarities. Each reduced token carries an identifier that indicates its token categories.
- If a reduced token (one created by combining word tokens) matches a tokenized string of interest categorically but not directly, break it back down into its constituent word tokens and compare them word token by word token against the string of interest.
I have no idea whether a library or module that does this already exists, nor am I sure how much benefit I would gain from it. However, my priorities are 1) conserve memory, and only then 2) worry about runtime. Thoughts?
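To make the token-file part of the idea concrete, here is a rough sketch of a persistent word-to-integer-id mapping, so each title can be stored and compared as a tuple of small integers instead of raw text. The file name, the tab-separated format, and the plain-dict approach are all my own assumptions, and it does not yet cover the category identifiers or the right-to-left combination step.

```python
import os

TOKEN_FILE = "tokens.txt"  # assumed format: one "word<TAB>id" pair per line

def load_tokens(path=TOKEN_FILE):
    """Load the existing word -> id mapping, or start empty if the file is missing."""
    tokens = {}
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                word, token_id = line.rstrip("\n").split("\t")
                tokens[word] = int(token_id)
    return tokens

def save_tokens(tokens, path=TOKEN_FILE):
    """Persist the mapping so later runs reuse the same ids."""
    with open(path, "w") as f:
        for word, token_id in tokens.items():
            f.write("%s\t%d\n" % (word, token_id))

def tokenize(text, tokens):
    """Map each word to its id, creating new ids for unseen words."""
    ids = []
    for word in text.lower().split():
        if word not in tokens:
            tokens[word] = len(tokens)
        ids.append(tokens[word])
    return tuple(ids)  # the reduced representation of the whole string

# Example usage
tokens = load_tokens()
reduced = tokenize("Python memory efficient text analysis", tokens)
save_tokens(tokens)
print(reduced)
```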
EDIT 2
Hadoop is definitely going to be the solution to this problem. I found some great resources on natural language processing in Python and Hadoop (see the links below, and a small Hadoop Streaming sketch after them):
- http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python
- http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf
- http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
- https://github.com/klbostee/dumbo/wiki/Short-tutorial
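To make the MapReduce angle concrete, here is a minimal Hadoop Streaming pair in Python, in the spirit of the Michael Noll tutorial above: a mapper that emits one (word, 1) pair per word in a title, and a reducer that sums counts for each word from the sorted stream. It is only a word-count sketch, not the clustering analysis itself.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming mapper: one title per input line,
# emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key,
# so counts for the same word are consecutive and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

Both scripts are passed to the hadoop-streaming jar via -mapper and -reducer (and shipped to the cluster with -file); the exact jar path depends on the Hadoop installation.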
Thanks for your help!