Close to serial textfile reading performance in MySQL

Question

I am trying to perform some n-gram counting in Python, and I thought I could use MySQL (via the MySQLdb module) to organize my text data.

I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).

select * from table is too slow and uses far too much memory. I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one with queries like:

select * from table where id >= 1 and id <= 1999
select * from table where id >= 2000 and id <= 3999

and so on...

Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?

I don't care about the ordering of the records; I just want to be able to process all the documents that pertain to a certain language in my big table.

score 1 · Accepted Answer · answered Dec 10 '10 at 14:46

You can use the HANDLER statement to traverse a table (or index) in chunks. This is not very portable and works in an "interesting" way with transactions if rows appear and disappear while you're looking at it (hint: you're not going to get consistency) but makes code simpler for some applications.
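
As a rough sketch of what chunked traversal with HANDLER could look like from Python: the function below assumes a DB-API cursor such as the one MySQLdb returns, a table with a primary key (which HANDLER addresses as the index `PRIMARY`), and a trusted table name, since identifiers cannot be passed as query parameters. The function name, the table name `documents`, and the chunk size are made up for illustration.

```python
def iter_handler_rows(cursor, table, chunk_size=2000):
    """Yield every row of `table`, reading it in primary-key order in chunks.

    `table` is interpolated into the SQL text, so it must come from trusted
    code, never from user input (assumption of this sketch).
    """
    cursor.execute("HANDLER `{0}` OPEN".format(table))
    try:
        position = "FIRST"  # the first read starts at the beginning of the index
        while True:
            cursor.execute(
                "HANDLER `{0}` READ `PRIMARY` {1} LIMIT {2:d}".format(
                    table, position, chunk_size
                )
            )
            rows = cursor.fetchall()
            if not rows:
                break  # an empty chunk means the end of the index was reached
            for row in rows:
                yield row
            position = "NEXT"  # later reads continue where the last one stopped
    finally:
        # Release the handler even if the consumer stops iterating early.
        cursor.execute("HANDLER `{0}` CLOSE".format(table))
```

With a real connection this would be used roughly as `for row in iter_handler_rows(conn.cursor(), "documents"): ...`, filtering on the language column in Python. As noted above, rows inserted or deleted during the scan may or may not be seen.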

In general, you are going to take a performance hit: even if your database server is local to the machine, the data will have to be copied several times in memory, on top of some other processing. This is unavoidable, and if it really bothers you, you shouldn't use MySQL for this purpose.

score 0 · Answer 2 · answered Dec 10 '10 at 14:45

Aside from having indexes defined on whatever columns you're using to filter the query (probably language and ID, where ID already has an index courtesy of the primary key), no.
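
To make the indexing advice concrete, here is a hedged sketch of chunked reading that leans on such an index. It uses keyset pagination (remember the last id seen, ask for the next batch after it), which is my own choice of technique here, not something stated in the question. The table name `documents`, the columns `id`, `lang`, and `doc`, and the index name are all assumptions; the `%s` placeholders follow MySQLdb's parameter style.

```python
# Assumed one-time setup, run separately against the (hypothetical) table:
#   CREATE INDEX idx_lang_id ON documents (lang, id);

def iter_docs_by_language(cursor, lang, chunk_size=2000):
    """Yield (id, doc) pairs for one language, fetching chunk_size rows at a time."""
    last_id = 0  # assumes ids are positive, as with auto-increment
    while True:
        cursor.execute(
            "SELECT id, doc FROM documents "
            "WHERE lang = %s AND id > %s ORDER BY id LIMIT %s",
            (lang, last_id, chunk_size),
        )
        rows = cursor.fetchall()
        if not rows:
            break  # no rows left for this language
        for doc_id, doc in rows:
            yield doc_id, doc
        last_id = rows[-1][0]  # resume after the last id of this chunk
```

Compared with fixed id ranges, every query returns a full chunk of rows in the wanted language, so sparse languages do not produce many near-empty round trips.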

Dan Grossman

score 0 · Answer 3 · answered Dec 10 '10 at 15:17

First: you should avoid using * if you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all this in a database, especially if you are storing file names. You could use an XML format, for example, and read/write it with a SAX API.
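
For what the SAX route might look like, here is a small sketch using Python's standard xml.sax module. The file layout (a `<corpus>` root with `<doc lang="...">` children) is purely an assumption for the example, as are the class and callback names. The handler passes each matching document to a callback as soon as its closing tag is seen, so the whole file never has to sit in memory.

```python
import io
import xml.sax


class DocHandler(xml.sax.ContentHandler):
    """Call on_doc(text) for every <doc> element whose lang attribute matches."""

    def __init__(self, wanted_lang, on_doc):
        super().__init__()
        self.wanted_lang = wanted_lang
        self.on_doc = on_doc
        self._parts = None  # None while outside a matching <doc>

    def startElement(self, name, attrs):
        if name == "doc" and attrs.get("lang") == self.wanted_lang:
            self._parts = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them.
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == "doc" and self._parts is not None:
            self.on_doc("".join(self._parts))
            self._parts = None


# Tiny in-memory sample; a real corpus would be passed as a file name instead.
sample = (
    b'<corpus><doc lang="en">hello world</doc>'
    b'<doc lang="de">hallo welt</doc>'
    b'<doc lang="en">a &amp; b</doc></corpus>'
)
english_docs = []
xml.sax.parse(io.BytesIO(sample), DocHandler("en", english_docs.append))
```

In an n-gram counter the callback would update the counts directly instead of appending to a list.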

If you want a DB and something faster than MySQL, you can consider an embedded (in-process) database such as SQLite or Berkeley DB, both of which have Python bindings.
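
A minimal sketch of the SQLite option with the standard sqlite3 module. The schema and the names `documents`, `lang`, and `doc` are assumptions, and `:memory:` is used only so the example is self-contained; a real corpus would live in a database file. Iterating over the cursor fetches rows lazily, so the result set is not loaded all at once.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, lang TEXT, doc TEXT)"
    )
    conn.execute("CREATE INDEX idx_documents_lang ON documents (lang)")
    conn.executemany(
        "INSERT INTO documents (lang, doc) VALUES (?, ?)",
        [("en", "hello world"), ("de", "hallo welt"), ("en", "good morning")],
    )
    conn.commit()
    return conn


def iter_docs(conn, lang):
    """Yield the text of each document in the given language, one row at a time."""
    cursor = conn.execute("SELECT doc FROM documents WHERE lang = ?", (lang,))
    for (doc,) in cursor:
        yield doc
```

Usage would be along the lines of `for doc in iter_docs(build_demo_db(), "en"): ...`. Because the database runs inside the Python process, no rows are shipped between a server and a client.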

Greetz, J.
