I am trying to perform some n-gram counting in python and I thought I could use MySQL (MySQLdb module) for organizing my text data.
I have a pretty big table, around 10mil records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar
field (e.g. "en", "de", "es" etc..)
select * from table
is too slow and memory devastating.
I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one with queries like:
select * from table where id >= 1 and id <= 1999
select * from table where id >= 2000 and id <= 2999
and so on...
Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?
I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.