I'm trying to find word patterns in a huge input. I was using a dictionary for this, and after some hours the program crashed with a MemoryError.

I modified the program: I created a database via MySQLdb and stored the pattern-index values there. For every word I check whether it is in the index and, if not, write it into the index with a value. The problem is that the database approach is too slow.

I was wondering whether there is a way to combine a dictionary and a database, for example:

if ram <90% usage:
    seek into dict
    append to dict
else:
    if not (seek into dict):
        seek into database
        append to database
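
A minimal Python sketch of this idea, under some loud assumptions: the built-in sqlite3 module stands in for the database (the setup described above uses MySQLdb), a fixed entry count stands in for the "RAM < 90%" check, and the names HybridIndex, max_in_memory and idx are hypothetical.

```python
import sqlite3


class HybridIndex:
    """Keep entries in a dict until a size cap is hit, then spill new ones to a database."""

    def __init__(self, db_path=":memory:", max_in_memory=1_000_000):
        self.cache = {}
        # Assumption: an entry-count cap approximates the "RAM < 90%" test.
        self.max_in_memory = max_in_memory
        self.spilled = False  # becomes True once anything is written to the database
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS idx (word TEXT PRIMARY KEY, value INTEGER)"
        )

    def get(self, word):
        """Return the stored value, or None if the word is not indexed."""
        if word in self.cache:
            return self.cache[word]
        if not self.spilled:
            return None  # nothing on disk yet, so skip the slow lookup
        row = self.db.execute(
            "SELECT value FROM idx WHERE word = ?", (word,)
        ).fetchone()
        return row[0] if row else None

    def add(self, word, value):
        """Store word -> value unless the word is already indexed."""
        if self.get(word) is not None:
            return False
        if len(self.cache) < self.max_in_memory:
            self.cache[word] = value
        else:
            self.db.execute(
                "INSERT INTO idx (word, value) VALUES (?, ?)", (word, value)
            )
            self.spilled = True
        return True

    def close(self):
        self.db.commit()  # one commit at the very end
        self.db.close()
```

While the dictionary has room, lookups never touch the database; only words that arrive after the cap is reached pay the database cost.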

Using a dictionary, processing 100 KB of input takes ~1.5 s.

Using a database, the same input takes ~84 s.

The original input is 16 GB. I do not know yet how long it will take to process.

Noam Hacker

1 Answer

Short answer (detailed answer to come):

Your use of MySQL was inefficient. Since you only use the database as an extension of memory, you don't want to commit at all. Just removing the commit should give you a big improvement.
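
To illustrate the point about commits, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the function name load_index and the table idx are hypothetical. The inserts are issued in one batch with a single commit at the end; if the index does not need to outlive the run, that last commit can be dropped too.

```python
import sqlite3


def load_index(words, db_path=":memory:"):
    """Insert every new word with its position, committing once at the end."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idx (word TEXT PRIMARY KEY, value INTEGER)"
    )
    rows = ((word, position) for position, word in enumerate(words))
    # INSERT OR IGNORE is SQLite syntax for skipping words that already exist;
    # the PRIMARY KEY on word is what makes a repeated word a conflict.
    conn.executemany(
        "INSERT OR IGNORE INTO idx (word, value) VALUES (?, ?)", rows
    )
    conn.commit()  # a single commit, not one per insert
    return conn


conn = load_index(["the", "cat", "the", "sat"])
print(conn.execute("SELECT COUNT(*) FROM idx").fetchone()[0])
```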

Better than MySQL: use leveldb (pip install leveldb) with sync = False.

Adjust the following values to the amount of memory you want to use:

  • block_cache_size = 512*1024*1024  # 512 MB, the more important of the two
  • write_buffer_size = 10*1024*1024  # 10 MB

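Since you got a MemoryError, you are probably on a 32-bit system (or running a 32-bit Python build), although a 64-bit process can also raise it once RAM and swap run out. On 32 bits, the total memory available to a process cannot exceed 4 GB, and in practice it is usually less, so adjust the values to fit within min(your system memory, 4 GB).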
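
A quick way to check whether the interpreter itself is a 32-bit build, using only the standard library (the helper names pointer_bits and is_64bit_python are hypothetical):

```python
import struct
import sys


def pointer_bits():
    """Size of a C pointer in this interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


def is_64bit_python():
    """True when the interpreter can index more than 2**32 items."""
    return sys.maxsize > 2**32


print(pointer_bits(), is_64bit_python())
```

If this reports 32 bits on a 64-bit machine, installing a 64-bit Python build lifts the address-space limit without any code changes.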

Xavier Combelle