I want to count MANY items in memory. I used Counter, but it got slower and slower as I added more data. That's expected, since it is essentially a hash table, but I didn't know it could be so slow. Here are some log lines from my program:
$ grep merg ex42.out
2013-03-13 12:47:07,544 - ex42.data - DEBUG - Complete merging 4175889 keys in 0.650000 seconds.
2013-03-13 12:47:24,996 - ex42.data - DEBUG - Complete merging 4135905 keys in 7.890000 seconds.
2013-03-13 13:13:33,577 - ex42.data - DEBUG - Complete merging 4159325 keys in 21.560000 seconds.
2013-03-13 13:13:40,822 - ex42.data - DEBUG - Complete merging 4140346 keys in 23.070000 seconds.
2013-03-13 13:14:04,972 - ex42.data - DEBUG - Complete merging 4187157 keys in 35.340000 seconds.
2013-03-13 13:14:18,744 - ex42.data - DEBUG - Complete merging 4205433 keys in 31.900000 seconds.
2013-03-13 13:14:34,457 - ex42.data - DEBUG - Complete merging 4255486 keys in 35.940000 seconds.
2013-03-13 13:14:51,988 - ex42.data - DEBUG - Complete merging 4220057 keys in 39.950000 seconds.
2013-03-13 13:15:15,714 - ex42.data - DEBUG - Complete merging 4215430 keys in 45.280000 seconds.
2013-03-13 13:15:32,742 - ex42.data - DEBUG - Complete merging 4232054 keys in 47.470000 seconds.
2013-03-13 13:51:28,386 - ex42.data - DEBUG - Complete merging 4244061 keys in 2187.990000 seconds.
2013-03-13 13:51:46,548 - ex42.data - DEBUG - Complete merging 4306790 keys in 2195.190000 seconds.
I'm still waiting for the next log line to appear...
I think the number of distinct items may be on the order of 100 million, but I have 192 GB of RAM, so it should fit in memory. Is there any library that can help me in this case? I know about trees, heaps, and so on, but implementing them myself is not only tiring but also error-prone. So I'd prefer to reuse existing code if there is any.
Any suggestions would be appreciated!
UPDATE: I'm counting co-occurrences in a (bunch of) text corpora. Basically, I count pairs of words that occur next to each other in some GBs of text. There is also some filtering to reduce the number of items, but you can imagine it's still going to be quite big. Each item is a tuple (word, context). I used the default hash function.
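To be concrete, the counting step looks roughly like this stripped-down sketch. The tokenization, the window parameter, and the function name are my simplifications for illustration; my real code also applies filtering before counting:

```python
from collections import Counter

def count_cooccurrences(tokens, window=1):
    """Count (word, context) pairs for words within `window` positions of each other."""
    counter = Counter()
    for i, word in enumerate(tokens):
        # contexts are the neighbors to the left and right of position i
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for context in neighbors:
            counter[(word, context)] += 1
    return counter
```

With a window of 1 this counts adjacent pairs in both directions, so each bigram contributes two (word, context) items.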
UPDATE 2: SQLite is not available. I tried it before and would love to use it, but I don't have enough permission to install/fix it.
$ /opt/python/bin/python3
Python 3.2.1 (default, Nov 27 2012, 05:59:14)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/lib/python3.2/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/opt/python/lib/python3.2/sqlite3/dbapi2.py", line 26, in <module>
    from _sqlite3 import *
ImportError: No module named _sqlite3
>>> quit()
$ ~/python2.7
Python 2.7.3 (default, Nov 27 2012, 05:53:53)
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/lib/python2.7/sqlite3/__init__.py", line 24, in <module>
    from dbapi2 import *
  File "/opt/python/lib/python2.7/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ImportError: No module named _sqlite3
>>> quit()
UPDATE 3: the code. I found a mistake that made the reported times larger than they really were. I'll update the log lines above as soon as I have new ones.
from collections import Counter
from logging import DEBUG
from multiprocessing import Pool
from threading import Thread
import time

def _count_coocurrences(coocurrences_callback, corpus_paths, words, contexts,
                        multiprocessing=False, suspend_exceptions=True):
    counter = Counter()
    try:
        # find co-occurrences
        counter[c] += 1
    except Exception as e:
        # prevent process from hanging
        return counter

class AsynchronousCounter:
    '''
    A counter that provides an asynchronous update function
    '''
    def update_locked(self, counter):
        start = time.clock()  # I made a mistake at this line
        if self.lock.acquire():
            try:
                self.counter.update(counter)
                elapsed = time.clock() - start
                if _log.isEnabledFor(DEBUG):
                    _log.debug("Completed merging %d keys in %f seconds."
                               % (len(counter), elapsed))
            finally:
                self.lock.release()

    def update_async(self, counter):
        thread = Thread(target=self.update_locked, args=(counter,))
        thread.start()
        self.threads.append(thread)

    def wait(self):
        for thread in self.threads:
            thread.join()

    # some other functions

counter = AsynchronousCounter()
pool = Pool(max_processes)
for path in corpus_paths:
    pool.apply_async(_count_coocurrences,
                     args=(coocurrences_callback, path, words, contexts, True),
                     callback=counter.update_async)
pool.close()
pool.join()
counter.wait()
UPDATE 4: new statistics
I fixed the bug found in the previous update by moving start = time.clock() into the synchronized block, immediately before the self.counter.update() call. Here are the newest results:
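For clarity, the fixed update_locked now looks roughly like this sketch. I use time.perf_counter() here instead of time.clock() so it runs on newer Python versions, print instead of the logger, and a minimal __init__; those substitutions are mine, not the actual code:

```python
import time
from collections import Counter
from threading import Lock

class AsynchronousCounter:
    def __init__(self):
        self.counter = Counter()
        self.lock = Lock()

    def update_locked(self, counter):
        with self.lock:
            # start the clock inside the lock, right before the merge,
            # so time spent waiting for the lock is no longer counted
            start = time.perf_counter()
            self.counter.update(counter)
            elapsed = time.perf_counter() - start
            print("Completed merging %d keys in %f seconds." % (len(counter), elapsed))
```

With the timer inside the lock, the reported duration measures only the Counter.update() merge itself, which is what the log lines below now show.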
2013-03-14 12:30:54,888 - ex42.data - DEBUG - Completed merging 4140346 keys in 0.770000 seconds.
2013-03-14 12:56:47,205 - ex42.data - DEBUG - Completed merging 4135905 keys in 1536.090000 seconds.
2013-03-14 12:57:04,156 - ex42.data - DEBUG - Completed merging 4159325 keys in 18.250000 seconds.
2013-03-14 12:57:34,640 - ex42.data - DEBUG - Completed merging 4175889 keys in 30.760000 seconds.
2013-03-14 14:01:09,155 - ex42.data - DEBUG - Completed merging 4187157 keys in 3811.940000 seconds.
2013-03-14 14:01:51,244 - ex42.data - DEBUG - Completed merging 4220057 keys in 39.260000 seconds.
2013-03-14 14:02:07,782 - ex42.data - DEBUG - Completed merging 4215430 keys in 11.470000 seconds.
2013-03-14 14:02:40,478 - ex42.data - DEBUG - Completed merging 4205433 keys in 25.340000 seconds.
2013-03-14 14:42:48,693 - ex42.data - DEBUG - Completed merging 4232054 keys in 2371.140000 seconds.
2013-03-14 14:43:13,818 - ex42.data - DEBUG - Completed merging 4255486 keys in 12.360000 seconds.
2013-03-14 14:43:28,132 - ex42.data - DEBUG - Completed merging 4244061 keys in 11.990000 seconds.
2013-03-14 14:43:56,665 - ex42.data - DEBUG - Completed merging 4269879 keys in 23.470000 seconds.
2013-03-14 14:44:13,066 - ex42.data - DEBUG - Completed merging 4282191 keys in 11.810000 seconds.
2013-03-14 14:44:24,671 - ex42.data - DEBUG - Completed merging 4306790 keys in 11.320000 seconds.
2013-03-14 15:56:59,668 - ex42.data - DEBUG - Completed merging 4320573 keys in 4352.680000 seconds.
2013-03-14 15:57:09,125 - ex42.data - DEBUG - Completed merging 4130680 keys in 9.300000 seconds.
2013-03-14 15:57:18,628 - ex42.data - DEBUG - Completed merging 4104878 keys in 9.950000 seconds.
2013-03-14 15:57:27,747 - ex42.data - DEBUG - Completed merging 4095587 keys in 9.030000 seconds.
2013-03-14 15:59:29,345 - ex42.data - DEBUG - Completed merging 4088393 keys in 11.290000 seconds.
2013-03-14 17:23:36,209 - ex42.data - DEBUG - Completed merging 4082050 keys in 2374.850000 seconds.
2013-03-14 17:23:55,361 - ex42.data - DEBUG - Completed merging 4062960 keys in 13.840000 seconds.
2013-03-14 17:24:10,038 - ex42.data - DEBUG - Completed merging 4048144 keys in 12.140000 seconds.