
I want to count MANY items in memory. I used `Counter`, but it got slower and slower as I added more data. Some slowdown is expected because it is essentially a hash table, but I didn't know it could be so slow. These are some log lines from my program:

$ grep merg ex42.out
2013-03-13 12:47:07,544 - ex42.data - DEBUG - Complete merging 4175889 keys in 0.650000 seconds.
2013-03-13 12:47:24,996 - ex42.data - DEBUG - Complete merging 4135905 keys in 7.890000 seconds.
2013-03-13 13:13:33,577 - ex42.data - DEBUG - Complete merging 4159325 keys in 21.560000 seconds.
2013-03-13 13:13:40,822 - ex42.data - DEBUG - Complete merging 4140346 keys in 23.070000 seconds.
2013-03-13 13:14:04,972 - ex42.data - DEBUG - Complete merging 4187157 keys in 35.340000 seconds.
2013-03-13 13:14:18,744 - ex42.data - DEBUG - Complete merging 4205433 keys in 31.900000 seconds.
2013-03-13 13:14:34,457 - ex42.data - DEBUG - Complete merging 4255486 keys in 35.940000 seconds.
2013-03-13 13:14:51,988 - ex42.data - DEBUG - Complete merging 4220057 keys in 39.950000 seconds.
2013-03-13 13:15:15,714 - ex42.data - DEBUG - Complete merging 4215430 keys in 45.280000 seconds.
2013-03-13 13:15:32,742 - ex42.data - DEBUG - Complete merging 4232054 keys in 47.470000 seconds.
2013-03-13 13:51:28,386 - ex42.data - DEBUG - Complete merging 4244061 keys in 2187.990000 seconds.
2013-03-13 13:51:46,548 - ex42.data - DEBUG - Complete merging 4306790 keys in 2195.190000 seconds.

I'm still waiting for the next log line to come out...

I think the number of distinct items may be on the order of 100 million, but I have 192 GB of RAM, so it should fit in memory. Is there any library that can help me in this case? I know about trees, heaps and so on, but implementing them is not only tiring but also error-prone, so I'd prefer to reuse existing code if possible.

Any suggestion will be appreciated!


UPDATE: I'm counting co-occurrences in a (bunch of) text corpora. Basically, I count words that occur next to each other in some GBs of text. There is also some filtering to reduce the number of items, but you can imagine it's still going to be quite big. Each item is a tuple (word, context). I use the default hash function.
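
To make that concrete, the counting loop boils down to something like this (a sketch only; `corpus_lines` and `extract_pairs` are placeholders for the real corpus iteration and filtering, not names from my actual code):

from collections import Counter

def count_pairs(corpus_lines, extract_pairs):
    # extract_pairs(line) is a placeholder that yields (word, context)
    # tuples after filtering; each tuple is one item to be counted
    counter = Counter()
    for line in corpus_lines:
        for pair in extract_pairs(line):
            counter[pair] += 1  # plain tuple of strings, default hash
    return counter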


UPDATE 2: SQLite is not available. I tried it before and would love to use it, but I don't have the permissions to install or fix it.

$ /opt/python/bin/python3
Python 3.2.1 (default, Nov 27 2012, 05:59:14) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/lib/python3.2/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/opt/python/lib/python3.2/sqlite3/dbapi2.py", line 26, in <module>
    from _sqlite3 import *
ImportError: No module named _sqlite3
>>> quit()

$ ~/python2.7 
Python 2.7.3 (default, Nov 27 2012, 05:53:53) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/lib/python2.7/sqlite3/__init__.py", line 24, in <module>
    from dbapi2 import *
  File "/opt/python/lib/python2.7/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ImportError: No module named _sqlite3
>>> quit()

UPDATE 3: the code. Having a close look at it, I found a mistake that made the reported times larger than they really were. I'll update the log lines above as soon as I have new ones.

from collections import Counter

def _count_coocurrences(coocurrences_callback, corpus_paths, words, contexts,
                        multiprocessing=False, suspend_exceptions=True):
    counter = Counter()
    try:
        # find co-occurrences; the actual iteration is elided here, but
        # every co-occurrence c that is found gets counted the same way:
        for c in cooccurrences:  # 'cooccurrences' stands in for the elided loop
            counter[c] += 1
    except Exception as e:
        # swallow the exception so the worker process does not hang
        pass
    return counter


class AsynchronousCounter:
    '''
    A counter that provides an asynchronous update function
    '''

    def update_locked(self, counter):
        start = time.clock() # I made a mistake at this line: timing starts before the lock is acquired (fixed in UPDATE 4)
        if self.lock.acquire():
            try:
                self.counter.update(counter)
                elapsed = time.clock() - start
                if _log.isEnabledFor(DEBUG):
                    _log.debug("Completed merging %d keys in %f seconds."
                                %(len(counter), elapsed))
            finally:
                self.lock.release()

    def update_async(self, counter):
        thread = Thread(target=self.update_locked, args=(counter,))
        thread.start()
        self.threads.append(thread)

    def wait(self):
        for thread in self.threads:
            thread.join()


    # some other functions


counter = AsynchronousCounter()
pool = Pool(max_processes)
for path in corpus_paths:
    pool.apply_async(_count_coocurrences, 
                     args=(coocurrences_callback, path, words, contexts, True),
                     callback=counter.update_async)
pool.close()
pool.join()
counter.wait()

UPDATE 4: new statistics

I fixed the bug found in the previous update by moving `start = time.clock()` into the synchronization block, immediately before the `self.counter.update()` call.
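
A sketch of the fix (the same `update_locked` as above, except the timer now starts only after the lock is held):

    def update_locked(self, counter):
        if self.lock.acquire():
            try:
                start = time.clock()  # timer starts inside the critical section
                self.counter.update(counter)
                elapsed = time.clock() - start
                if _log.isEnabledFor(DEBUG):
                    _log.debug("Completed merging %d keys in %f seconds."
                                %(len(counter), elapsed))
            finally:
                self.lock.release()

These are the newest results: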

2013-03-14 12:30:54,888 - ex42.data - DEBUG - Completed merging 4140346 keys in 0.770000 seconds.
2013-03-14 12:56:47,205 - ex42.data - DEBUG - Completed merging 4135905 keys in 1536.090000 seconds.
2013-03-14 12:57:04,156 - ex42.data - DEBUG - Completed merging 4159325 keys in 18.250000 seconds.
2013-03-14 12:57:34,640 - ex42.data - DEBUG - Completed merging 4175889 keys in 30.760000 seconds.
2013-03-14 14:01:09,155 - ex42.data - DEBUG - Completed merging 4187157 keys in 3811.940000 seconds.
2013-03-14 14:01:51,244 - ex42.data - DEBUG - Completed merging 4220057 keys in 39.260000 seconds.
2013-03-14 14:02:07,782 - ex42.data - DEBUG - Completed merging 4215430 keys in 11.470000 seconds.
2013-03-14 14:02:40,478 - ex42.data - DEBUG - Completed merging 4205433 keys in 25.340000 seconds.
2013-03-14 14:42:48,693 - ex42.data - DEBUG - Completed merging 4232054 keys in 2371.140000 seconds.
2013-03-14 14:43:13,818 - ex42.data - DEBUG - Completed merging 4255486 keys in 12.360000 seconds.
2013-03-14 14:43:28,132 - ex42.data - DEBUG - Completed merging 4244061 keys in 11.990000 seconds.
2013-03-14 14:43:56,665 - ex42.data - DEBUG - Completed merging 4269879 keys in 23.470000 seconds.
2013-03-14 14:44:13,066 - ex42.data - DEBUG - Completed merging 4282191 keys in 11.810000 seconds.
2013-03-14 14:44:24,671 - ex42.data - DEBUG - Completed merging 4306790 keys in 11.320000 seconds.
2013-03-14 15:56:59,668 - ex42.data - DEBUG - Completed merging 4320573 keys in 4352.680000 seconds.
2013-03-14 15:57:09,125 - ex42.data - DEBUG - Completed merging 4130680 keys in 9.300000 seconds.
2013-03-14 15:57:18,628 - ex42.data - DEBUG - Completed merging 4104878 keys in 9.950000 seconds.
2013-03-14 15:57:27,747 - ex42.data - DEBUG - Completed merging 4095587 keys in 9.030000 seconds.
2013-03-14 15:59:29,345 - ex42.data - DEBUG - Completed merging 4088393 keys in 11.290000 seconds.
2013-03-14 17:23:36,209 - ex42.data - DEBUG - Completed merging 4082050 keys in 2374.850000 seconds.
2013-03-14 17:23:55,361 - ex42.data - DEBUG - Completed merging 4062960 keys in 13.840000 seconds.
2013-03-14 17:24:10,038 - ex42.data - DEBUG - Completed merging 4048144 keys in 12.140000 seconds.
minhle_r7
  • What are your items in memory? How are they generated? Can you save to a database? – YXD Mar 13 '13 at 12:56
  • No, it's not expected with a hashtable. With a hashtable you have O(1) access. You must be doing something wrong. – vartec Mar 13 '13 at 12:59
  • @MrE Are you joking? He is talking about in-memory operations, but you suggest dumping all of it to disk. – Denis Mar 13 '13 at 13:00
  • My first guess is that you are adding the entire logfile on each iteration. – wildplasser Mar 13 '13 at 13:00
  • http://stackoverflow.com/questions/2442014/tree-libraries-in-python http://docs.python.org/2/library/heapq.html This should help – GodMan Mar 13 '13 at 12:57
  • Is that a 64bit binary of CPython? Otherwise you would never be able to use all the RAM. Why do you use a hashtable? Is your hash function good, i.e. do you have few/any collisions? – Albert Mar 13 '13 at 12:59
  • Show your code; it's really strange, 4 million records is nothing for a modern server. – Denis Mar 13 '13 at 13:01
  • Can you monitor the memory usage of your program? Are you sure the hash table fits in memory? – Davide R. Mar 13 '13 at 13:26
  • @vartec: I think when there are too many collisions, the nice O(1) property won't hold – minhle_r7 Mar 13 '13 at 13:26
  • @MrE: no I can't, that machine doesn't even have SQLite and I don't have permission to install anything – minhle_r7 Mar 13 '13 at 13:27
  • @Albert: I think it's 64-bit CPython. Is there any standard way to write a hash function for my case (pairs of text, many keys)? I've already spent too much time on trial and error :-( – minhle_r7 Mar 13 '13 at 13:30
  • @DavideR.: Say I have 1 billion keys and can use up to 150 GB of RAM; then I have 161 bytes for each key. Still, I'm not really sure, but I can't install and use a database, so there aren't many choices. If it doesn't fit, perhaps I'll consider truncating some low-frequency items. – minhle_r7 Mar 13 '13 at 13:36
  • Yes, if you have too many collisions, O(1) won't hold, which is exactly what I mean by something wrong: if you just have a hashtable of counted elements, you shouldn't have too many collisions. – vartec Mar 13 '13 at 13:59
  • Just FYI, the `sqlite3` library comes as a part of the core Python distribution, and you can run it with a completely in-memory database... – Silas Ray Mar 13 '13 at 14:04
  • @vartec: so... any suggestion? – minhle_r7 Mar 13 '13 at 14:05
  • are you using `collections.Counter`? – vartec Mar 13 '13 at 14:08
  • @vartec: yes, collections.Counter with default hash function – minhle_r7 Mar 13 '13 at 14:14
  • @sr2222: Well... I must have done something wrong. The last time I typed "import sqlite3" into the Python console, it threw an error, but it works now. Maybe I typed it wrong :-( Thanks for the reminder. – minhle_r7 Mar 13 '13 at 14:14
  • Dear @sr2222, I figured out that it works for Python 2.6 but not for Python 2.7 or Python 3.2 on the server. It's not something I can change... I can only use what they provide. – minhle_r7 Mar 13 '13 at 17:14
  • @ngọcminh.oss Please show some code for the `Counter`-based solution. I've seen it grow very slow in a previous program and it might be that you've made the same mistake that I made back then. – Fred Foo Mar 13 '13 at 17:14
  • Strange, I just tried sqlite3 in a virtualenv with vanilla 2.7 and it worked fine. – Silas Ray Mar 13 '13 at 17:20
  • @sr2222: That module requires the SQLite headers to be installed. If you don't have those, Python will be built without `_sqlite3`. – Fred Foo Mar 13 '13 at 17:25
  • @larsmans Ahh, that explains it. Thanks. – Silas Ray Mar 13 '13 at 17:28
  • Having a close look at the code, I found a mistake that made the reported running times look much more bloated than they really were. I'll run the script again. – minhle_r7 Mar 13 '13 at 17:32
  • You could try to add some code to count hash collisions, just to check. Normally, the Python hash for strings and pairs should be good enough, but maybe it is not in your case. – Albert Mar 15 '13 at 08:35
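
Following up on that last comment, a minimal sketch of such a collision check (note the names here are illustrative, and it builds a second table keyed by `hash(key)`, so it costs extra memory):

from collections import defaultdict

def count_hash_collisions(keys):
    # count how many distinct keys share a hash() value with another key;
    # the keys are assumed to be distinct, e.g. counter.keys()
    seen = defaultdict(int)
    for key in keys:
        seen[hash(key)] += 1
    return sum(n - 1 for n in seen.values() if n > 1)

# usage (hypothetical): count_hash_collisions(counter.keys())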

0 Answers