Here are a few quick performance improvements I managed to get:
Using a plain dict instead of defaultdict, and changing d[parts[0]].append(parts[1]) to d[parts[0]] = d.get(parts[0], []) + [parts[1]], cut the time by 10%. I don't know whether it's eliminating all those __missing__ calls, not mutating the lists in place, or something else that deserves the credit.
Just using setdefault on a plain dict instead of defaultdict also cuts the time by 8%, which implies that it's the defaultdict overhead rather than the in-place appends that deserves the blame.
Meanwhile, replacing the split() with split(None, 1) helps by 9%.
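In case the difference isn't obvious: split(None, 1) stops after the first split, so it does less work per line and leaves everything after the key in one piece. A quick illustration with a made-up line:

>>> line = 'key1 value two\n'
>>> line.split()
['key1', 'value', 'two']
>>> line.split(None, 1)
['key1', 'value two\n']

Note that the second form keeps the trailing newline attached to the value; whether that matters depends on what you do with the data.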
Running in PyPy 1.9.0 instead of CPython 2.7.2 cut the time by 52%; PyPy 2.0b by 55%.
If you can't use PyPy, CPython 3.3.0 cut the time by 9%.
Running in 32-bit mode instead of 64-bit increased the time by 170%, which implies that if you're currently running 32-bit, you may want to switch to 64-bit.
The fact that the dict takes over 2GB of memory (slightly less in 32-bit) is probably a big part of the problem. The only real alternative is to store everything on disk. (In a real-life app you'd probably want to manage an in-memory cache, but here you're just generating the data and quitting, which makes things simpler.) Whether this helps depends on a number of factors: I suspect that on a system with an SSD and not much RAM it'll speed things up, while on a system with a 5400rpm hard drive and 16GB of RAM (like the laptop I'm using at the moment) it won't… but depending on your system's disk cache, etc., there's no way to know without testing.
There's no quick&dirty way to store lists of strings in disk-based storage (shelve will presumably waste more time on pickling and unpickling than it saves), but changing the code to concatenate the values into a single string instead and using gdbm kept the memory usage below 200MB and finished in about the same time, with the nice side effect that if you want to use the data more than once, you've got it stored persistently. Unfortunately, plain old dbm wouldn't work, because the default page size is too small for this many entries, and the Python interface doesn't provide any way to override the default.
Switching to a simple sqlite3 database that just has non-unique Key and Value columns took about 80% longer in :memory: and 85% longer on disk. I suspect that denormalizing things to store multiple values with each key wouldn't help, and would in fact make things worse. (Still, for many real-life uses, this may be a better solution.)
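For reference, the sqlite3 variant looks roughly like this. This is a sketch rather than the exact code I timed: the table name and the input filename are made up, and you'd swap ':memory:' for a path to get the on-disk version.

import sqlite3

db = sqlite3.connect(':memory:')  # or a filename for the on-disk test
db.execute('CREATE TABLE kv (Key TEXT, Value TEXT)')  # non-unique columns, no index

with open('input.txt') as fin:  # made-up filename
    db.executemany('INSERT INTO kv VALUES (?, ?)',
                   (line.split(None, 1) for line in fin))
db.commit()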
Meanwhile, I wrapped cProfile around your main loop to see where the time actually goes.
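Here's roughly how, in case you want to reproduce it. This is a sketch: the filename is made up, I'm assuming the loop is the defaultdict version discussed above, and handing the loop to cProfile as a string statement is what produces the <string>:2(<module>) line in the listing below.

import cProfile
from collections import defaultdict

d = defaultdict(list)
fin = open('input.txt')  # made-up filename; point this at the real input

# Profile just the main loop by handing it to cProfile as a statement.
cProfile.run('''
for line in fin:
    parts = line.split()
    d[parts[0]].append(parts[1])
''')

The output: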
         40000002 function calls in 31.222 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   21.770   21.770   31.222   31.222 <string>:2(<module>)
 20000000    2.373    0.000    2.373    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 20000000    7.079    0.000    7.079    0.000 {method 'split' of 'str' objects}
So that's a bit under a quarter of your time spent in str.split, under 10% spent in append, and the rest (roughly 70%) spent in code that cProfile couldn't see, which here includes both iterating the file and the defaultdict method calls.
Switching to a regular dict with setdefault (which, remember, was a little faster) shows 3.774 seconds spent in setdefault, so that's about 15% of the time, or presumably around 20% for the defaultdict version. And the __setitem__ call in the d.get version presumably isn't going to be any worse than setdefault or defaultdict.__getitem__ were.
However, we may not be seeing the time charged to all the malloc calls here, and they may account for a huge chunk of the total cost. To test that, you'd need a C-level profiler, so let's come back to that.
Meanwhile, at least some of the leftover time is probably taken up by the line-splitting as well, since that must cost on the same order as space-splitting, right? But I don't know of any way to improve that significantly.
Finally, a C-level profiler is going to help here, but one run on my system may not help much for your system, so I'll leave that to you.
The fastest version on my system depends on which Python I run, but it's either this:
d = {}
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], []) + [parts[1]]
Or this:
d = {}
for line in fin:
    parts = line.split(None, 1)
    d.setdefault(parts[0], []).append(parts[1])
… And they're both pretty close to each other.
The gdbm solution, which was about the same speed and has obvious advantages and disadvantages, looks like this:
import sys
import gdbm

d = gdbm.open(sys.argv[1] + '.db', 'c')
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], '') + ',' + parts[1]
(Obviously, if you want to be able to run this repeatedly, you will need to add a line to delete any pre-existing database, or, better, if it fits your use case, to check its timestamp against the input file's and skip the whole loop if it's already up to date.)
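A minimal sketch of that check, assuming the same sys.argv[1] convention as above:

import os
import sys

src = sys.argv[1]
dbpath = sys.argv[1] + '.db'

# Skip the whole rebuild if the database already exists and is newer than the input.
if os.path.exists(dbpath) and os.path.getmtime(dbpath) >= os.path.getmtime(src):
    sys.exit(0)

# Otherwise start fresh so entries from a previous run don't accumulate.
if os.path.exists(dbpath):
    os.remove(dbpath)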