I've spent the day trying to debug a memory problem in my Python script. I'm using SQLAlchemy as my ORM. There are several confounding issues here, and I'm hoping that if I list them all out, somebody will be able to point me in the right direction.
In order to achieve the performance I'm looking for, I read in all the records from one table (~400k), then loop through a spreadsheet, match each row against the records I've previously read in, and create new records (~800k) in another table. Here's roughly what the code looks like:
dimensionMap = {}
for d in connection.session.query(Dimension):
    dimensionMap[d.businessKey] = d.primarySyntheticKey
# len(dimensionMap) == ~400k, sys.getsizeof(dimensionMap) == ~4MB

allfacts = []
sheet = open_spreadsheet(path)
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]

    fact = Fact(dimensionId, metric)
    connection.session.add(fact)
    allfacts.append(fact)

    if row.number % 20000 == 0:
        connection.session.flush()
# len(allfacts) == ~800k, sys.getsizeof(allfacts) == ~50MB

connection.session.commit()
sys.stdout.write('All Done')
400k and 800k don't seem like especially big numbers to me, but I'm nonetheless running into memory problems on a machine with 4GB of memory. This is really strange to me, as I ran sys.getsizeof on my two biggest collections, and they were both well under any size that would cause problems.
While trying to figure this out, I noticed that the script was running really, really slowly. So I ran a profile on it, hoping the results would lead me in the direction of the memory problem, and came up with two confounding issues.
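(For reference, a profile like this can be collected with the standard library's cProfile and pstats modules. The snippet below is just a sketch, with load_facts standing in for the loading code shown above:)

import cProfile
import pstats

def load_facts():
    # the dimension/fact loading code shown above
    pass

# run the load under the profiler and dump the stats to a file
cProfile.run('load_facts()', 'load_facts.prof')

# print the 20 entries with the highest cumulative time
stats = pstats.Stats('load_facts.prof')
stats.sort_stats('cumulative').print_stats(20)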
First, 87% of the program time is spent in the commit, specifically on this line of code:
self.transaction._new[state] = True
This can be found in session.py:1367. self.transaction._new is an instance of weakref.WeakKeyDictionary(). Why is weakref:261:__setitem__ taking up so much time?
Second, even after the program is done ('All Done' has been printed to stdout), the process keeps running, seemingly forever, still holding 2.2GB of memory.
I've done some searching on weakrefs, but haven't seen anybody mention the performance issues I'm facing. Ultimately, there isn't a whole lot I can do about this, given it's buried deep in SQLAlchemy, but I'd still appreciate any ideas.
Key Learnings
As mentioned by @zzzeek, there's a lot of overhead required to maintain persistent objects. Here's a little graph to show the growth.
The trendline suggests that each persistent instance takes about 2KB of memory overhead, even though the instance itself is only 30 bytes. This brings me to another thing I learned, which is to take sys.getsizeof with a huge grain of salt. This function only returns the shallow size of an object, and doesn't take into account any other objects that need to be there for the first object to make sense (__dict__, for example). You really need to use something like Heapy to get a good understanding of the actual memory footprint of an instance.
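To make that concrete, here's a crude illustration of the difference between the shallow size and a recursive ("deep") size. deep_getsizeof below is only a rough hand-rolled approximation of what Heapy reports, and the Fact class here is a plain-Python stand-in rather than the real mapped class:

import sys

class Fact(object):
    def __init__(self, dimensionId, metric):
        self.dimensionId = dimensionId
        self.metric = metric

def deep_getsizeof(obj, seen=None):
    # Crude recursive size: follows __dict__ and common containers.
    # Only an approximation; Heapy does this properly.
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if hasattr(obj, '__dict__'):
        size += deep_getsizeof(obj.__dict__, seen)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(i, seen) for i in obj)
    return size

f = Fact(12345, 67.89)
print(sys.getsizeof(f))    # shallow size only: a few dozen bytes
print(deep_getsizeof(f))   # includes __dict__, the attribute names and values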
The last thing I learned is that, when Python is on the verge of running out of memory and is thrashing like crazy, weird stuff happens that shouldn't be treated as part of the problem itself. In my case, the massive slow-down, the profile pointing at the weakref creation, and the hang-up after the program completed were all effects of the memory issue. Once I stopped creating and keeping around persistent instances, and instead just kept the object properties I actually needed, all the other issues went away.
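For example, the loading loop can be rewritten to keep only plain values and hand them to the database as a Core bulk insert instead of session.add()-ing mapped instances. A rough sketch, where engine, fact_table, and the column names are stand-ins for the real engine and schema:

# Keep only the properties, not persistent instances: build plain dicts
# and bulk-insert them through the Core instead of session.add().
batch = []
with engine.begin() as conn:          # engine: the SQLAlchemy Engine behind the session
    for row in sheet.allrows():
        batch.append({
            'dimension_id': dimensionMap[row[0]],   # plain values only, no ORM objects
            'metric': row[1],
        })
        if len(batch) >= 20000:
            conn.execute(fact_table.insert(), batch)  # executemany-style insert
            batch = []
    if batch:
        conn.execute(fact_table.insert(), batch)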