Problem:
I'm trying to two relatively small datasets together, but the merge raises a MemoryError
. I have two datasets of aggregates of country trade data, that I'm trying to merge on the keys year and country, so the data needs to be particularity placed. This unfortunately makes the use of concat
and its performance benefits impossible as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!