19

I guess answering this question requires some insight into how concat is implemented.

Say I have 30 files, 1 GB each, and I can only use up to 32 GB of memory. I load the files into a list of DataFrames called 'list_of_pieces'. This list_of_pieces should be ~30 GB in size, right?

If I do pd.concat(list_of_pieces), does concat allocate another 30 GB (or maybe 10 or 15 GB) on the heap to do its work, or does it run the concatenation 'in place' without allocating new memory?
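
Roughly what I have in mind (the file names are just placeholders):

    import pandas as pd

    # placeholder names for the 30 ~1 GB files
    paths = ["piece_%d.csv" % i for i in range(30)]

    # each file becomes its own DataFrame; together roughly 30 GB in memory
    list_of_pieces = [pd.read_csv(p) for p in paths]

    # the question: how much extra memory does this line allocate?
    df = pd.concat(list_of_pieces)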

Does anyone know?

Thanks!

James Bond
    I don't *think* it's inplace... as an aside, I don't think you actually want to read that much into memory (you're not going to leave much room for actually doing calculations)! I think [HDF5 store](http://pandas.pydata.org/pandas-docs/stable/io.html#hdf5-pytables) is a much better choice for you. – Andy Hayden Jun 07 '13 at 11:51
  • @AndyHayden, I'm afraid I do need that much data in memory; I need to do some interactive analysis on it :-( – James Bond Jun 07 '13 at 12:49

2 Answers

16

The answer is no, this is not an in-place operation; np.concatenate is used under the hood. See here: Concatenate Numpy arrays without copying.
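
A quick way to see this for yourself (a small illustrative check, not a measurement of peak memory usage):

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.random.randn(1000, 4))
    df2 = pd.DataFrame(np.random.randn(1000, 4))

    out = pd.concat([df1, df2])

    # the result does not share its buffer with either input,
    # i.e. the data was copied into newly allocated memory
    print(np.shares_memory(out.values, df1.values))  # False
    print(np.shares_memory(out.values, df2.values))  # False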

A better approach is to write each of these pieces to an HDFStore table. See here for the docs: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables, and here for some recipes: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore.
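
Something along these lines (the file names, the store name, and the key "df" are made up; requires PyTables to be installed):

    import pandas as pd

    paths = ["piece_%d.csv" % i for i in range(30)]  # placeholder file names

    with pd.HDFStore("pieces.h5") as store:
        for p in paths:
            piece = pd.read_csv(p)
            # append each piece to a single on-disk table; data_columns=True
            # makes the columns queryable with `where=` later on
            store.append("df", piece, data_columns=True)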

Then you can select whatever portions you need (or even the whole set), by query or even by row number.
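
For example (the column name here is made up):

    with pd.HDFStore("pieces.h5") as store:
        # by query, on columns that were stored as data_columns
        subset = store.select("df", where="value > 0")

        # by row number
        first_rows = store.select("df", start=0, stop=1000000)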

Certain types of operations can even be done while the data is on disk; see: https://github.com/pydata/pandas/issues/3202?source=cc, and here: http://pytables.github.io/usersguide/libref/expr_class.html#
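
For instance, you can stream the table in chunks and reduce as you go, so the full 30 GB never has to be in memory at once (a rough sketch; the sum over a made-up "value" column is just an example):

    total = 0.0
    with pd.HDFStore("pieces.h5") as store:
        # iterate over the on-disk table one million rows at a time
        for chunk in store.select("df", chunksize=1000000):
            total += chunk["value"].sum()

    print(total)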

Jeff
0

Try this:

    dfs = [df1, df2]

    # copy=False avoids an extra copy of blocks where pandas can reuse them
    temp = pd.concat(dfs, copy=False, ignore_index=False)

    # empty df1 in place, then write the concatenated columns back into it
    df1.drop(df1.index[0:], inplace=True)
    df1[temp.columns] = temp
Jeremy Cochoy
    Try adding code formatting for better readability – benicamera Oct 04 '22 at 07:36
  • I've tested your solution with a 1.2 GB table. It's definitely slower: so slow that after waiting for 10 minutes the script was still running (plain pd.concat takes 30 seconds). – vitperov Apr 24 '23 at 12:04