
I have a DataFrame that I initialize outside the scope of a local (nested) method. I would like to do something like the following:

def outer_method():
    ... do outer scope stuff here
    df = pd.DataFrame(columns=['A','B','C','D'])
    def recursive_method(arg):
        ... do local stuff here
        # func returns a data frame to be appended to empty data frame
        results_df = func(args)
        df.append(results_df, ignore_index=True)
        return results
    recursive_method(arg)
    return df

However, this does NOT work. The df is always empty if I append to it this way.

I found the answer to my problem here: appending-to-an-empty-data-frame-in-pandas... This works IF the empty DataFrame object is in scope of the method, but not in my case. As per @DSM's comment there, "but the append doesn't happen in-place, so you'll have to store the output if you want it:"

IOW, I would need to have something like:

df = df.append(results_df, ignore_index=True)

in my local method, but this doesn't help me get access to my outer scope variable df to append to it.

Is there a way to make this happen in place? This works fine with Python's list.extend method for extending the contents of a list object (I realize DataFrames are not lists, but...). Is there an analogous way to do this with a DataFrame object without having to deal with my scoping issues for df?
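
For example, the in-place list behaviour I am referring to looks roughly like this (a minimal sketch, not my actual code):

def outer():
    items = []
    def inner():
        # list.extend mutates the existing list object, so no rebinding is needed
        items.extend([1, 2, 3])
    inner()
    return items

print(outer())  # [1, 2, 3] -- the outer-scope list was modified in place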

By the way, the pandas concat method also works, but I run into the same variable-scope issue.


1 Answer


In Python 3, you could use the nonlocal keyword:

def outer_method():
    ... do outer scope stuff here
    df = pd.DataFrame(columns=['A','B','C','D'])
    def recursive_method(arg):
        nonlocal df
        ... do local stuff here
        # func returns a data frame to be appended to empty data frame
        results_df = func(args)
        df = df.append(results_df, ignore_index=True)
        return results
    recursive_method(arg)
    return df

But note that calling df.append returns a new DataFrame each time and thus requires copying all the old data into the new DataFrame. If you do this inside a loop N times, you end up making on the order of 1+2+3+...+N = O(N^2) copies -- very bad for performance.
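
As a quick illustration (assuming a pandas version old enough to still have DataFrame.append -- it was removed in pandas 2.0), each call hands back a brand-new frame and leaves the original untouched:

import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
new = df.append({'A': 1, 'B': 2, 'C': 3, 'D': 4}, ignore_index=True)

print(new is df)   # False: append allocated and filled a brand-new DataFrame
print(len(df))     # 0 -- the original frame never grew
print(len(new))    # 1 -- only the returned frame holds the row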


If you do not need df inside recursive_method for any purpose other than appending, it is better to append to a list, and then construct the DataFrame (by calling pd.concat once) after recursive_method is done:

df = pd.DataFrame(columns=['A','B','C','D'])
data = [df]
def recursive_method(arg, data):
    ... do stuff here
    # func returns a data frame to be appended to empty data frame
    results_df = func(args)
    data.append(results_df)
    return results
recursive_method(arg, data)
df = pd.concat(data, ignore_index=True)

This is the best solution if all you need to do is collect data inside recursive_method and can wait to construct the new df after recursive_method is done.
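
For example, here is a self-contained sketch of this pattern; make_row_frame and the arg == 0 stopping condition are made up here to stand in for your func and your real recursion:

import pandas as pd

def make_row_frame(n):
    # stand-in for `func`: returns a one-row DataFrame
    return pd.DataFrame([[n, n + 1, n + 2, n + 3]], columns=['A', 'B', 'C', 'D'])

def recursive_method(arg, data):
    if arg == 0:          # made-up base case for the sketch
        return
    data.append(make_row_frame(arg))
    recursive_method(arg - 1, data)

data = []
recursive_method(3, data)
df = pd.concat(data, ignore_index=True)  # one concat at the end instead of N appends
print(df)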


In Python 2, if you must use df inside recursive_method, then you could pass df as an argument to recursive_method and return it too:

df = pd.DataFrame(columns=['A','B','C','D'])
def recursive_method(arg, df):
    ... do stuff here
    results, df = recursive_method(arg, df)
    # func returns a data frame to be appended to empty data frame
    results_df = func(args)
    df = df.append(results_df, ignore_index=True)
    return results, df
results, df = recursive_method(arg, df)

but be aware that you will be paying a heavy price doing the O(N^2) copying mentioned above.


Why DataFrames should not be appended to in-place:

The underlying data in a DataFrame is stored in NumPy arrays. The data in a NumPy array comes from a contiguous block of memory. Sometimes there is not enough space to resize a NumPy array to a larger contiguous block of memory even if memory is available -- imagine the array being sandwiched in between other data structures. In that case, in order to resize the array, a new larger block of memory has to be allocated somewhere else and all the data from the original array has to be copied to the new block. In general, it can't be done in-place.
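
A small illustration with NumPy directly -- "growing" an array really builds a new one somewhere else:

import numpy as np

arr = np.arange(5)
bigger = np.append(arr, [5, 6, 7])       # allocates a new block and copies everything

print(np.may_share_memory(arr, bigger))  # False: the "grown" array lives in new memory
print(arr)                               # [0 1 2 3 4] -- the original is unchanged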

DataFrames do have a private method, _update_inplace, which could be used to redirect a DataFrame's underlying data to new data. This is only a pseudo-inplace operation, since the new data (think NumPy arrays) has to be allocated (with all the attendant copying) first. So using _update_inplace has two strikes against it: it uses a private method which (in theory) may not be around in future versions of Pandas, and it incurs the O(N^2) copying penalty.

In [231]: df = pd.DataFrame([[0,1,2]])

In [232]: df
Out[232]: 
   0  1  2
0  0  1  2

In [233]: df._update_inplace(df.append([[3,4,5]]))

In [234]: df
Out[234]: 
   0  1  2
0  0  1  2
0  3  4  5
  • Thanks for the explanation. It makes sense. I would definitely prefer to not pass `df` as a argument to my recursive method (or use the **nonlocal keyword**) for that exact reason. Also, I **HAD** been using a list, but I was switching back and forth between lists and data frames, which was costly in performance, so since I was using data frames to do set operations, I thought I would circumvent going back and forth between these two object types until the end, when I returned the final results via JSON. But, I think your suggestion to use a list for `concat` is a good compromise. – horcle_buzz Feb 18 '16 at 23:24