I have a bunch of dataframes that I am trying to slice and assign back to the original names. But I am finding that there is a namespace issue. Below is what I have.

import pandas as pd
import numpy as np

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))

mylist =[df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head)

truncate_before(mylist, 11)
print(df_a)

In the print statements within the truncate_before function, it shows 3 rows, corresponding to the 11th, 12th and 13th row. But the final print statement shows 0th to 13th rows.

So outside the function, it reverts back to the original dataframes. I was under the impression that Python passes arguments by reference. What am I missing?

Spinor8

1 Answer

In truncate_before:

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head)

the for-loop creates a local variable dfts which is bound, in turn, to each DataFrame in list_of_dfts. But

        dfts = dfts[idx:]

rebinds the name dfts to a new, sliced DataFrame. It changes neither the DataFrame that dfts referred to before nor the contents of list_of_dfts.

See Facts and myths about Python names and values for a great explanation of how variable names bind to values, and what operations change values versus binding new values to variable names.
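
As a minimal sketch of that distinction, using plain lists instead of DataFrames (the function and variable names here are made up for illustration), compare rebinding the loop variable with mutating the object it refers to:

```python
def rebind_only(items, idx):
    # Rebinding the loop variable leaves the list and its elements untouched.
    for item in items:
        item = item[idx:]  # 'item' now names a new list; nothing else changes


def mutate_elements(items, idx):
    # Mutating each element in place is visible through every name bound to it.
    for item in items:
        del item[:idx]


a = [0, 1, 2, 3]
b = [10, 11, 12, 13]
both = [a, b]

rebind_only(both, 2)
after_rebind = [list(x) for x in both]  # snapshot: still the full lists

mutate_elements(both, 2)
after_mutate = [list(x) for x in both]  # snapshot: first two items removed

print(after_rebind)  # [[0, 1, 2, 3], [10, 11, 12, 13]]
print(after_mutate)  # [[2, 3], [12, 13]]
print(a)             # [2, 3]
```

The first function corresponds to the loop in the question: only the local name changes, so the caller sees no difference.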

Here are a number of alternatives:

Modify the list inplace

def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]
    for dfts in list_of_dfts:
        print(dfts.head())

This works because assigning to list_of_dfts[:] (which calls list_of_dfts.__setitem__) changes the contents of list_of_dfts in place.


import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))

mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]

print(mylist[0])
truncate_before(mylist, 11)
print(mylist[0])

The second print shows that mylist[0] has been truncated. Note, however, that df_a still references the original DataFrame.
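
To see why the slice assignment matters, here is a small sketch with a list of numbers (the helper names are hypothetical). Assigning to the parameter name only rebinds a local variable, while assigning to the slice replaces the contents of the caller's list object:

```python
def rebind_parameter(items):
    # Plain assignment: 'items' becomes a new local list; the caller's list is unchanged.
    items = [x * 2 for x in items]


def replace_contents(items):
    # Slice assignment: the same list object receives new contents.
    items[:] = [x * 2 for x in items]


nums = [1, 2, 3]
original_id = id(nums)

rebind_parameter(nums)
after_rebind = list(nums)   # snapshot after the plain assignment

replace_contents(nums)
after_replace = list(nums)  # snapshot after the slice assignment

print(after_rebind)             # [1, 2, 3]
print(after_replace)            # [2, 4, 6]
print(id(nums) == original_id)  # True
```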


Return the list and reassign mylist or df_a, df_b to the result

Using return values may make it unnecessary to modify mylist in-place.

To rebind the global variables df_a and df_b to new values, you could make truncate_before return the list of DataFrames and assign the result back to df_a and df_b:

def truncate_before(list_of_dfts, idx):
    return [dfts[idx:] for dfts in list_of_dfts]

mylist = truncate_before(mylist, 11)   # or
# df_a, df_b = truncate_before(mylist, 11) # or
# mylist = df_a, df_b = truncate_before(mylist, 11)  

But note that it is probably not good to access the DataFrames through both mylist and df_a and df_b, since as the example above shows, the values do not stay coordinated automagically. Using mylist should suffice.
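
If the DataFrames need names, one possible way to keep a single point of access is a dict keyed by name instead of separate variables. This is only a sketch; the names frames and truncate_all are invented for illustration, and it uses positional slicing with iloc:

```python
import numpy as np
import pandas as pd

# Hypothetical container: one dict instead of df_a, df_b plus mylist.
frames = {
    "a": pd.DataFrame(np.random.rand(14, 2), columns=list("XY")),
    "b": pd.DataFrame(np.random.rand(14, 2), columns=list("XY")),
}


def truncate_all(frames_by_name, idx):
    # Return a new dict whose values keep the rows from position idx onward.
    return {name: df.iloc[idx:] for name, df in frames_by_name.items()}


frames = truncate_all(frames, 11)
print(frames["a"])  # three rows, index labels 11, 12, 13
```

Since every access goes through frames, there is no second set of names that can fall out of sync.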


Use a DataFrame method with the inplace parameter, such as df.drop

dfts.drop (with inplace=True) modifies dfts itself:

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))
df_b = pd.DataFrame(np.random.rand(14,2), columns = list('XY'))

mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts.drop(dfts.index[:idx], inplace=True)

truncate_before(mylist, 11)
print(mylist[0])
print(df_a)

Because each DataFrame is modified in place, the change is visible through mylist as well as through df_a and df_b.

Note that dfts.drop drops rows based on index label value, so the above assumes that dfts.index is unique. If dfts.index is not unique, dfts.drop may drop more than idx rows. For example,

df = pd.DataFrame([1,2], index=['A', 'A'])
df.drop(['A'], inplace=True)

drops both rows, making df an empty DataFrame.
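
One way to guard against this, sketched here as an assumption rather than a recommendation from pandas itself, is to check dfts.index.is_unique before dropping and refuse to proceed otherwise:

```python
import pandas as pd


def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        # Label-based drop is only safe when every label identifies one row.
        if not dfts.index.is_unique:
            raise ValueError("index has duplicate labels; drop would remove extra rows")
        dfts.drop(dfts.index[:idx], inplace=True)


unique = pd.DataFrame({"X": range(5)})
dupes = pd.DataFrame({"X": range(5)}, index=list("AABCD"))

truncate_before([unique], 2)
print(list(unique.index))  # [2, 3, 4]

try:
    truncate_before([dupes], 1)
    raised = False
except ValueError:
    raised = True

print(raised)      # True
print(len(dupes))  # 5, the DataFrame was left untouched
```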

Note also this warning from one of pandas' core developers regarding the use of inplace:

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.

This is probably because under the hood, dfts.drop creates a new dataframe and then calls the _update_inplace private method to assign the new data to the old DataFrame:

def _update_inplace(self, result, verify_is_copy=True):
    """
    replace self internals with result.
    ...
    """
    self._reset_cache()
    self._clear_item_cache()
    self._data = getattr(result,'_data',result)
    self._maybe_update_cacher(verify_is_copy=verify_is_copy)

Since the temporary result had to be created anyway, there is no memory or performance benefit of "in-place" operations over simple reassignment.
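
The observable difference between the two styles is therefore about object identity, not cost. A small sketch (the variable names are illustrative) of how a second name bound to the same DataFrame behaves in each case:

```python
import pandas as pd

df = pd.DataFrame({"X": range(5)})
alias = df  # a second name bound to the same DataFrame object

# In-place drop: the object itself changes, so both names see three rows.
df.drop(df.index[:2], inplace=True)
rows_seen_by_alias = len(alias)

# Reassignment: df is rebound to a new object; alias keeps the old one.
df = df.drop(df.index[:1])

print(rows_seen_by_alias)  # 3
print(len(df))             # 2
print(len(alias))          # 3
print(alias is df)         # False
```

With reassignment, the assignment statement itself shows which name now refers to the truncated data.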

unutbu
  • Ok, mylist was a construct to group together all the individual dataframes. Is there nothing else I can do in terms of the original dataframes? Of course, I could just do it one by one. df_a = df_a[idx:], etc. But a programmatic way would be nice. Let me finish reading the article you recommended. – Spinor8 Jan 30 '16 at 14:48
  • `df.drop(..., inplace=True)` does modify `df` inplace, but due to the way inplace operations are implemented in Pandas, there is no real advantage to doing this over the more straight-forward reassignment to variable names. Personally I prefer functions that return values over functions that modify values, since with the former the assignment syntax makes it utterly clear what is getting modified. – unutbu Jan 30 '16 at 15:23