In truncate_before:

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts = dfts[idx:]
        print(dfts.head)

the for-loop creates a local variable dfts which references the DataFrames in list_of_dfts. But dfts = dfts[idx:] reassigns dfts to a new value. It does not change the DataFrame in list_of_dfts.
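The same thing happens with any Python objects, not just DataFrames; here is a minimal sketch with plain ints (the names are only illustrative):

items = [10, 20, 30]
for x in items:
    x = x * 2        # rebinds the local name x; the list element is untouched
print(items)         # prints [10, 20, 30]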
See Facts and myths about Python names and values for a great explanation of how variable names bind to values, and what operations change values versus binding new values to variable names.
Here are a number of alternatives:
Modify the list in-place
def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]
    for dfts in list_of_dfts:
        print(dfts.head)
since assigning to list_of_dfts[:] (which calls list_of_dfts.__setitem__) changes the contents of list_of_dfts in-place.
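To see what slice assignment does, here is a minimal sketch with a plain list (illustrative names, not from the original code):

a = [1, 2, 3]
b = a              # b and a refer to the same list object
a[:] = [4, 5]      # slice assignment mutates that object via a.__setitem__
print(b)           # prints [4, 5]; the change is visible through every name
print(a is b)      # prints True; still the same object

With the DataFrames, the same idea looks like this: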
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
df_b = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    list_of_dfts[:] = [dfts[idx:] for dfts in list_of_dfts]

print(mylist[0])
truncate_before(mylist, 11)
print(mylist[0])
shows mylist[0] has been truncated. Note that df_a still references the original DataFrame, however.
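For instance, continuing the run above, checks along these lines (a sketch reusing the names from the example) would confirm that:

print(len(df_a))          # still 14 rows; df_a was never modified
print(df_a is mylist[0])  # False: mylist[0] now names a new, truncated DataFrame
print(len(mylist[0]))     # 3 rows (14 - 11)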
Return the list and reassign mylist or df_a, df_b to the result
Using return values may make it unnecessary to modify mylist in-place. To rebind the global variables to new values, you could make truncate_before return the list of DataFrames, and reassign mylist (or df_a and df_b) to the returned value:
def truncate_before(list_of_dfts, idx):
    return [dfts[idx:] for dfts in list_of_dfts]

mylist = truncate_before(mylist, 11)             # or
# df_a, df_b = truncate_before(mylist, 11)       # or
# mylist = df_a, df_b = truncate_before(mylist, 11)
But note that it is probably not a good idea to access the DataFrames through both mylist and df_a/df_b, since, as the example above shows, the values do not stay coordinated automagically. Using mylist alone should suffice.
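A complete run of this variant could look like the following sketch (same setup as in the earlier examples):

import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
df_b = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    return [dfts[idx:] for dfts in list_of_dfts]

mylist = truncate_before(mylist, 11)
print(len(mylist[0]))   # 3 rows remain
print(len(df_a))        # 14; df_a still refers to the original, untruncated DataFrame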
Use a DataFrame method with the inplace parameter, such as df.drop

dfts.drop (with inplace=True) modifies dfts itself:
import numpy as np
import pandas as pd

df_a = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
df_b = pd.DataFrame(np.random.rand(14, 2), columns=list('XY'))
mylist = [df_a, df_b]

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts.drop(dfts.index[:idx], inplace=True)

truncate_before(mylist, 11)
print(mylist[0])
print(df_a)
By modifying dfts in-place, the values in mylist and in df_a and df_b all get changed at the same time.
Note that dfts.drop drops rows based on index label value. So the above relies on the assumption that dfts.index is unique. If dfts.index is not unique, dfts.drop may drop more than idx rows. For example,
df = pd.DataFrame([1,2], index=['A', 'A'])
df.drop(['A'], inplace=True)
drops both rows, making df an empty DataFrame.
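If the index may contain duplicates, one possible workaround (a sketch, assuming it is acceptable to discard the original index labels) is to reset the index to a unique RangeIndex before dropping:

def truncate_before(list_of_dfts, idx):
    for dfts in list_of_dfts:
        dfts.reset_index(drop=True, inplace=True)   # index becomes 0..n-1, guaranteed unique
        dfts.drop(dfts.index[:idx], inplace=True)   # now drops exactly the first idx rows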
Note also this warning from a Pandas core developer regarding the use of inplace:

My personal opinion: I never use in-place operations. The syntax is harder to read and it does not offer any advantages.
This is probably because, under the hood, dfts.drop creates a new DataFrame and then calls the private _update_inplace method to assign the new data to the old DataFrame:
def _update_inplace(self, result, verify_is_copy=True):
    """
    replace self internals with result.
    ...
    """
    self._reset_cache()
    self._clear_item_cache()
    self._data = getattr(result, '_data', result)
    self._maybe_update_cacher(verify_is_copy=verify_is_copy)
Since the temporary result had to be created anyway, there is no memory or performance benefit of "in-place" operations over simple reassignment.
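In other words, the non-inplace equivalent (a sketch of what the quoted advice suggests) is simply:

dfts = dfts.drop(dfts.index[:idx])   # returns a new DataFrame
# or, avoiding the label-based caveat above entirely:
dfts = dfts.iloc[idx:]               # positional slice, works with duplicate index labels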