python - drop duplicated index in place in a pandas dataframe

Question

I have a list of dataframes:

all_df = [df1, df2, df3]

I would like to remove rows with duplicated indices in all dataframes in the list, such that the changes are reflected in the original dataframes df1, df2 and df3. I tried to do

for df in all_df:
    df = df[~df.index.duplicated()]

But the changes are only applied in the list, not on the original dataframes.

Essentially, I want to avoid doing the following:

df1 = df1[~df1.index.duplicated()]
df2 = df2[~df2.index.duplicated()]
df3 = df3[~df3.index.duplicated()]
all_df = [df1,df2,df3]

answered here https://stackoverflow.com/questions/41812564/looping-through-a-list-of-pandas-dataframes and here https://stackoverflow.com/questions/44630805/pandas-loop-through-list-of-data-frames-and-change-index. As to why: `df` looks at a new frame each time in the loop, forgetting the old one. — , Feb 09 '22 at 10:17
Thanks. These other questions are indeed related but do not answer my question completely. — Camille, Feb 09 '22 at 10:33
As it stands they exactly answer your question. In case you didn't notice, the answer below repeats those things. — , Feb 09 '22 at 10:53

jezrael · Accepted Answer · 2022-02-09T10:29:33.307

1

You need to recreate the list of DataFrames:

all_df = [df[~df.index.duplicated()] for df in all_df]

Or:

for i, df in enumerate(all_df):
    all_df[i] = df[~df.index.duplicated()]

print(all_df[0])
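As a minimal, self-contained sketch of the approach above: the sample frames and their values below are made up for illustration and are not from the question. It also shows that the names `df1` and `df2` keep pointing at the old frames until they are rebound, which can be done by unpacking the rebuilt list.

```python
import pandas as pd

# Hypothetical sample frames; each has one repeated index label.
df1 = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "x", "y"])
df2 = pd.DataFrame({"a": [4, 5, 6]}, index=["p", "q", "q"])
all_df = [df1, df2]

# Rebuild the list: each entry is a new frame that keeps only the
# first row for every index label.
all_df = [df[~df.index.duplicated()] for df in all_df]

# The name df1 still refers to the old, unfiltered frame here.
print(len(df1), len(all_df[0]))

# Rebind the names from the list if they are still needed.
df1, df2 = all_df
print(df1.index.tolist())
```

Boolean indexing returns a new DataFrame rather than changing the existing one, so rebinding the names is what makes `df1` and `df2` see the filtered data.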

EDIT: If the names of the DataFrames are important, use a dictionary of DataFrames. This still does not modify df1, df2 in place; you need to select the results by the keys of the dict:

d = {'price': df1, 'volumes': df2}

d = {k: df[~df.index.duplicated()] for k, df in d.items()}

print(d['price'])
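A runnable sketch of the dictionary variant, assuming two small made-up frames (the values and the variable names `prices` and `volumes` are illustrative only). The dict comprehension iterates over the dictionary itself, and each frame is then reached by its key.

```python
import pandas as pd

# Hypothetical sample data; the keys mirror the names used in the answer.
prices = pd.DataFrame({"value": [10.0, 11.0, 12.0]}, index=[0, 0, 1])
volumes = pd.DataFrame({"value": [100, 200, 300]}, index=[0, 1, 1])
d = {"price": prices, "volumes": volumes}

# Build a new dict whose values keep only the first row per index label.
d = {name: df[~df.index.duplicated()] for name, df in d.items()}

# Operate on all frames at once...
for name, df in d.items():
    print(name, len(df))

# ...or on a single frame, selected by key.
print(d["price"])
```

This keeps descriptive names available through the keys while still allowing a single loop over every frame.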

edited Feb 09 '22 at 10:29

answered Feb 09 '22 at 10:13

jezrael


Thanks. This will not remove rows in the original dataframes df1, df2, df3 though, right? It will only change what is in the list all_df. I would like to apply the changes also to the original dataframes df1, df2 and df3. – Camille Feb 09 '22 at 10:19
@Camille - but why? I see no reason. – jezrael Feb 09 '22 at 10:19
@Camille - you need to forget about `df1, df2, df3`; work with the list and use functions that operate on the list. – jezrael Feb 09 '22 at 10:20
Because in the rest of my code, I sometimes do operations on all dataframes, in which case I use the list all_df, and sometimes do operations only on some of them. – Camille Feb 09 '22 at 10:21
@Camille - I see no reason to use them. Instead of `df1`, `df2` (those names are forgotten), you need to use `all_df[0]`, `all_df[1]`. – jezrael Feb 09 '22 at 10:25
My original dataframes have explicit names that are helpful to me (for example, df1 is actually called prices, df2 volumes, and so on) so I would like to keep using these names instead of all_df[0], ... Writing this, it occurs to me that I could use a dict of dataframes instead of a list and give a name to the dataframes in that way – Camille Feb 09 '22 at 10:27
@Camille - understood, why not use a dictionary of DataFrames? – jezrael Feb 09 '22 at 10:28
