
I have multiple dataframes on which I want to run the following function, which drops unnecessary columns and returns a dataframe:

def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' or named NaN

    Args:
        df ([dataframe]): dataframe whose columns are to be dropped
    """
    
    #first drop nan columns
    df = df.loc[:, df.columns.notnull()]
    #then search for columns with unnamed 
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    
    return df

Now I iterate over the list of dataframes: [df1, df2, df3]

dfsublist = [df1, df2, df3]
for index in enumerate(dfsublist):
    dfsublist[index] = dropunnamednancols(dfsublist[index])

While the items of dfsublist have been changed, the original dataframes df1, df2, df3 still contain the unnecessary columns. How can I change the originals as well?

haphaZard
  • If you have a small number of dfs, you can try `df1, df2, df3 = [dropunnamednancols(df) for df in dfsublist]`. – Mustafa Aydın Apr 27 '21 at 10:20
  • This is a good tip. I tried a list comprehension, dfsublist = [dropunnamednancols(df) for df in dfsublist], which of course did not work. Thanks – haphaZard Apr 27 '21 at 12:01

2 Answers


If I understand correctly, you want to apply a function to multiple dataframes separately.

The underlying issue is that your function returns a new dataframe, so you replace the dataframe stored in the list with a new one instead of modifying the original.

If you want to modify the original, you have to use the inplace=True parameter of the pandas functions. This is possible, but not recommended, as seen here.

Your code could therefore look like this:

import pandas as pd

def dropunnamednancols(df):
    """
    Drop any columns starting with 'Unnamed' or named NaN, in place

    Args:
        df ([dataframe]): dataframe whose columns are to be dropped
    """

    # pd.isna catches None/NaN labels; str() guards against non-string labels
    cols = [col for col in df.columns if pd.isna(col) or str(col).startswith('Unnamed')]
    df.drop(cols, axis=1, inplace=True)

As an example on sample data:

import pandas as pd
df_1 = pd.DataFrame({'a':[0,1,2,3], 'Unnamed':[9,8,7,6]})
df_2 = pd.DataFrame({'Unnamed':[9,8,7,6], 'b':[0,1,2,3]})

lst_dfs = [df_1, df_2]

# The function works in place and returns None, so a plain loop is enough
for df in lst_dfs:
    dropunnamednancols(df)

# df_1 
# Out[55]: 
#    a
# 0  0
# 1  1
# 2  2
# 3  3
# df_2
# Out[56]: 
#    b
# 0  0
# 1  1
# 2  2
# 3  3
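
To make the difference between rebinding and in-place modification visible, here is a minimal sketch. The helpers `drop_copy` and `drop_inplace` and the sample column names are made up for illustration; they are not part of the question's code.

```python
import pandas as pd

def drop_copy(df):
    # Hypothetical helper: returns a NEW dataframe, like the function in the question
    return df.loc[:, ~df.columns.str.contains('^Unnamed')]

def drop_inplace(df):
    # Hypothetical helper: modifies the dataframe object it receives
    cols = [c for c in df.columns if str(c).startswith('Unnamed')]
    df.drop(columns=cols, inplace=True)

df1 = pd.DataFrame({'a': [0, 1], 'Unnamed: 0': [9, 8]})

# Rebinding: the list slot now points at a new object, df1 is untouched
dfs = [df1]
dfs[0] = drop_copy(dfs[0])
same_object_after_rebinding = dfs[0] is df1   # False
cols_after_rebinding = list(df1.columns)      # ['a', 'Unnamed: 0']

# In-place: the list slot and df1 are still the same object, so df1 changes
dfs = [df1]
drop_inplace(dfs[0])
cols_after_inplace = list(df1.columns)        # ['a']
```

If you would rather keep the function free of side effects, the tuple unpacking suggested in the comments under the question rebinds the names df1, df2, df3 themselves instead.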
Andreas
  • Yes, exactly this is what I wanted to achieve. Thanks for the tips! – haphaZard Apr 27 '21 at 11:28
  • Glad I could help. Could you please also mark the answer as 'accepted answer'? This can help future readers with a similar problem. Happy coding! – Andreas Apr 27 '21 at 12:07

The reason is probably that you are using enumerate incorrectly. In your case you just want the index, so what you should do is:

for index in range(len(dfsublist)):
    ...

enumerate yields a tuple of the index and the actual value in your list. So in your code, the loop variable index will actually be assigned:

(0, df1) # First iteration
(1, df2) # Second iteration
(2, df3) # Third iteration

So either use enumerate correctly and unpack the tuple:

for index, df in enumerate(dfsublist):
    ...

or drop it altogether, since you access the values through the index either way.
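
A small sketch of both behaviours, using placeholder strings instead of real dataframes (the names `items` and `processed` are only for illustration):

```python
items = ['df1', 'df2', 'df3']  # placeholder strings standing in for the dataframes

# enumerate yields (index, value) tuples
pairs = list(enumerate(items))  # [(0, 'df1'), (1, 'df2'), (2, 'df3')]

# Using the whole tuple as a list index raises a TypeError
error = None
try:
    for index in enumerate(items):
        items[index] = items[index]
except TypeError as exc:
    error = exc

# Unpacking the tuple gives a usable integer index
processed = list(items)
for index, item in enumerate(items):
    processed[index] = item.upper()
```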

sunnytown
  • I don't think that is the problem here; he wants to change the original dataframes and is just using the list of dataframes to apply the function to them. – Andreas Apr 27 '21 at 10:14
  • Yes, Andreas is correct. I have tried both solutions with enumerate as well. The problem is that I want to change the original dataframes, and with my approach only a copy is changed. – haphaZard Apr 27 '21 at 11:25