1

For a given dataframe df, imported from a csv file and containing redundant data (columns), I would like to write a function that allows to perform recursive filtering and sub-sequent renaming of df.columns, based on the amount of arguments given.

Ideally the function should perform as follows. When input is (df, 'string1a', 'string1b', 'new_col_name1'), then:

filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
df_out = df [ filter1]
df_out.columns= ['new_col_name1']
return df_out

Whereas, when input is: (df, 'string1a', 'string1b', 'new_col_name1','string2a', 'string2b', 'new_col_name2', 'string3a', 'string3b', 'new_col_name3')the function should return

filter1 = [col for col in df.columns if 'string1a' in col and 'string1b' in col]
filter2 = [col for col in df.columns if 'string2a' in col and 'string2b' in col]
filter3 = [col for col in df.columns if 'string3a' in col and 'string3b' in col]

df_out = df [ filter1 + filter2 + filter3 ]
df_out.columns= ['new_col_name1','new_col_name2','new_col_name3']
return df_out
DavidG
  • 24,279
  • 14
  • 89
  • 82
Andreuccio
  • 1,053
  • 2
  • 18
  • 32
  • What happens if multiple columns meet the criteria, are you OK with all of them having the same name? In fact, if they are multiple, your last statement won't work. How would you handle that? – elPastor Dec 12 '17 at 11:40
  • the two strings filtering criteria aim at avoiding that more than 1 column could pass, so I would be OK with that. If there ll be instances where this does not apply (there won't be many), I'd be happy to filter those manually, i.e. not using a recursive function – Andreuccio Dec 12 '17 at 11:44

1 Answers1

1

I think you can use dictionary for define values and then apply function with np.logical_and.reduce because need check multiple values in list:

df = pd.DataFrame({'aavfb':list('abcdef'),
                   'cedf':[4,5,4,5,5,4],
                   'd':[7,8,9,4,2,3],
                   'c':[1,3,5,7,1,0],
                   'abds':[5,3,6,9,2,4],
                   'F':list('aaabbb')})

print (df)
   F aavfb  abds  c  cedf  d
0  a     a     5  1     4  7
1  a     b     3  3     5  8
2  a     c     6  5     4  9
3  b     d     9  7     5  4
4  b     e     2  1     5  2
5  b     f     4  0     4  3

def rename1(df, d):
    #loop in dict
    for k,v in d.items():
        #get mask for columns contains all values in lists
        m = np.logical_and.reduce([df.columns.str.contains(x) for x in v])         
        #set new columns names by mask 
        df.columns = np.where(m, k, df.columns)

    #filter all columns by keys of dict
    return df.loc[:, df.columns.isin(d.keys())]

d = {'new_col_name1':['a', 'b'],
     'new_col_name2':['c', 'd']}   

print (rename1(df, d))

   new_col_name1  new_col_name1  new_col_name2
0              a              5              4
1              b              3              5
2              c              6              4
3              d              9              5
4              e              2              5
5              f              4              4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252