Remove columns that have substring similar to other columns Python

Question

I have dataframe where the column names have the same format: data_sensor, where the date is in the format of yymmdd. Here is a subset of it:

Considering the last data (180722), I would like to keep the column according to sensor pre-defined priority. For example, I would like to define that SN1 is more important than SK3. So the desired result would be the same dataframe, only without column 180722_SK3. The number of columns with the same last date can be more than two.

This is the solution I implemented:

sensorsImportance = ['SN1', 'SK3'] #list of importence, first item is the most important
sensorsOrdering = {word: i for i, word in enumerate(sensorsImportance)}


def remove_duplicate_last_date(df,sensorsOrdering):
    s = []
    lastDate = df.columns.tolist()[-1].split('_')[0]
    for i in df.columns.tolist():
        if lastDate in i:
            s.append(i.split('_')[1])

    if len(s)>1: 
        keepCol = lastDate +'_'+sorted(s, key=sensorsOrdering.get)[0]
        dropCols = [lastDate +'_'+i for i in sorted(s, key=sensorsOrdering.get)[1:]]
        df.drop(dropCols,axis=1,inplace=True)
        
    return df

It works fine, however, I feel that this is too cumbersome, is there a better way?

Here's [how to make good reproducible examples in pandas](https://stackoverflow.com/q/20109391/6692898) — RichieV, Aug 02 '20 at 14:54
Does this answer your question? [sorting by a custom list in pandas](https://stackoverflow.com/questions/23482668/sorting-by-a-custom-list-in-pandas) — RichieV, Aug 02 '20 at 14:57

score 0 · Answer 1 · answered Aug 02 '20 at 15:01

0

It can be done, with split the column then apply the argsort with the list , then reorder your dataframe , and join back the columns after groupby get the first value by date

df.columns=df.columns.str.split('_').map(tuple)
sensorsImportance = ['SN1', 'SK3']
idx=df.columns.get_level_values(1).map(dict(zip(sensorsImportance,range(len(sensorsImportance))))).argsort()
df=df.iloc[:,idx].T.groupby(level=0).head(1).T
df.columns=df.columns.map('_'.join)

answered Aug 02 '20 at 15:01

BENY

317,841
20
164
234

Hi @YOBEN_S, thank you. However, one major and one minor thing are not adequate to my problem. The minor thing is that the returned `df` is not ordered by columns name as the original `df` is, but this can be easily solved with `df.loc[:,sorted(df.columns.tolist())]`. The major thing is that I might have duplicate dates in the `df` which I would like to keep (remove only the last column/s in case there are duplicate dates in it), hence the `groupby` removes also duplicate dates that are not in the last column, duplicate dates which I would like to keep. – user88484 Aug 03 '20 at 06:57

Remove columns that have substring similar to other columns Python

1 Answers1