I am trying to remove non-consecutive duplicated words and numbers from the column names.

E.g. I currently have df['Weeks with more than 60 hours 60'] and I want to get df['Weeks with more than 60 hours']

I tested

df.columns = df.columns.str.split().apply(lambda x:OrderedDict.fromkeys(x).keys()).str.join(' ')

following the answer to "Python Dataframe: Remove duplicate words in the same cell within a column in Python".

But I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-85-1078b4f07191> in <module>()
     31     df_t.columns = df_t.columns.str.replace(r"."," ")
     32     df_t.columns = df_t.columns.str.strip()
---> 33     df_t.columns = df_t.columns.str.split().apply(lambda x:OrderedDict.fromkeys(x).keys()).str.join(' ')
     34 
     35 #     df_t.columns = df_t.columns.str.replace(r"\(.*\)","")

AttributeError: 'Index' object has no attribute 'apply'

Suggestions?

Filippo Sebastio

1 Answer

An Index has no apply method, so use a list comprehension or map instead:

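import pandas as pd
from collections import OrderedDict

df = pd.DataFrame(columns=['What is is name name name'])

# Split each name into words, keep the first occurrence of each word, then rejoin
df.columns = [' '.join(OrderedDict.fromkeys(x).keys()) for x in df.columns.str.split()]
print(df)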
Empty DataFrame
Columns: [What is name]
Index: []

The same with map:

df.columns = (df.columns.str.split()
                .map(lambda x:OrderedDict.fromkeys(x).keys())
                .str.join(' '))
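
As a possible variation on the approach above, the deduplication can be pulled into a small helper and applied through `DataFrame.rename`. This is only a sketch: the name `dedupe_words` is made up for illustration, and it assumes Python 3.7+, where a plain `dict` preserves insertion order and can stand in for `OrderedDict`.

```python
import pandas as pd

def dedupe_words(name: str) -> str:
    """Drop repeated whitespace-separated words, keeping the first occurrence of each."""
    # Hypothetical helper; relies on dict preserving insertion order (Python 3.7+).
    return " ".join(dict.fromkeys(name.split()))

df = pd.DataFrame(columns=["Weeks with more than 60 hours 60"])
# rename accepts a callable and applies it to every column label
df = df.rename(columns=dedupe_words)
print(list(df.columns))  # ['Weeks with more than 60 hours']
```

The comparison is exact and case-sensitive, so 'Name' and 'name' would both be kept.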
jezrael
  • Is it possible that your suggestion actually removed words that were repeated across column names rather than within each column name? 'Workers' is a term that appears in most of the column names, and now I can't find it anymore. – Filippo Sebastio Mar 13 '19 at 07:28
  • @FilippoSebastio - Not sure I understand. Can you change `df = pd.DataFrame(columns=['What is is name name name', 'another col col'])` to match your case and add the expected output? – jezrael Mar 13 '19 at 07:31
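
To check the concern raised in the comments, here is a minimal sketch using the list comprehension from the answer on two column names that share a word. The column names are hypothetical, chosen only to resemble the ones described in the question. Each name is split and deduplicated on its own, so a word shared between columns should survive in both.

```python
from collections import OrderedDict

import pandas as pd

# Hypothetical column names: 'Workers' appears in both, and each has an internal repeat
df = pd.DataFrame(columns=["Workers per week Workers",
                           "Workers with more than 60 hours 60"])

# Deduplicate words within each name independently
df.columns = [" ".join(OrderedDict.fromkeys(x)) for x in df.columns.str.split()]
print(list(df.columns))
# ['Workers per week', 'Workers with more than 60 hours']
```

If a word still goes missing with real data, the cause is probably in an earlier cleaning step rather than in this one.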