Pandas expression causes column explosion (or how to delete columns that contain substring in duplicate names environment)

Question

I use the following pandas expression

df = df[df.columns[~df.columns.str.contains('Unnamed:')]]

to drop columns whose names contain "Unnamed:". I took it from the question "Remove Unnamed columns in pandas dataframe".

For some reason, in some cases, this line causes an explosion of columns, e.g.

df shape before: (2000, 1451)
df shape after dropping Unnamed: (2000, 3851)

In particular, the explosion seems to happen when some columns share the same name, i.e. when there are duplicates.

Does anyone know why this happens and how to avoid it?

How do I drop columns whose names contain a certain substring when duplicate column names are allowed? Thanks

piRSquared · Accepted Answer · 2019-06-24T14:33:09.110

3

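You're slicing with column names when you clearly have repeated names. Selecting by label returns every column that matches each label, so each duplicated name is pulled in once per occurrence. You want to slice using loc and a boolean mask instead.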

df = df.loc[:, ~df.columns.str.contains('Unnamed:')]
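
To make the difference concrete, here is a minimal sketch on a made-up one-row frame (the column names and values are illustrative assumptions, not the asker's data). It compares label-based selection with the boolean mask when a name is duplicated:

```python
import pandas as pd

# Hypothetical frame: "a" appears twice, plus one unwanted "Unnamed: 0" column.
demo = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "a", "Unnamed: 0", "b"])

# Boolean mask with one entry per column position.
keep = ~demo.columns.str.contains("Unnamed:")

# Label-based selection: the labels are ["a", "a", "b"], and each "a"
# label matches both "a" columns, so the duplicates multiply.
by_label = demo[demo.columns[keep]]

# Position-based selection: the mask keeps or drops each column exactly once.
by_mask = demo.loc[:, keep]

print(by_label.shape)  # (1, 5)
print(by_mask.shape)   # (1, 3)
```

With k copies of a name, label-based selection yields k * k columns for that name, which would explain the growth reported in the question.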

edited Jun 24 '19 at 14:33

answered Jun 24 '19 at 14:23

score 1 · Answer 2 · answered Jun 24 '19 at 14:24

I recommend fixing the duplicated column names problem instead.

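# Turn the column names into a Series so they can be grouped
s = df.columns.to_series()
# 0 for the first occurrence of a name, 1 for the second, and so on
s1 = s.groupby(s).cumcount().astype(str)
# Append the count, leaving first occurrences unchanged
newc = s + s1.mask(s1 == '0', '')
newc
Out[717]: 
a     a
a    a1
b     b
dtype: object
df.columns = newc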
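
The same renaming idea can also be sketched in plain Python, which may be easier to follow than the groupby version. The helper name `make_unique` and the sample frame are assumptions made for illustration; they are not part of pandas or of the answer above:

```python
import pandas as pd

def make_unique(names):
    """Append a running count to repeated names; first occurrences stay as-is."""
    seen = {}  # name -> number of times it has appeared so far
    out = []
    for name in names:
        n = seen.get(name, 0)
        out.append(name if n == 0 else f"{name}{n}")
        seen[name] = n + 1
    return out

# Hypothetical frame with a duplicated column name.
demo = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])
demo.columns = make_unique(demo.columns)
print(list(demo.columns))  # ['a', 'a1', 'b']
```

Note that this sketch does not guard against collisions: if a column called "a1" already exists, the renamed duplicate would clash with it.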

BENY

@YohanRoth This adds a running count to each name: if a name is unique nothing changes, and if it is duplicated the count is appended to make it unique. – BENY Jun 24 '19 at 14:29
