By grouping two columns I made some changes.
I generated a file using python, it resulted in 2 duplicate columns. How to remove duplicate columns from a dataframe?
By grouping two columns I made some changes.
I generated a file using python, it resulted in 2 duplicate columns. How to remove duplicate columns from a dataframe?
It's probably easiest to use a groupby (assuming they have duplicate names too):
In [11]: df
Out[11]:
A B B
0 a 4 4
1 b 4 4
2 c 4 4
In [12]: df.T.groupby(level=0).first().T
Out[12]:
A B
0 a 4
1 b 4
2 c 4
If they have different names you can drop_duplicates
on the transpose:
In [21]: df
Out[21]:
A B C
0 a 4 4
1 b 4 4
2 c 4 4
In [22]: df.T.drop_duplicates().T
Out[22]:
A B
0 a 4
1 b 4
2 c 4
Usually read_csv
will usually ensure they have different names...
Transposing is a bad idea when working with large DataFrames. See this answer for a memory efficient alternative: https://stackoverflow.com/a/32961145/759442
This is the best I found so far.
remove = []
cols = df.columns
for i in range(len(cols)-1):
v = df[cols[i]].values
for j in range(i+1,len(cols)):
if np.array_equal(v,df[cols[j]].values):
remove.append(cols[j])
df.drop(remove, axis=1, inplace=True)
It's already answered here python pandas remove duplicate columns.
Idea is that df.columns.duplicated()
generates boolean vector where each value says whether it has seen the column before or not. For example, if df
has columns ["Col1", "Col2", "Col1"]
, then it generates [False, False, True]
. Let's take inversion of it and call it as column_selector
.
Using the above vector and using loc
method of df
which helps in selecting rows and columns, we can remove the duplicate columns. With df.loc[:, column_selector]
we can select columns.
column_selector = ~df.columns().duplicated()
df = df.loc[:, column_selector]
I understand that this is an old question, but I recently had this same issue and none of these solutions worked for me, or the looping suggestion seemed a bit overkill. In the end, I simply found the index of the undesirable duplicate column and dropped that column index. So provided you know the index of the column this will work (which you could probably find via debugging or print statements):
df.drop(df.columns[i], axis=1)
The fast solution for dataset without NANs:
share = 0.05
dfx = df.sample(int(df.shape[0]*share))
dfx = dfx.T.drop_duplicates().T
df = df[dfx.columns]