1

So I have a DataFrame that has multiple column names that are the same. I firstly separate these columns and create a DataFrame that has these duplicated column names. What I was wondering, is there anyway to double check if all of the values in the rows have the same value as each other? In which case, I know that one column is an exact duplicate of another and I can drop it?

Below is what I have gotten so far and an example DF:

import collections
duplicated_columns = [item for item, count in collections.Counter(checking_empty_list).items() if count > 1]

dupe[duplicated_columns].head()

That is a simple way to obtain the columns that appear twice in a dataframe and then I am just displaying the head. I get the following df below:

enter image description here

In this case, ID would be all true and we can delete one of the columns, however total wouldn't be true and we won't delete any as the values are different. I thought of doing but it is unable to differentiate. How would I do it for multiple duplicated columns which could differ each time?

if df['ID'] == df['ID'].all() == True:
    df.drop(['ID'])
else:
    pass

Here is the code to get it into a df in Python.

ID,ID,TOTAL,TOTAL
0.0,0.0,290,0.00
0.0,0.0,16,60.16
0.0,0.0,0,0.00
0.0,0.0,87,0.87
0.0,0.0,78,0.78
df = pd.read_clipboard(sep=',')
df = df.rename(columns = {'TOTAL.1':'TOTAL', 'ID.1':'ID'})

Any help would be appreciated.

Nhyi
  • 373
  • 1
  • 12
  • Does this answer your question? [python pandas remove duplicate columns](https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns) – K450 Nov 22 '21 at 13:28
  • 1
    The transpose method seems to be good but some of my dataframes are around 300,000 rows in length which means that it would hang and not work. I sdaw dropping the duplicated column names, however in my case it would mean that a column which doesn't have identical column values would be dropped. – Nhyi Nov 22 '21 at 13:34
  • what's wrong with [the second solution in this answer](https://stackoverflow.com/a/32961145/3944322)? (in modern panda you need to import `array_equivalent` from `pandas.core.dtypes.missing`) – Stef Nov 22 '21 at 16:17
  • Tried using it but had no luck and the dataframe was returning a dataframe with no values. However, with the df.T.drop_duplicates().T worked fine still. The second solution only returns a list of the duplicated columns and not the actual dataframe. – Nhyi Nov 23 '21 at 10:13

2 Answers2

1

Ok, as the transpose alternative didn't work. I've tried your code.

to use all() you should delimit the expression with (); the drop method is not in place and the axis should be equals 1 (column).

if (df.loc[:,'ID'] == df.loc[:,"ID.1"]).all():
    df.drop('ID.1', axis= 1,inplace=True)
else:
    pass

df
Out[42]: 
   ID  TOTAL  TOTAL.1
0   0    290     0.00
1   0     16    60.16
2   0      0     0.00
3   0     87     0.87
4   0     78     0.78
0

You can transpose the DataFrame and then use drop_duplicates.

df.T.drop_duplicates()
Out[6]: 
           0      1    2      3      4
ID       0.0   0.00  0.0   0.00   0.00
TOTAL  290.0  16.00  0.0  87.00  78.00
TOTAL    0.0  60.16  0.0   0.87   0.78

To get the desired table, just transpose it back:

new_df = df.T.drop_duplicates() 
new_df.T
Out[8]: 
    ID  TOTAL  TOTAL
0  0.0  290.0   0.00
1  0.0   16.0  60.16
2  0.0    0.0   0.00
3  0.0   87.0   0.87
4  0.0   78.0   0.78
  • This is definitely the way I want to go, however the issue is that with large dataframes, this calculation takers too long and is inefficient. – Nhyi Nov 23 '21 at 10:14