
I need to dynamically remove duplicate columns in pandas, i.e. columns whose values are identical across all records.

For example:

df:

Id  ProductName  ProductSize  ProductSize  ProductDesc  Quantity  SoldCount  Sales
1   Shoes        9            9            Shoes        143       143        6374
2   Bag          XL           XL           Bag          342       342        2839
3   Laptop       16INCH       16INCH       Laptop       452       452        8293
4   Shoes        9            9            Shoes        143       143        3662
5   Laptop       14INCH       14INCH       Laptop       452       452        7263

In the DataFrame above there are duplicate columns with the exact same name (ProductSize), as well as columns whose values are identical across all records under a different name (ProductName/ProductDesc, Quantity/SoldCount). I am trying to remove those columns, keeping the first occurrence by default.

df_output:

Id  ProductName  ProductSize  Quantity  Sales
1   Shoes        9            143       6374
2   Bag          XL           342       2839
3   Laptop       16INCH       452       8293
4   Shoes        9            143       3662
5   Laptop       14INCH       452       7263
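
For reference, the input frame can be reproduced like this (a sketch; the string dtype for ProductSize is assumed, and pandas accepts the duplicated column label when building from a list of rows):

import pandas as pd

df = pd.DataFrame(
    [[1, 'Shoes',  '9',      '9',      'Shoes',  143, 143, 6374],
     [2, 'Bag',    'XL',     'XL',     'Bag',    342, 342, 2839],
     [3, 'Laptop', '16INCH', '16INCH', 'Laptop', 452, 452, 8293],
     [4, 'Shoes',  '9',      '9',      'Shoes',  143, 143, 3662],
     [5, 'Laptop', '14INCH', '14INCH', 'Laptop', 452, 452, 7263]],
    columns=['Id', 'ProductName', 'ProductSize', 'ProductSize',
             'ProductDesc', 'Quantity', 'SoldCount', 'Sales'])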
  • If you need an efficient method, I recommend having a look at the answer I provided in the duplicate (based on the column hashes). – mozway May 14 '23 at 11:45
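
For reference, here is a minimal sketch of such a hash-based approach (not necessarily mozway's exact implementation), using pd.util.hash_pandas_object to turn each column into a hashable key:

import pandas as pd

def drop_duplicate_columns(df):
    # Hash each column once instead of comparing columns pairwise
    seen, keep = set(), []
    for _, col in df.items():
        key = pd.util.hash_pandas_object(col, index=False).to_numpy().tobytes()
        keep.append(key not in seen)
        seen.add(key)
    # Positional boolean selection copes with duplicate column names
    return df.loc[:, keep]

Note that this compares columns by hash only, so collisions are theoretically possible; see the answer mozway refers to for the details.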

2 Answers


Approach 1 - uses transpose. Transposing the DataFrame turns the columns into rows, so the duplicated() method (keeping the first occurrence) can flag the repeated columns. The inverted result is then used as a boolean mask to select the surviving columns from the original DataFrame, and the result is assigned to df_output.

# Transpose the DataFrame to make columns as rows
transposed_df = df.transpose()

# Find duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')

# Build a boolean mask of the columns to keep (True = keep)
keep_mask = ~duplicate_columns.to_numpy()

# Select the surviving columns from the original DataFrame; positional
# selection also handles the duplicated 'ProductSize' name correctly
df_output = df.loc[:, keep_mask].copy()

# Print the resulting DataFrame
print(df_output)
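
The same idea condensed into a single statement (a compact equivalent of the steps above; the boolean mask selects columns positionally):

df_output = df.loc[:, ~df.T.duplicated(keep='first').to_numpy()].copy()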

What if the IDs contain duplicates? In the updated version below, the index is reset at the start with df.reset_index(inplace=True), turning the Id index into a regular column. After the duplicate columns are removed, the Id column is set as the index again with df_output.set_index('Id', inplace=True).

By resetting and reassigning the index, you ensure that duplicate IDs are preserved in the resulting DataFrame.

# Reset the index to convert the Id column to a regular column
df.reset_index(inplace=True)

# Transpose the DataFrame to make columns as rows
transposed_df = df.transpose()

# Find duplicate columns (excluding the first occurrence)
duplicate_columns = transposed_df.duplicated(keep='first')

# Build a boolean mask of the columns to keep (True = keep)
keep_mask = ~duplicate_columns.to_numpy()

# Select the surviving columns positionally from the original DataFrame
df_output = df.loc[:, keep_mask].copy()

# Set the Id column as the index again
df_output.set_index('Id', inplace=True)

print(df_output)

Approach 2 - utilizes the nunique() method to drop columns that contain only a single unique value. (Note that this removes constant columns; a duplicated column whose values vary, such as ProductDesc in the sample, is not caught by this check.)

# Count the unique values per column
value_counts = df.nunique()

# Build a boolean mask of the columns with more than one unique value
keep_mask = (value_counts > 1).to_numpy()

# Keep only those columns (positional selection handles duplicate names)
df_output = df.loc[:, keep_mask].copy()

# Print the resulting DataFrame
print(df_output)

What if the Ids are duplicated? After keeping only the unique columns, we reset the index to turn the Id index back into a regular column, drop the rows with duplicate IDs using df_output[~df_output['Id'].duplicated()], and finally set the Id column as the index again using df_output.set_index('Id', inplace=True).

This way, you can handle duplicate IDs while removing duplicate columns based on the uniqueness of the values.

# Count the unique values per column
value_counts = df.nunique()

# Build a boolean mask of the columns with more than one unique value
keep_mask = (value_counts > 1).to_numpy()

# Keep only those columns
df_output = df.loc[:, keep_mask].copy()

# Reset the index so Id becomes a regular column
df_output.reset_index(inplace=True)

# Remove duplicate IDs from the DataFrame
df_output = df_output[~df_output['Id'].duplicated()]

# Set the ID column as the index again
df_output.set_index('Id', inplace=True)

print(df_output)
– dimButTries

Use transpose so that drop_duplicates can be applied to the columns:

df.T.drop_duplicates().T
– Panda Kim
    Transpose is relatively expensive and will mess up the dtypes since the input has mixed types. While concise, this is really not a good approach IMO. – mozway May 14 '23 at 11:46
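
If you do use the transpose one-liner, one way to mitigate the dtype issue mozway mentions is to let pandas re-infer the column types afterwards (a sketch; infer_objects() restores numeric dtypes where possible):

df_output = df.T.drop_duplicates().T.infer_objects()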