52

I have a pandas DataFrame which looks like this:

  Column1  Column2 Column3
0     cat        1       C
1     dog        1       A
2     cat        1       B

I want to identify that rows 0 and 2 have the same values in Column1 and Column2 (the combination cat, 1 is repeated) and hence want to remove one record and preserve only the first record. The resulting DataFrame should only have:

  Column1  Column2 Column3
0     cat        1       C
1     dog        1       A
Sayonti
    `df.drop_duplicates(['Column1', 'Column2'])` – piRSquared Jun 16 '18 at 05:01
  • I am looking for something that will match the values in the two particular columns and then drop, not for the entire data frame @piRSquared – Sayonti Jun 16 '18 at 05:04
  • Did you look into `subset` option in `drop_duplicates` ? – niraj Jun 16 '18 at 05:08
  • Something like `df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')` – niraj Jun 16 '18 at 05:11
  • `df = repair[repair.duplicated(subset=['SERIAL_NO','PART_NO'], keep=False)]` `print(df)` – Sayonti Jun 16 '18 at 05:11
  • I did and that's not the solution as that drops both the columns entirely. So I am looking for something that would just match the values within the two columns and then remove the row. @student – Sayonti Jun 16 '18 at 05:14
  • 3
    you need `keep='first'` which is the default. `keep=False` is wrong – piRSquared Jun 16 '18 at 05:14
  • 1
    It returns as expected and yes it needs `keep='first'` http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html Also, you are using duplicated, which only keeps duplicates; instead you need drop_duplicates. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.duplicated.html – niraj Jun 16 '18 at 05:15
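For reference, here is a minimal sketch (using the sample data from the question) of the difference discussed in these comments: duplicated(keep=False) only flags or selects the duplicated rows, whereas drop_duplicates(keep='first') removes the later duplicates and keeps the first record.

import pandas as pd

df = pd.DataFrame({'Column1': ['cat', 'dog', 'cat'],
                   'Column2': [1, 1, 1],
                   'Column3': ['C', 'A', 'B']})

# duplicated(keep=False) marks every row that has a duplicate (rows 0 and 2 here)
mask = df.duplicated(subset=['Column1', 'Column2'], keep=False)
print(df[mask])  # keeps only the duplicated rows -- not what the question asks for

# drop_duplicates(keep='first') removes later duplicates and preserves the first record
print(df.drop_duplicates(subset=['Column1', 'Column2'], keep='first'))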

4 Answers

88

Use drop_duplicates with subset set to the list of columns to check for duplicates, and keep='first' to keep the first occurrence of each duplicate.

If the DataFrame is:

df = pd.DataFrame({'Column1': ["'cat'", "'toy'", "'cat'"],
                   'Column2': ["'bat'", "'flower'", "'bat'"],
                   'Column3': ["'xyz'", "'abc'", "'lmn'"]})
print(df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
2   'cat'     'bat'   'lmn'

Then:

result_df = df.drop_duplicates(subset=['Column1', 'Column2'], keep='first')
print(result_df)

Result:

  Column1   Column2 Column3
0   'cat'     'bat'   'xyz'
1   'toy'  'flower'   'abc'
niraj
8
import pandas as pd

df = pd.DataFrame({"Column1":["cat", "dog", "cat"],
                    "Column2":[1,1,1],
                    "Column3":["C","A","B"]})

# drop rows that repeat an earlier Column1 value, keeping only the first occurrence
df = df.drop_duplicates(subset=['Column1'], keep='first')
print(df)
zafrin
  • add 'Column2' as well inside subset parameter. – Jay Dangar Jun 16 '18 at 05:46
  • 1
    While this code snippet may be the solution, [including an explanation](https://meta.stackexchange.com/questions/114762/explaining-entirely-%E2%80%8C%E2%80%8Bcode-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Narendra Jadhav Jun 16 '18 at 06:47
  • 1
    I agree. I will try to do that. Thanks, Narendra! – zafrin Jun 16 '18 at 14:41
1

Inside the drop_duplicates() method of DataFrame you can provide a list of column names to eliminate duplicate records from your data.

The following "Tested" code does the same :

import pandas as pd

df = pd.DataFrame()
df.insert(loc=0,column='Column1',value=['cat',     'toy',    'cat'])
df.insert(loc=1,column='Column2',value=['bat',    'flower',  'bat'])
df.insert(loc=2,column='Column3',value=['xyz',     'abc',    'lmn'])

df = df.drop_duplicates(subset=['Column1','Column2'],keep='first')
print(df)

You can add other column names inside the subset parameter as well; by default it considers all the columns of your data. The keep parameter accepts the following values (see the sketch after this list):

  • first : Drop duplicates except for the first occurrence.
  • last : Drop duplicates except for the last occurrence.
  • False : Drop all duplicates.
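As a rough illustration of those three options (reusing the same values as the DataFrame built above, with subset=['Column1', 'Column2']):

import pandas as pd

df = pd.DataFrame({'Column1': ['cat', 'toy', 'cat'],
                   'Column2': ['bat', 'flower', 'bat'],
                   'Column3': ['xyz', 'abc', 'lmn']})
subset = ['Column1', 'Column2']

# keep='first': rows 0 and 1 remain, row 2 is dropped as a later duplicate
print(df.drop_duplicates(subset=subset, keep='first'))

# keep='last': rows 1 and 2 remain, row 0 is dropped
print(df.drop_duplicates(subset=subset, keep='last'))

# keep=False: only row 1 remains, since both members of the duplicated pair are removed
print(df.drop_duplicates(subset=subset, keep=False))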
Jay Dangar
1

Use drop_duplicates() with the column name(s):

import pandas as pd

data = pd.read_excel('your_excel_path_goes_here.xlsx')
# drop_duplicates returns a new DataFrame, so assign (or print) the result
data = data.drop_duplicates(subset=["Column1"], keep="first")
print(data)

keep='first' instructs pandas to keep the first occurrence and remove the remaining rows with duplicate values.

keep='last' instructs pandas to keep the last occurrence and remove the other rows with duplicate values.

Suppose we want to remove all the duplicate values in the Excel sheet entirely; in that case we can specify keep=False, as sketched below.
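Since the Excel file above is just a placeholder, here is a minimal self-contained sketch of keep=False, using an in-memory DataFrame with the question's sample values instead of read_excel:

import pandas as pd

# In-memory stand-in for the Excel data read above
data = pd.DataFrame({'Column1': ['cat', 'dog', 'cat'],
                     'Column2': [1, 1, 1],
                     'Column3': ['C', 'A', 'B']})

# keep=False drops every row whose Column1 value appears more than once,
# so both 'cat' rows are removed and only the 'dog' row remains
print(data.drop_duplicates(subset=['Column1'], keep=False))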