I am relatively new to Python, and even newer to pandas. I am trying to develop a simple web scraper to search Indeed for job postings. This is mostly about learning the language, but if I find a new job from it, all the better.

The nature of the data means there are going to be a lot of duplicates, and that is what I have seen so far. As a result, I wanted to remove the duplicates before sending the dataframe to a .csv file. I tried using DataFrame.drop_duplicates() in the code I was working on, but it didn't work. So I created a separate script that only tests the drop_duplicates() method, without having to go through all the other code first, to make sure I got the syntax right and it functions as expected. This is what I have:

import pandas as pd
df=pd.DataFrame({'A':['1', '2', '3'], 'B':['1', '2', '4']})
print(df)
df1=df.drop_duplicates()
print(df1)

My expectation was that drop_duplicates() would remove the first two rows from df and assign the result to df1. Instead, df and df1 were identical.

So then I tried the following, figuring the default index column applied by the DataFrame was interfering:

import pandas as pd
df=pd.DataFrame({'A':['1', '2', '3'], 'B':['1', '2', '4']})
print(df)
df1=df.drop_duplicates(subset=["A", "B"])
print(df1)

That didn't work either. I tried a couple of other iterations of the same code involving 'keep' and 'inplace', but the result is always a dataframe that is the same as the original. What am I missing? I am expecting it to remove the first two rows since they are the same. Are they not? Or am I just expecting the wrong thing...

Tom
  • A `set` can remove duplicates. `>>> set([1,2,3,1,2,3,4]) == {1, 2, 3, 4}` – James Schinner Jan 29 '18 at 16:11
  • Are you trying this? `df[df['A'] != df['B']]` – Bharath M Shetty Jan 29 '18 at 16:12
  • First two rows are not the same. They would be the same if they were both [1, 2], for example. But one of them is [1, 1] and the other one is [2, 2]. Why do you think they are the same? – ayhan Jan 29 '18 at 16:12
  • @ayhan I think they are the same because the row index is the same. Is a pandas dataframe structured differently than a standard mathematical matrix is? My expectation is that it will remove the first and second rows because both first row values are 1, and both second row values are 2. Clearly I am missing something... – Tom Jan 29 '18 at 16:25
  • When using the term duplicate in datasets, we generally refer to records being the same. (The same person is seen in the first row and in the fifth row: same name, same height, same sex etc.) In your example, the attribute values within a row are the same. Of course there might be use cases where you need to remove those too, but the term duplicate is generally understood this way. – ayhan Jan 29 '18 at 16:29

1 Answer

There are no row-wise duplicates in your dataframe.

As per the documentation, duplicates are identified by row.
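
To make the row-wise behaviour concrete, here is a minimal sketch using a variation of the question's data in which the third row repeats the first. The frame contents and the names `deduped` and `unique_only` are assumptions made for illustration, not part of the original code.

```python
import pandas as pd

# Illustrative data: rows 0 and 2 hold the same values in every column,
# so they are duplicates of each other. Row 1 is unique.
df = pd.DataFrame({'A': ['1', '2', '1'], 'B': ['1', '2', '1']})

# Default keep='first': the first occurrence stays, later repeats are dropped.
deduped = df.drop_duplicates()
print(deduped)

# keep=False drops every row that has a duplicate anywhere in the frame.
unique_only = df.drop_duplicates(keep=False)
print(unique_only)
```

Note that the row `['2', '2']` survives in both results: having equal values across columns within one row does not make that row a duplicate.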

To remove rows where df['A'] == df['B'], you can just mask by a Boolean array: df[df['A'] != df['B']]

df = pd.DataFrame({'A':['1', '2', '3'], 'B':['1', '2', '4']})

df[df.A != df.B]
#    A  B
# 2  3  4
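
For the scraping use case described in the question, a rough sketch of deduplicating before writing the .csv might look like the following. The column names (`title`, `company`, `scraped_page`) and the sample values are hypothetical, since the question does not show the scraped data; `subset` restricts which columns are compared when deciding whether two rows are duplicates.

```python
import io
import pandas as pd

# Hypothetical scraped postings; column names and values are assumptions.
postings = pd.DataFrame({
    'title': ['Data Analyst', 'Data Analyst', 'Python Developer'],
    'company': ['Acme', 'Acme', 'Initech'],
    'scraped_page': [1, 2, 1],
})

# Treat two postings as duplicates when title and company match,
# ignoring the page they were scraped from.
unique_postings = postings.drop_duplicates(subset=['title', 'company'])

# Written to an in-memory buffer here; pass a file path instead
# to create an actual .csv file.
buffer = io.StringIO()
unique_postings.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Without `subset`, the two "Data Analyst" rows would both be kept, because they differ in `scraped_page`.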
jpp