I am relatively new to Python, and even newer to pandas. I am trying to develop a simple web scraper that searches Indeed for job postings. This is mostly about learning the language, but if I find a new job from it, all the better.
The nature of the data means there are going to be a lot of duplicates, and that is what I have seen so far. As a result, I wanted to remove the duplicates before writing the dataframe to a .csv file. I tried DataFrame.drop_duplicates() in the code I was working on, but it didn't work. So I created a separate script to test only the drop_duplicates() method, to make sure I had the syntax right and that it behaves as expected, without having to run all the other code first. This is what I have:
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['1', '2', '4']})
print(df)
df1 = df.drop_duplicates()
print(df1)
My expectation was that drop_duplicates() would remove the first two rows from df and assign the result to df1. Instead, both dataframes were identical.
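To show exactly what I mean, here is the same test with a direct comparison of the two frames (the equals() check is just my way of confirming they really are identical):

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['1', '2', '4']})
df1 = df.drop_duplicates()

# Both frames print the same three rows, and comparing them
# confirms drop_duplicates() removed nothing:
print(df.equals(df1))  # True
```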
So then, figuring the default index column applied by the DataFrame was interfering, I tried the following:
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['1', '2', '4']})
print(df)
df1 = df.drop_duplicates(subset=["A", "B"])
print(df1)
That didn't work either. I tried a couple of other iterations of the same code involving 'keep' and 'inplace', but the result is always a dataframe identical to the original. What am I missing? I am expecting it to remove the first two rows since they are the same. Are they not? Or am I just expecting the wrong thing?
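For reference, the 'keep' and 'inplace' iterations I tried looked roughly like this (a sketch from memory, so the exact calls may have differed slightly):

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['1', '2', '4']})

# keep='first' is the default; keep=False would drop every member
# of a duplicated group instead of keeping one copy
df1 = df.drop_duplicates(keep='first')
df2 = df.drop_duplicates(keep=False)

# inplace=True modifies df directly instead of returning a new frame
df.drop_duplicates(inplace=True)
```

In every case, all three frames still contain the original three rows.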