
Hello everyone, I'm having some issues using the pandas Python library. Basically, I'm reading a CSV file with pandas and want to remove duplicate rows. I've tried everything and the problem is still there.

import sqlite3
import pandas as pd
import numpy

connection = sqlite3.connect("test.db")  # database connection (not used further below)

# read the CSV into a pandas dataframe
dataframe = pd.read_csv('Countries.csv')
# dataframe.head(3)

# keep only the two columns of interest
countries = dataframe.loc[:, ['Retailer country', 'Continent']]

countries.head(7)

Output of this will be:

  Retailer country      Continent
0    United States  North America
1           Canada  North America
2            Japan           Asia
3            Italy         Europe
4           Canada  North America
5    United States  North America
6           France         Europe

I want to drop duplicate values based on those two columns, so that I'm left with only the unique country/continent combinations. The desired output would be:

  Retailer country      Continent
0    United States  North America
1           Canada  North America
2            Japan           Asia
3            Italy         Europe
4           France         Europe

I have tried some of the methods mentioned here: Using pandas for duplicate values. After looking around the net I realized I could use the df.drop_duplicates() function, but when I use the code below and call df.head(3), it displays only one row. What can I do to get those unique rows and finally loop through them?

countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
df.head(3)
newnick988888
  • It's not clear what the last block of code is trying to do - it doesn't contain the `drop_duplicates` function you mention (which would seem to be the answer to this problem), and it seems like the only effect would be to create a new DataFrame with the `Retailer country` / `Continent` columns renamed as `a`/`b`, and all the values combined into 1 row for each column ... – Toby Petty Nov 18 '19 at 21:54
  • Have you looked into the docs? There is a `subset` parameter: `df.drop_duplicates(subset='Retailer Country')` – Umar.H Nov 18 '19 at 21:54
  • @Datanovice is correct. Just don't forget to reassign back to df. – Scott Boston Nov 18 '19 at 22:01
  • Thank you all for your advice. I understand that I can use functions like drop_duplicates, which would solve my problem, but the problem is in this line: df = pd.DataFrame({'a':[country], 'b':[continent]}). When I output this line using df.head it returned no rows, but now it throws an error "module 'pandas' has no attribute 'Dataframe'" – newnick988888 Nov 18 '19 at 22:23
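
For reference, a minimal sketch of what the comments suggest, using sample data assumed to mirror the question (Countries.csv itself isn't available here):

import pandas as pd

# sample data mirroring the question
countries = pd.DataFrame({
    'Retailer country': ['United States', 'Canada', 'Japan', 'Italy',
                         'Canada', 'United States', 'France'],
    'Continent': ['North America', 'North America', 'Asia', 'Europe',
                  'North America', 'North America', 'Europe']})

# keep only the unique country/continent combinations and reassign the result
unique_countries = countries.drop_duplicates(
    subset=['Retailer country', 'Continent']).reset_index(drop=True)

# loop through the unique rows
for _, row in unique_countries.iterrows():
    print(row['Retailer country'], row['Continent'])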

1 Answer


It seems like a simple group-by could solve your problem.

import pandas as pd

na = 'North America'
a = 'Asia'
e = 'Europe'

# recreate the data from the question
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
                   'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
                   'continent': [na, na, a, e, na, na, e]})

# group by the two columns; each unique country/continent pair becomes one row
df.groupby(['country', 'continent']).agg('count').reset_index()

[Image: dataframe after group by]

The Retailer column now shows a count of the number of times each country/continent combination occurs. You can drop it with `df = df[['country', 'continent']]`.
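
Putting that together, a minimal sketch that continues from the df defined above (the name unique_pairs is just for illustration):

# group by the two key columns, then keep only those columns,
# leaving one row per unique country/continent pair
unique_pairs = (df.groupby(['country', 'continent'])
                  .agg('count')
                  .reset_index()[['country', 'continent']])
print(unique_pairs)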

Robert King