Trouble creating a subset dataframe

Question

I am trying to better understand Python and why I am receiving an error.

I have a dataframe with country names and I want to filter the dataset to only show those that have no duplicates. I entered:

df[df['Country Name'].value_counts() == 1]

However, I get an error

Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match

It seems that [df['Country Name'].value_counts() == 1] creates a list that also shows the country names along with the boolean, and not simply the boolean as I was expecting.

Also, I tried filtering only on one country, i.e., df[df['Country Name'] == 'United States'] and this works perfectly fine.

I'm just trying to understand why in one scenario it works and in the other it doesn't. I do notice in the latter that there is an index starting at 0, so perhaps that is why. I just don't understand why I don't receive this same index in the previous example.

Can somebody help with an explanation?

Hey, checkout the [how to ask a good question page](https://stackoverflow.com/help/how-to-ask). With an reproducible example much more people will be willing to help. Help us to help you. — Arthur Harduim, Dec 09 '20 at 02:06
please show us a small part of your dataframe and an input/output example — Pablo C, Dec 09 '20 at 02:42

score 3 · Answer 1 · answered Dec 09 '20 at 03:01

Your solution doesn't work because the resulting dataframe is shorter than the original, they have to be of the same length, then it can filter row by row depending of the boolean values.

Also, I'm pretty sure you're actually looking for pandas.DataFrame.drop_duplicates:

df.drop_duplicates(subset=['Country Name'], keep = False)

It literally drops duplicate values, in this case you drop by 'Country Name' and you don't want to keep neither the first or the last occurrence, which are the other options for keep, then keep = False.

Documentation here.

Prateek · Answer 2 · 2020-12-09T03:24:35.837

Here's an explanation..

You've provided for country name..

df[df['Country Name'] == 'United States']

Let's split this,

df['Country Name'] == 'United States'

gives you a series with as many values as the length of the original dataframe with boolean data.

Now, when you do

df[df['Country Name'] == 'United States']

you'll get the data frame containing only 'United States' because pandas directly compare the boolean and returns the row with a 'True'.

Now for value counts..

df[df['Country Name'].value_counts() == 1]

split this,

df['Country Name'].value_counts() == 1

will return only the unique country names and if their count is == 1 in boolean format. If you check the length, it doesn't match the original length of the df.

Now, when you try to subset the dataframe, you get the error you're getting.

The solution. Drop the countries appearing more than once like Pablo mentioned in his answer (I haven't tried it. Mind the keep = False). Or, try the below..

If you want the row with countries that are appearing only once, you can try the map way..

df[df['Country Name'].map(df['Country Name'].value_counts()) == 1]

This will return the data frame with countries that are appearing exactly once.

Or

df[df['Country Name'].isin(df['Country Name'].value_counts()[df['Country Name'].value_counts()==1].index)]

score 0 · Answer 3 · answered Dec 09 '20 at 03:23

Try this -

Sample data for dataframe

>>> df = pd.DataFrame ({"Country Name": ["United States", "Canada", "Spain", "France", "India", "Greece", "United States","Canada"], "Exports": ["beef","corn","cattle","cheese","cattle","cheese","pork","maple syrup"]})

Display dataframe

>>> df
    Country Name      Exports
0  United States         beef
1         Canada         corn
2          Spain       cattle
3         France       cheese
4          India       cattle
5         Greece       cheese
6  United States         pork
7         Canada  maple syrup

Use groupby() in addition to count() to return counts by "Country Name"

>>> df.groupby("Country Name")["Country Name"].count()
Country Name
Canada           2
France           1
Greece           1
India            1
Spain            1
United States    2
Name: Country Name, dtype: int64

Display only the row count() == 1

>>> df[df['Country Name'].map(df.groupby('Country Name')["Country Name"].count()) == 1]
  Country Name Exports
2        Spain  cattle
3       France  cheese
4        India  cattle
5       Greece  cheese

Trouble creating a subset dataframe

3 Answers3