Duplicates when appending string to list from dataframe with common column value

Question

Beginner here, I am trying to isolate the names of neighborhoods from a dataframe of Toronto based on a cluster value I've assigned them. Instead of a list of 3 unique items, I end up with a list 2363 items long.

Neigh_List = []
for n in toronto_merged['Cluster Labels']:
    if n == 7:
        x = toronto_merged['Neighborhood']
        Neigh_List.append(x) if x not in Neigh_List else None

Neigh_List

[0                                                                                                Parkwoods
 1                                                                                                Parkwoods
 2                                                                                         Victoria Village
 3                                                                                         Victoria Village
 4                                                                                         Victoria Village
                                                        ...                                                
 2359    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2360    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2361    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2362    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2363    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 Name: Neighborhood, Length: 2364, dtype: object]

Using the advice (which I ran into but wasn't understanding until it was applied here) my code looks like this. Thank you! neigh_list = set() for a in toronto_merged['Cluster Labels']: if a == 7: for a in toronto_merged['Neighborhood']: neigh_list.add(x) neigh_list — Drakosfire, May 13 '21 at 23:13

jrbergen · Accepted Answer · 2021-05-13T18:40:48.127

2

In general, looping over pandas DataFrames should be avoided for larger datasets (roughly 1,000+ rows), as pandas' built-in vectorized functions are often faster (see this other Stack Overflow post).

You could try something like:

neigh_list = list(toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, 'Neighborhood'].unique())
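
A self-contained sketch of this kind of selection, assuming a small made-up frame that only borrows the two column names from the question (the rows are hypothetical, not the real Toronto data):

```python
import pandas as pd

# Hypothetical stand-in for toronto_merged; only the column names come from the question.
toronto_merged = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village", "Mimico NW", "Mimico NW"],
    "Cluster Labels": [7, 7, 3, 7, 7],
})

# Boolean mask: True for the rows whose cluster label is 7.
mask = toronto_merged["Cluster Labels"] == 7

# Pick the Neighborhood column for those rows only, then drop repeats.
neigh_list = list(toronto_merged.loc[mask, "Neighborhood"].unique())
print(neigh_list)
```

The filter is applied to the label column and the names are read from the name column, so each name is tied to the label on its own row.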

Additionally, if you want to avoid duplicates in a list, you could use Python sets (see section 5.4 of the Python tutorial at the time of writing).

unique_elements = set()
for x in some_iterable:
    unique_elements.add(x)

Or, using a set comprehension:

unique_elements = {unique_item for unique_item in some_iterable}
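
Applied to the question, the loop has to walk the labels and the names together. The loop in the question appends the whole Neighborhood column on a match rather than the name on the matching row. Below is a rough sketch with plain lists standing in for the two columns (the values are made up); with the real frame you would presumably pass toronto_merged['Cluster Labels'] and toronto_merged['Neighborhood'] to zip instead.

```python
# Plain lists standing in for the two DataFrame columns (made-up values).
cluster_labels = [7, 7, 3, 7, 7]
neighborhoods = ["Parkwoods", "Parkwoods", "Victoria Village", "Mimico NW", "Mimico NW"]

# zip pairs each label with the name on the same row.
unique_names = set()
for label, name in zip(cluster_labels, neighborhoods):
    if label == 7:
        unique_names.add(name)  # adding a name that is already present changes nothing

# Sets are unordered, so sort before printing for a stable display.
print(sorted(unique_names))
```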

edited May 13 '21 at 18:40

answered May 13 '21 at 18:37

jrbergen


Yes you were just ahead of me :) (and I sneakily made some edits/corrected some mistakes after having posted as well). Quite amazing (in a good way) that it's often harder to be the first to answer than to have your own questions answered. – jrbergen May 13 '21 at 18:49

You went an extra mile though ;) – Prayson W. Daniel May 13 '21 at 18:51

Prayson W. Daniel · Answer 2 · 2021-05-13T18:39:52.510

1

Have you tried using pandas' own power? Select all rows where Cluster Labels equals 7, then get the unique Neighborhood values.


...
Neigh_List = toronto_merged.loc[lambda d: d['Cluster Labels'].eq(7)]['Neighborhood'].unique().tolist()

# instead of .unique(), you can also use .drop_duplicates(), which returns a Series and may be faster
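
For comparison, here is a sketch of both variants side by side on a small made-up frame (only the column names come from the question). Which one is faster likely depends on the data and the pandas version, so timing both on the real frame is the safer way to decide.

```python
import pandas as pd

# Hypothetical stand-in for toronto_merged; the rows are invented for illustration.
toronto_merged = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village", "Mimico NW", "Mimico NW"],
    "Cluster Labels": [7, 7, 3, 7, 7],
})

# Rows in cluster 7, narrowed to the Neighborhood column (still a Series).
selected = toronto_merged.loc[lambda d: d["Cluster Labels"].eq(7)]["Neighborhood"]

# .unique() returns an array of distinct values in order of first appearance.
via_unique = selected.unique().tolist()

# .drop_duplicates() returns a Series that keeps the first occurrence of each value.
via_drop = selected.drop_duplicates().tolist()

print(via_unique)
print(via_drop)
```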

edited May 13 '21 at 18:39

answered May 13 '21 at 18:34

Prayson W. Daniel

