3

Beginner here. I am trying to isolate the names of neighborhoods in a dataframe of Toronto based on a cluster value I've assigned to them. Instead of a list of 3 unique items, I end up with a list 2363 items long.

Neigh_List = []
for n in toronto_merged['Cluster Labels']:
    if n == 7:
        x = toronto_merged['Neighborhood']
        Neigh_List.append(x) if x not in Neigh_List else None

Neigh_List

[0                                                                                                Parkwoods
 1                                                                                                Parkwoods
 2                                                                                         Victoria Village
 3                                                                                         Victoria Village
 4                                                                                         Victoria Village
                                                        ...                                                
 2359    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2360    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2361    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2362    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 2363    Mimico NW , The Queensway West , South of Bloor , Kingsway Park South West , Royal York South West
 Name: Neighborhood, Length: 2364, dtype: object]
SeaBean
Drakosfire
  • Using the advice (which I had run into but didn't understand until it was applied here), my code now looks like this. Thank you! neigh_list = set() for a in toronto_merged['Cluster Labels']: if a == 7: for a in toronto_merged['Neighborhood']: neigh_list.add(x) neigh_list – Drakosfire May 13 '21 at 23:13

2 Answers

2

In general, looping over Pandas dataframes should be avoided for larger datasets (roughly 1000+ rows), as Pandas' built-in vectorized functions are often faster (see this other Stack Overflow post).

You could try something like:

neigh_list = list(toronto_merged.loc[toronto_merged['Cluster Labels'] == 7, 'Neighborhood'].unique())
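
As a minimal sketch of that filter-then-deduplicate idea, here is the same selection on a tiny stand-in for `toronto_merged`. The rows are made up for illustration and are not the asker's data; only the column names come from the question.

```python
import pandas as pd

# Hypothetical stand-in for toronto_merged; the values are made up.
toronto_merged = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Victoria Village", "Mimico NW", "Mimico NW"],
    "Cluster Labels": [7, 7, 3, 7, 7],
})

# Boolean mask: True for rows whose cluster label is 7.
in_cluster_7 = toronto_merged["Cluster Labels"] == 7

# Pick the Neighborhood column for those rows, then drop repeated names.
neigh_list = toronto_merged.loc[in_cluster_7, "Neighborhood"].unique().tolist()

print(neigh_list)  # ['Parkwoods', 'Mimico NW']
```

The mask and the column are passed to `.loc` together, so the row filter and the column selection happen in one step.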

Additionally, if you want to avoid duplicates in a list, you could use Python sets (see section 5.4 of the Python tutorial at the time of writing).

unique_elements = set()
for x in some_iterable:
    unique_elements.add(x)

Or, using a set comprehension:

unique_elements = {unique_item for unique_item in some_iterable}
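
If you do want to keep an explicit loop, the label and the name have to be read from the same row. A rough sketch, with plain lists standing in for the two dataframe columns (the values are made up):

```python
# Hypothetical stand-ins for the two dataframe columns; the values are made up.
cluster_labels = [7, 7, 3, 7, 7]
neighborhoods = ["Parkwoods", "Parkwoods", "Victoria Village", "Mimico NW", "Mimico NW"]

unique_neighborhoods = set()
# zip pairs each label with the neighborhood from the same row.
for label, name in zip(cluster_labels, neighborhoods):
    if label == 7:
        unique_neighborhoods.add(name)  # adding a name twice has no effect

# Sets are unordered, so sort when a stable order is needed.
result = sorted(unique_neighborhoods)
print(result)  # ['Mimico NW', 'Parkwoods']
```

With the real dataframe, the same loop should work with `zip(toronto_merged['Cluster Labels'], toronto_merged['Neighborhood'])`, although the vectorized version is usually preferable.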
jrbergen
  • 1
    Yes you were just ahead of me :) (and I sneakily made some edits/corrected some mistakes after having posted as well). Quite amazing (in a good way) that it's often harder to be the first to answer than to have your own questions answered. – jrbergen May 13 '21 at 18:49
  • 1
    You went an extra mile though ;) – Prayson W. Daniel May 13 '21 at 18:51
1

Have you tried using Pandas' own power? Select all rows where Cluster Labels equals 7, then get the unique Neighborhoods:


...
Neigh_List = toronto_merged.loc[lambda d: d['Cluster Labels'].eq(7)]['Neighborhood'].unique().tolist()

# Instead of .unique(), you can also use .drop_duplicates(), which may be faster
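
One practical difference between the two calls is what they return. A small sketch on a made-up column (the names are illustrative only):

```python
import pandas as pd

# Made-up column standing in for the filtered Neighborhood values.
names = pd.Series(["Parkwoods", "Parkwoods", "Mimico NW"], name="Neighborhood")

distinct_values = names.unique()         # array of distinct values, no index
distinct_rows = names.drop_duplicates()  # Series that keeps the original index labels

print(list(distinct_values))         # ['Parkwoods', 'Mimico NW']
print(distinct_rows.index.tolist())  # [0, 2]
```

Either result can be turned into a plain list, so the choice mostly matters if you need the original row labels afterwards.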
Prayson W. Daniel