0

I have a dataframe

 Unnamed geo_zones_pickups  Delstat             region
4   4    ['52102' '53101']  ['52102' '53101']   [52102, 53101]
5   5    ['52102']          ['52102']           [52102]

and those there lists:

north = ['51101', '51105', '51100']
midt = ['52103', '52102', '52104', '52101', '52150']
south = ['53101', '53102', '53106', '53103', '53105', '53104']

I now need to create a new column 'geo_region' with string value taken from one of the list names variables. Desired output will look like this:

 Unnamed geo_zones_pickups  Delstat             region          geo_region
4   4    ['52102' '53101']  ['52102' '53101']   [52102, 53101]  midt,south
5   5    ['52102']          ['52102']           [52102]         midt

So basically, whenever any of the regions is present in any of the declared list append the list name. All geographical regions have 3 first letter representing the appropriate geo_region. For norththey will always start with 511 and then followed by two other digits. For midtthey will always start with 521 and south531. I tried several option the last one with intersection which works great but I can only match on one of the lists. How can I make it more robust to handle all three lists at the same time and assign a proper value to the new column? I greatly appreciate any help.

Edit: geo_zones_pickups and Delstatare an object types with elements in it stored as string.region also an object with elements in it stored as list of strings

szuszfol
  • 117
  • 1
  • 9
  • What are the objects in `geo_zones_pickups` and `Delstat`? Lists? strings? other? Ideally provide a [minimal reproducible example](https://stackoverflow.com/q/20109391/16343464) as `DataFrame` constructor. – mozway Apr 25 '23 at 18:38

1 Answers1

1

You can match first 3 letters for each value of list by dictionary for regions in nested list comprehension:

d = {'511':'north', '521':'midt','531':'south'}

df['geo_region'] = [','.join([d[y[:3]] for y in x if y[:3] in d]) for x in df['region']]
print (df)
   Unnamed  geo_zones_pickups            Delstat          region  geo_region
4        4  ['52102' '53101']  ['52102' '53101']  [52102, 53101]  midt,south
5        5          ['52102']          ['52102']         [52102]        midt

If want match exact values from 3 lists create helper d1 dictionary and flatten:

d1 = {'north':north, 'midt':midt,'south':south}
d = {x: k for k, v in d1.items() for x in v}

df['geo_region'] = [','.join([d[y] for y in x if y in d]) for x in df['region']]
print (df)
   Unnamed  geo_zones_pickups            Delstat          region  geo_region
4        4  ['52102' '53101']  ['52102' '53101']  [52102, 53101]  midt,south
5        5          ['52102']          ['52102']         [52102]        midt
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This works wonderful! Thank you so much. One follow-up question. Some `region`will have multiple occurrences from the same `geo_region`eg [52102, 52103,53101]. This will produce `midt,midt,south`. How I can make those geo_regions unique? that regardless of how many different regions occur it only assign one label to the `geo_region`so that the output will be `midt,south`? – szuszfol Apr 26 '23 at 07:10
  • 1
    @szuszfol - Can you check `df['geo_region'] = [','.join(set([d[y] for y in x if y in d])) for x in df['region']]` if order is not important? – jezrael Apr 26 '23 at 07:18
  • 1
    @szuszfol - If order is important use `df['geo_region'] = [','.join(dict.fromkeys([d[y] for y in x if y in d])) for x in df['region']]` – jezrael Apr 26 '23 at 07:19