1

I have two data frames. First, saves coordinates and places

data = [['xx1', 1,10,'Madrid'], ['xx2',15,20,'London']]
df1 = pd.DataFrame(data, columns = ['precode', 'start', 'stop','city'])

The second

data2 = [['xx1', 3,11,'location1',], ['xx2',9,17,'location2'],['xx2',21,30,'location3'],['xx5',1,19,'location4']]
df2 = pd.DataFrame(data2, columns = ['precode', 'start', 'stop','locations'],)

I want to check if locations are in city. If so add in df2 the name of the cities

precode  start   stop     location       city
xx1      3       11       location1      Madrid
xx2      9       17       location2      Madrid;London # If more than one city affected, add as many as cities affected as found
xx2      21      30       location3      NaN   
xx5      1       19       location4      NaN

I don't even know how to start. Shall I convert column into a dict to do the checking process or can shall I itinerate values of one column to the column of the other data frame. df1 is much bigger than df2. Both data frames have more columns. The final output need to be an identical as was df2 but with the new column city.

  • whats the logic here? you can use a merge but you might need to flatten out your dataframe first. I don't understand how precode `xx1` has Madrid and precode `xx2` has London & Madrid? – Umar.H Feb 09 '22 at 11:53
  • Thanks for your reply. Precode is a identification thing, nothing related to coordinates (sorry I can't give more info). But one thing is sure, to match, both need to be the same. I though about merge but then I didn't know how to keep df2 as it was but just with the new column 'city'. – Manolo Dominguez Becerra Feb 09 '22 at 11:58

1 Answers1

2

Use cross merge then keep values within the range then aggregate cities:

df2['city'] = (
    df2.loc[df2['precode'].isin(df1['precode'])]   
       .reset_index().merge(df1, how='cross', suffixes=('', '_'))
       .query('start.between(start_, stop_) | stop.between(start_, stop_)')
       .groupby('index')['city'].apply(list).str.join(';')
)

Output:

>>> df2
  precode  start  stop  locations           city
0     xx1      3    11  location1         Madrid
1     xx2      9    17  location2  Madrid;London
2     xx2     21    30  location3            NaN
3     xx5      1    19  location4            NaN

Step by step:

>>> out = df2.loc[df2['precode'].isin(df1['precode'])]
  precode  start  stop  locations
0     xx1      3    11  location1
1     xx2      9    17  location2
2     xx2     21    30  location3

>>> out = out.reset_index().merge(df1, how='cross', suffixes=('', '_'))
   index precode  start  stop  locations precode_  start_  stop_    city
0      0     xx1      3    11  location1      xx1       1     10  Madrid
1      0     xx1      3    11  location1      xx2      15     20  London
2      1     xx2      9    17  location2      xx1       1     10  Madrid
3      1     xx2      9    17  location2      xx2      15     20  London
4      2     xx2     21    30  location3      xx1       1     10  Madrid
5      2     xx2     21    30  location3      xx2      15     20  London


>>> out = out.query('start.between(start_, stop_) | stop.between(start_, stop_)')
   index precode  start  stop  locations precode_  start_  stop_    city
0      0     xx1      3    11  location1      xx1       1     10  Madrid
2      1     xx2      9    17  location2      xx1       1     10  Madrid
3      1     xx2      9    17  location2      xx2      15     20  London


>>> out = out.groupby('index')['city'].apply(list).str.join(';')
index
0           Madrid
1    Madrid;London
Name: city, dtype: object
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • thanks! I will have a look at very soon and I will come back. Another issue I haven't mention but it doesn't matter in your suggestion is that I only can use pandas as I am working in a secure cluster and I have a very limited packages options – Manolo Dominguez Becerra Feb 09 '22 at 12:20
  • I just hope you have a recent version of Pandas... :) – Corralien Feb 09 '22 at 12:21
  • I havent implemented this yet but I can see one problem (perhaps due to my bad English) the last line of the output should be Na. If two location match in coordinates but precode is different, this shouldn't match – Manolo Dominguez Becerra Feb 09 '22 at 12:23
  • very cool - I think it's better to explode the ranges by start and stop then join - will be more performant than query. – Umar.H Feb 09 '22 at 12:28
  • I am working with your code @Corralien to make. it work with the last line of the output what needs to be ```xx5 1 19 location4 NaN``` instead of ```xx5 1 19 location4 Madrid;London``` I will accept your answer if this would be correct but I appreciate your help anyway – Manolo Dominguez Becerra Feb 09 '22 at 12:35
  • @ManuelDominguez. What I don't understand is why city of Location 4 should be 'nan' rather than city of Location 2 is 'Madrid;London' – Corralien Feb 09 '22 at 12:49
  • @Umar.H. What do you mean? Use `melt`? – Corralien Feb 09 '22 at 12:52
  • Because, if the precode is different between df1 and df2, that means they are not the same city. I am using here an imaginative case. This is for a program to be used in the army and I can't use the real data. Basically, cities are actually locations on a map, and we used different maps. To be in the right place, maps need to be equal and then coordinates need to be in range in range. Different maps have different coordinated. I can't compare different maps – Manolo Dominguez Becerra Feb 09 '22 at 12:58
  • You were right. My pandas is an earlier version than 1.4 – Manolo Dominguez Becerra Feb 09 '22 at 14:39