0

Sorry for bad English for writing the title. What I would like to do is to determine the location of df by comparing with another data frame df2. df is a data frame that shows the variation of longitude and latitude in a period of time. It is a data frame with more than 40,000 of rows.

    df=
                        Longitude Latitude     
        Time               
        2020-01-01 01:00    100.0   20.0       
        2020-01-01 01:01    100.2   20.1       
        2020-01-01 01:02    300.1   30.0       
        2020-01-01 01:03    200.1   40.0     
        2020-01-01 01:04     50.0   50.0  

df2 is the dataframe that contain the location data. It is a data frame with around 30 rows.

df2=
        Longitude Latitude   Location
    0        90.0   20.0     District A 
    1       210.0   60.0     District B

The method to determine where I am located will be done by Pythagoras theorem. Each row of data in df will compare with every row in df2 to find the shortest distance, which the shortest distance is where I am located.

Take 2020-01-01 01:01 as an example.

First, compare with df2 row 0, the calculation is [(100.2-90.0)^2+(20.1-20.0)^2]^0.5.

Second, compare with df2 row 1, the calculation is [(100.2-210.0)^2+(20.1-60.0)^2]^0.5

Since the distance between the location and District A is smaller than the distance between the location and District B, therefore the location on 2020-01-01 01:01 will be District A. The expected output is given as follows.

df=
                            Longitude Latitude  Location   
            Time               
            2020-01-01 01:00    100.0   20.0    District A   
            2020-01-01 01:01    100.2   20.1    District A   
            2020-01-01 01:02    300.1   30.0    District B   
            2020-01-01 01:03    200.1   40.0    District B 
            2020-01-01 01:04     50.0   50.0    District A

My approach: To get my expected result first I will create a df['Loaction'] column full of NaN, then I will do the calculation and extract the df2['Location'] to substitute the NaN value in df['Loaction'].

However, what I am confused about is how to get the calculation results. Are there any related pandas function or NumPy function that can compare a row from df with multiple rows from df2? Is there a better approach? Thank you!

soso
  • 49
  • 5

1 Answers1

3

I am ignoring the time column since no operation is required.

d = {'Longitude': [100.0,100.2,300.1,200.1,50.0], 'Latitude': [20.0 , 20.1,30.0,40.0, 50.0 ]}
df = pd.DataFrame(data=d)

d2 = {'Longitude':[90.0, 210.0], 'Latitude':[20.0, 60.0], 'Location':['District A', 'District B']}
df2 = pd.DataFrame(data=d2)

A lambda function may be applied row wise on df since we are comparing each row of df with all the rows of df2. It is not a one-to-one mapping.

So, at first I try to find the minimum distance and then get the 'Location' value from df2. Note that I have avoided the taking the sqrt value in the distance calculation to simplify solution. It should not matter anyway since only the minimum is required.

df.apply(lambda x: df2['Location'][(x[0] - df2['Longitude'])**2 + (x[1] - df2['Latitude'])**2 == min((x[0] - df2['Longitude'])**2 + (x[1] - df2['Latitude'])**2)].iloc[0], axis=1)

This should produce the following output.

df.apply(lambda x: df2['Location'][(x[0] - df2['Longitude'])**2 + (x[1] - df2['Latitude'])**2 == min((x[0] - df2['Longitude'])**2 + (x[1] - df2['Latitude'])**2)].iloc[0], axis=1)
0    District A
1    District A
2    District B
3    District B
4    District A
dtype: object

If you add the final column to df, then it looks as follows.

>>> df['Location'] = df.apply(lambda x: df2['Location'][(x[0] - df2['Longitude'])**2 + (x[1] - df2['Latitude'])**2 == min((x[0] - df2['Longitude'])**2 + (x[1] - df2['Latitude'])**2)].iloc[0], axis=1)
>>> 
>>> df
   Longitude  Latitude    Location
0      100.0      20.0  District A
1      100.2      20.1  District A
2      300.1      30.0  District B
3      200.1      40.0  District B
4       50.0      50.0  District A

  • If the solution is satisfactory, also please accept the answer as correct. It is the tick mark below the upvote button. – basicknowledge Dec 22 '20 at 19:08
  • I have tried your example it works perfectly. But when I apply to my code it gives an error: A value is trying to be set on a copy of a slice from a DataFrame. All the df['Location'] is changed to 'District A'. I have researched for a while but I don't understand how lambda is not working due to my df is a copy – soso Dec 23 '20 at 06:14
  • I am afraid, I would need more details such as the code you have written. However, this seems like a selection is performed on a data frame which has returned a copy and then when we try to set on this copy, the error is returned. Did you check this thread? https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – basicknowledge Dec 23 '20 at 09:20
  • Also, if the answer solved the question posted, please accept the answer as correct. – basicknowledge Dec 23 '20 at 09:27