import pandas as pd
# Named `data` rather than `dict` to avoid shadowing the built-in type
data = {'Origin Region': [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 5.0],
        'Origin Latitude': [-36.45875, -36.24879, -36.789456, -38.14789, -36.15963, -36.159455, -36.2345, -36.12745],
        'Origin Longitude': [145.14563, 145.15987, 145.87456, 146.75314, 145.75483, 145.78458, 145.123654, 145.11111]}

df = pd.DataFrame(data)

centres_dict = {'Origin Region': [1.0, 2.0, 3.0, 4.0, 5.0],
        'Origin Latitude': [-36.25361, -36.78541, -36.74859, -38.74123, -36.14538],
        'Origin Longitude': [145.12345, 145.36241, 145.12365, 146.75314, 145.75483]}

centres_df = pd.DataFrame(centres_dict)

grouped_region = df.groupby('Origin Region')
for region, region_group in grouped_region:
    outliers = region_group[['Origin Latitude', 'Origin Longitude']].where((region_group['Origin Latitude'] < -36.15))
    outliers.dropna(inplace=True)
    print(outliers)
    if not outliers.empty:  # `~` on a Python bool is always truthy, so use `not`
        for index, outlier_value in outliers.iterrows():
            for another_index, centre_value in centres_df.iterrows():
                a = outlier_value['Origin Longitude']
                b = outlier_value['Origin Latitude']
                c = centre_value['Origin Longitude']
                d = centre_value['Origin Latitude']
                #find distance using the above and then find minimum distance

I am trying to loop through each group of a dataframe (df), filter the values in each group based on some condition, and then compute the distance between each of these filtered values (outliers) and all the values in another dataframe (centres_df).

I have the data in dataframes. Should I convert them into arrays and use scipy's cdist to calculate the distances, write a loop that calls my own distance function, or use apply with my own distance function? I am not sure which is the best way to do this.

Vandhana
  • There does not appear to be a singular *Outlier Dataframe* as you run inside a `groupby` loop. – Parfait Oct 02 '18 at 20:40
  • I would like to calculate outlier for each group in the grouped_region. And distance for each outlier in each of these groups with all points in the centres dataframe. – Vandhana Oct 02 '18 at 20:43
  • Please post a [MCVE] including sample data, compilable code with all needed `import` lines and assignments like `haversine()` that we can run in an empty Python environment, and desired output. See also [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Parfait Oct 02 '18 at 20:56
  • Hi Parfait, I tried my best to come up with minimal and reproducible code. – Vandhana Oct 03 '18 at 08:34

1 Answer

There is no need for nested loops. Inside the groupby loop, join each group's outliers to the centres data frame and calculate the distance across columns. Store each result in a dictionary of data frames and, at the end, concatenate them into a single data frame.

However, to vectorize the process, the usual Python haversine formula, which relies on the built-in math library and therefore works on scalars only, has to be rewritten with NumPy.
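For reference, a scalar version of the formula might look like the sketch below. The function name `haversine_scalar` is only illustrative. Because the `math` functions accept single numbers, it handles one pair of points per call, and passing a list of coordinates raises a `TypeError`.

```python
import math

def haversine_scalar(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    # math.radians only accepts a single number, not a list or array
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))  # 6371 km = Earth radius

# One pair of points per call: the region 5 outlier and the region 5 centre
d = haversine_scalar(145.75483, -36.15963, 145.75483, -36.14538)
```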

Numpy version of haversine formula (receiving arrays/series not scalars as inputs)

import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles

    return c * r

Pandas process

# SET ORIGIN REGION AS INDEX (FOR LATER JOIN)
centres_df = centres_df.set_index('Origin Region')

df_dict = {}
grouped_region = df.sort_values('Origin Region').groupby('Origin Region')

for region, region_group in grouped_region:   
    # BUILD OUTLIER DF WITH Origin_Region as INDEX 
    outliers = region_group[['Origin Latitude', 'Origin Longitude']]\
                     .where((region_group['Origin Latitude'] < -36.15))\
                     .dropna()\
                     .assign(Origin_Region = region)\
                     .set_index('Origin_Region')

    # JOIN OUTLIERS WITH CENTRES DF, KEEPING ONLY MATCHED ROWS
    outliers = outliers.join(centres_df, how='inner', lsuffix='', rsuffix='_')

    # RUN CALCULATION (SEE NUMPY-IFIED haversine())
    outliers['Distance_km'] = haversine_np(outliers['Origin Longitude'], outliers['Origin Latitude'],
                                           outliers['Origin Longitude_'], outliers['Origin Latitude_'])

    outliers['Origin Region'] = region

    # ASSIGN TO DICTIONARY, RE-ORDERING COLUMNS
    df_dict[region] = outliers.reindex(outliers.columns[[5,0,1,2,3,4]], axis='columns')

# CONCATENATE OUTSIDE LOOP FOR SINGLE OBJECT
final_df = pd.concat(df_dict, ignore_index=True)

Output

print(final_df)

#    Origin Region  Origin Latitude  Origin Longitude  Origin Latitude_  Origin Longitude_  Distance_km
# 0            1.0       -36.458750        145.145630         -36.25361          145.12345    22.896839
# 1            1.0       -36.159455        145.784580         -36.25361          145.12345    60.234887
# 2            2.0       -36.248790        145.159870         -36.78541          145.36241    62.354177
# 3            2.0       -36.234500        145.123654         -36.78541          145.36241    64.868402
# 4            3.0       -36.789456        145.874560         -36.74859          145.12365    67.040011
# 5            4.0       -38.147890        146.753140         -38.74123          146.75314    65.976398
# 6            5.0       -36.159630        145.754830         -36.14538          145.75483     1.584528
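If every outlier has to be compared with all centres rather than only the centre of its own region, one possible approach is NumPy broadcasting. The sketch below is self-contained and makes a few assumptions: the threshold is the same for every region, so the filter is applied once to the whole frame, and the names `dist_df`, `Nearest_Centre` and `Min_Distance_km` are illustrative. A region-specific condition could be applied inside the groupby loop instead, with the same broadcasting step.

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs are arrays in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 6371 * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    'Origin Region': [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 5.0],
    'Origin Latitude': [-36.45875, -36.24879, -36.789456, -38.14789,
                        -36.15963, -36.159455, -36.2345, -36.12745],
    'Origin Longitude': [145.14563, 145.15987, 145.87456, 146.75314,
                         145.75483, 145.78458, 145.123654, 145.11111]})

centres_df = pd.DataFrame({
    'Origin Region': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Origin Latitude': [-36.25361, -36.78541, -36.74859, -38.74123, -36.14538],
    'Origin Longitude': [145.12345, 145.36241, 145.12365, 146.75314, 145.75483]})

# Same filter as in the loop, applied to all regions at once
outliers = df[df['Origin Latitude'] < -36.15]

# Column vectors (n_outliers, 1) against row vectors (n_centres,)
# broadcast to a full (n_outliers, n_centres) distance matrix
dist = haversine_np(
    outliers['Origin Longitude'].to_numpy()[:, None],
    outliers['Origin Latitude'].to_numpy()[:, None],
    centres_df['Origin Longitude'].to_numpy(),
    centres_df['Origin Latitude'].to_numpy(),
)

# One row per outlier, one column per centre region
dist_df = pd.DataFrame(dist, index=outliers.index,
                       columns=centres_df['Origin Region'])

# Closest centre and its distance for each outlier
nearest = outliers.assign(
    Nearest_Centre=dist_df.idxmin(axis=1),
    Min_Distance_km=dist_df.min(axis=1),
)
```

With this layout a region with 2 outliers contributes 2 rows of 5 distances each, and `dist_df.loc[outliers['Origin Region'] == 1.0]` selects the rows for region 1. `scipy.spatial.distance.cdist` would produce a matrix of the same shape, but its built-in metrics do not include haversine, so a custom callable would be needed there.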
Parfait
  • What I need, for each group, is the number of outliers times the number of points in centres_df. If group 1 has 2 outliers and centres_df has a fixed 5 points, I need 2x5 = 10 distances. – Vandhana Oct 04 '18 at 06:06
  • See edit with adjustment to *outliers* build inside loop and using a numpy-ified version of `haversine()`. – Parfait Oct 04 '18 at 14:10