
I have a Dask DataFrame with sets of latitudes and longitudes (~32m records). I am trying to calculate the distance between each lat/lon pair using a function like the one below:

import numpy as np
from geopy import distance

def calc_distance(df, lat_col_name_1, lon_col_name_1, lat_col_name_2, lon_col_name_2):
    if df[lat_col_name_1] != np.nan and df[lon_col_name_1] != np.nan and df[lat_col_name_2] != np.nan and df[lon_col_name_2] != np.nan:
        return distance.distance((df[lat_col_name_1], df[lon_col_name_1]), (df[lat_col_name_2], df[lon_col_name_2])).miles
    else:
        return np.nan

I have tried calling this function using map_partitions (both to create a DataFrame of index and distance, and by calling map_partitions with assign). I would like to use assign so I can avoid joining the DataFrames back together (which seems costly). It does not like the np.nan checks; I get a

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I have records with null lat/lon so I need to account for that when calculating the distance.

Using map_partitions

distance = big_df.map_partitions(calc_distance, 
                                    lat_col_name_1='latitude_1', 
                                    lon_col_name_1='longitude_1', 
                                    lat_col_name_2='latitude_2', 
                                    lon_col_name_2='longitude_2', 
                                    meta={'distance': np.float64})

Using map_partitions and assign

def calc_distance_miles(lat1, lon1, lat2, lon2):
    if lat1 != np.nan and lon1 != np.nan and lat2 != np.nan and lon2 != np.nan:
        return distance.distance((lat1, lon1), (lat2, lon2)).miles
    else:
        return np.nan
    

big_df = big_df.map_partitions(lambda df: df.assign(
    distance=calc_distance_miles(df['latitude_1'], df['longitude_1'], df['latitude_2'], df['longitude_2'])
), meta={'distance': np.float64}
)
  • be careful with boolean operators with `np.nan`. NaN is never equal to anything; note that `np.nan != np.nan` evaluates to `True`, so your test doesn't do anything. Instead, use `pd.isnull()` or the `isnull` methods on DataFrame and Series. See [the pandas docs on working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) – Michael Delgado Nov 10 '21 at 16:58
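
A quick, standalone illustration of the NaN behavior described in the comment above:

import numpy as np
import pandas as pd

# NaN compares unequal to everything, including itself,
# so every `x != np.nan` check in the question evaluates to True.
print(np.nan != np.nan)  # True
print(np.nan == np.nan)  # False

# Use pd.isnull / Series.isnull instead:
s = pd.Series([40.7, np.nan])
print(s.isnull().tolist())  # [False, True]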

1 Answer


map_partitions isn't like df.apply - the function calc_distance is being called with a partition of the dask.dataframe, which has type pd.DataFrame.

Therefore, df[lat_col_name_1] is a Series, and df[lat_col_name_1] != np.nan is a boolean Series, which always raises this error when used in an if statement (see e.g. Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()).
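
A minimal reproduction of that, with hypothetical values:

import numpy as np
import pandas as pd

lat = pd.Series([40.7, np.nan, 34.0])

# The comparison returns a boolean Series (all True, since NaN != NaN):
mask = lat != np.nan

# Using that Series where a single bool is expected raises the error:
# if mask: ...  # ValueError: The truth value of a Series is ambiguous.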

There are faster, array-based ways of computing distances than element-wise (see the vectorized sketch after the comments below), but the dask.dataframe analogue to what you're trying to do is to use map_partitions and then apply:

def calc_distance(series, lat_col_name_1, lon_col_name_1, lat_col_name_2, lon_col_name_2):
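    # `series` is a single row here: df.apply(..., axis=1) passes each row as a pd.Series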
    if series[
        [lat_col_name_1, lon_col_name_1, lat_col_name_2, lon_col_name_2]
    ].notnull().all():

        return distance.distance(
            (series[lat_col_name_1], series[lon_col_name_1]),
            (series[lat_col_name_2], series[lon_col_name_2]),
        ).miles

    else:
        return np.nan 

def calc_distance_df(df, **kwargs):
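    # each partition is a plain pd.DataFrame, so the row-wise pandas apply works here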
    return df.apply(calc_distance, axis=1, **kwargs)

distances = big_df.map_partitions(
    calc_distance_df,
    meta=np.float64,
    lat_col_name_1=lat_col_name_1,
    lon_col_name_1=lon_col_name_1,
    lat_col_name_2=lat_col_name_2,
    lon_col_name_2=lon_col_name_2,
)
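
If you want the result attached back onto the frame (as in your assign attempt), you should be able to assign the derived series directly, since distances was produced from big_df and keeps its partitioning (a sketch, not a guarantee for arbitrary series):

big_df['distance'] = distances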
  • Thank you for the explanation. Unfortunately, this did not quite work. I got this error when porting this in: "None of [Index(['latitude_1', 'longitude_1', 'latitude_2',\n 'longitude_2'],\n dtype='object')] are in the [index]" Could you also explain a little bit what you mean by faster ways than calculating element-wise? – m.will325 Nov 10 '21 at 16:35
  • oops - sorry I used the wrong axis argument. should have been axis=1, not axis=0. try again. – Michael Delgado Nov 10 '21 at 16:40
  • it's significantly faster to do any sort of operation in pandas using arrays. you're looping over the individual rows in the dataframe (pd.apply is just a for loop) and then calculating distance between the elements in each row. You could use a vectorized algorithm that calculates distances between the points in each row for all rows in the dataframe at once - this will be significantly faster. not sure if geopy supports this, but you could use geopandas or another vectorized library. – Michael Delgado Nov 10 '21 at 16:42
  • Thank you Michael. I am testing now. I originally had this in pandas looping over chunksizes, but it took way too long to run (a couple of hours for 32m records). I used the below: `chunk['distance_miles'] = np.vectorize(calc_distance_miles)(chunk['point1'], chunk['point2'])` where the points were tuples of the lat/lon. This was much faster than applying the function, but still too slow. Is that what you mean by vectorizing? – m.will325 Nov 10 '21 at 17:22
  • actually np.vectorize is similar in that it's just looping over the elements. I mean something working with the entire vector at the same time, as in this question: [vectorizing haversine distance calculation in python](https://stackoverflow.com/questions/34502254/vectorizing-haversine-distance-calculation-in-python) – Michael Delgado Nov 10 '21 at 18:10
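
For reference, here is a minimal sketch of the vectorized approach from the last comment: a NumPy haversine that works on whole columns at once. It assumes a spherical Earth (so the numbers will differ slightly from geopy's geodesic distances), and the column names are taken from the question:

import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius in miles

def haversine_miles(lat1, lon1, lat2, lon2):
    # Operates on entire arrays/Series at once, with no per-row Python loop.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

# NaNs propagate through the arithmetic, so rows with null lat/lon
# come out as NaN distances without any explicit check.
big_df['distance'] = big_df.map_partitions(
    lambda df: haversine_miles(
        df['latitude_1'], df['longitude_1'],
        df['latitude_2'], df['longitude_2'],
    ),
    meta=('distance', 'f8'),
)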