
One thing I keep hearing in the pandas community here is to avoid apply. I'd love to avoid apply; I just have no idea how to avoid it in a scenario like the one below.

I have data listing the locations of bank branches all over the country. I also have a list of branches for which I want to calculate market shares within a 10-mile radius.

To do this (perhaps because I still somewhat think in Java terms), I iterate through each branch location: for each one, I get the branches near it, then calculate the market share of that branch (and of branches owned by the same firm) relative to the totals in that area, roughly as sketched below.
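
In loop terms, the intent is just this (a sketch only; the real entry point is the apply call further down):

results = []
for _, row in branches.iterrows():
    results.append(aggregate(row))  # dict of market-share figures for this branch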

import numpy as np
import pandas as pd


def haversine(lat1, lon1, lat2, lon2):
    """ Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    Source: https://stackoverflow.com/questions/29545704/
            fast-haversine-approximation-python-pandas """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    miles = 6367 * c * 0.621371  # 6367 km earth radius; 0.621371 converts km to miles
    return miles
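
Since haversine only uses numpy operations, it already works elementwise when lat2/lon2 are Series rather than scalars (that is how aggregate calls it below). A quick check against coordinates taken from the sample data at the end of the question:

pts = pd.DataFrame({'lat': [32.504207, 32.546868], 'long': [-94.729451, -94.732487]})
print(haversine(32.6756, -94.5709, pts['lat'], pts['long']))  # Series of distances in miles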


def get_near(df, lat, long, mi):
    ''' Subsets the data to a bounding box slightly larger than the circle
    defined by the radius, but much smaller than the full dataset, so the
    exact distance calculation only has to run on nearby rows. '''
    degrees_diff = mi / haversine(lat, 1, lat, 0)  # miles-per-degree of longitude changes with latitude
    # a dynamic conversion is more accurate here; a flat 69 mi/deg only holds for latitude

    lat_fuzz = mi / 69 * 1.2
    long_fuzz = degrees_diff * 1.2
    # The fuzz values are half the side length of a bounding box around the point;
    # the 1.2 factor pads the box a little so the full circle is safely inside it.

    df = df[np.abs(df['lat'] - lat) < lat_fuzz]
    df = df[np.abs(df['long'] - long) < long_fuzz]
    return df
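
To make the prefilter concrete, it gets used like this for the example observation discussed at the end of the question (assuming DIST = 10, i.e. the 10-mile radius, and all_branches being the full 1.6-million-row table):

DIST = 10
box = get_near(all_branches.query('YEAR == 1998'), 32.6756, -94.5709, DIST)
# box is the bounding-box subset shown in the sample table further down; it can
# still contain rows outside the 10-mile circle, which aggregate later removes
# using the exact haversine distance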


def aggregate(row):
    cert = row['CERT']
    year = row['YEAR']
    lat = row['lat']
    long = row['long']

    # subset to the things near enough for us to care about
    near_set = get_near(all_branches.query('YEAR == @year'), lat, long, DIST)

    # get real distance and subset
    near_set['dist'] = haversine(lat, long, near_set['lat'], near_set['long'])
    mkt_set = near_set.query('dist < @DIST')

    # get set of certs
    cert_set = mkt_set.query('CERT == @cert')
    cert_dep = cert_set['DEPSUMBR'].sum()
    cert_brs = cert_set['BRNUM'].unique().size

    # calculate mkt share percentages
    set_dep = mkt_set['DEPSUMBR'].sum()
    set_brs = len(mkt_set.drop_duplicates())
    set_bk_count = mkt_set['CERT'].unique().size
    return {
        'dep_mktsh': (cert_dep / set_dep) if set_dep != 0 else np.nan,
        'brs_mktsh': cert_brs / set_brs if set_brs != 0 else np.nan,
        'mkt_brs': set_brs,
        'bk_ct': set_bk_count
    }

This is started from the following entry point:

branches.apply(lambda r: pd.Series(aggregate(r)), axis=1)

How can I refactor the apply out of the entry point to these functions?


For some data context: given an observation with CERT 25272, lat 32.6756, long -94.5709, and YEAR 1998, the calculation is dispatched against the whole set of branches. Below, I've subset that list of 1.6 million rows down to the ones that are relevant here:

          CERT  YEAR  BRNUM        lat       long  DEPSUMBR
1507794   3310  1998      0  32.504207 -94.729451    128656
1507795   3310  1998      1  32.546868 -94.732487      5632
1519246   5553  1998     15  32.545481 -94.366318    116412
1519247   5553  1998     16  32.509617 -94.738234     39609
1532146  11140  1998      0  32.504674 -94.579482     17929
1532147  11140  1998      1  32.512571 -94.710928      2395
1539969  15374  1998      7  32.549888 -94.380500      2202
1546345  17027  1998      0  32.800508 -94.719626     26590
1546346  17027  1998      1  32.724252 -94.704926      2338
1546347  17027  1998      2  32.800508 -94.719626         0
1548642  17842  1998      3  32.551573 -94.757148    149396
1548643  17842  1998      4  32.551573 -94.757148         0
1548644  17842  1998      5  32.551573 -94.757148         0
1553923  20707  1998      1  32.515676 -94.768812      3075
1557431  23261  1998      1  32.537950 -94.764513     38840
1557432  23261  1998      2  32.588054 -94.738605      1417
1557743  23420  1998      2  32.504207 -94.729451      9655
1559350  25272  1998      0  32.540713 -94.367709    111082
1559352  25272  1998      2  32.675641 -94.570889      4606
1559378  25294  1998      4  32.542073 -94.740780      7373
1559575  25752  1998      0  32.505406 -94.661607     20828
1562959  27474  1998     71  32.545895 -94.367224     89796
1563070  27474  1998    249  32.543712 -94.747411     89492
1564490  28066  1998      3  32.519644 -94.741130     14465
1564493  28066  1998      6  32.542174 -94.736346       422
1567824  29221  1998      1  32.540258 -94.754324     16969
1567825  29221  1998      2  32.503581 -94.688224      7102

Note that this is the result of the get_near function and the all_branches.query invocation, which have already subset the full list of branches down to those that are reasonably close by coordinates and in the correct year. From there, the rest of the function yields:

dep_mktsh    0.663306
brs_mktsh    0.500000
mkt_brs      2.000000
bk_ct        2.000000
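
For reference, the same figures come back from calling aggregate directly on a hand-built Series for that observation (this relies on all_branches and DIST = 10 being set up as above):

row = pd.Series({'CERT': 25272, 'YEAR': 1998, 'lat': 32.6756, 'long': -94.5709})
aggregate(row)
# {'dep_mktsh': 0.663..., 'brs_mktsh': 0.5, 'mkt_brs': 2, 'bk_ct': 2}
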
  • A small example dataset of 5-10 rows would clear things up a lot, since we can visually see what you try to accomplish, rather than read through your code. – Erfan Apr 30 '19 at 18:09
  • This is difficult, as the origin for the distance calculation changes for each branch in branches, forcing you to do N_branch separate calculations so apply (or at least a loop) seems reasonable. How large are these datasets? It is possible with a "dumb" cartesian product `merge`, though you might run into Memory problems. – ALollz Apr 30 '19 at 18:12
  • @Erfan coming right up; @ALollz Thousands, I thought about doing it like that, but the reason I have the `get_near` function is so I don't have to do that sort of stuff across the whole dataset – ifly6 Apr 30 '19 at 18:17
  • @ALollz The `all_branches` dataset has 1.6 million entries – ifly6 Apr 30 '19 at 18:25
