One thing I keep hearing in the pandas community here is to avoid apply. I'd love to avoid apply; I just have no idea how to do so in a scenario like the one below.
I have data listing the location of bank branches all over the country. I also have a list of branches for which I want to calculate market shares within 10 miles.
To do this (perhaps because I still somewhat think in Java terms), I iterate through each branch location. For each one, I get the branches near it, then calculate the market share of that branch (together with branches owned by the same firm) relative to the total deposits in that area.
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """ Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    Source: https://stackoverflow.com/questions/29545704/
    fast-haversine-approximation-python-pandas """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    miles = 6367 * c * 0.621371  # 6367 km earth radius, converted here to miles
    return miles
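Because it is built from numpy ufuncs, haversine vectorises over Series as well as scalars, which is how aggregate calls it below. A quick sanity check with made-up coordinates:

lats = pd.Series([32.50, 32.55])
longs = pd.Series([-94.73, -94.37])
haversine(32.6756, -94.5709, lats, longs)  # -> Series of distances in miles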
def get_near(df, lat, long, mi):
    ''' Subsets the data to a set which is larger than the circle of radius
    `mi` used in haversine, but much smaller than the whole frame, so the
    exact distance calculation runs on far fewer rows.'''
    degrees_diff = mi / haversine(lat, 1, lat, 0)  # mi-per-degree of longitude changes w/ lat
    # a dynamic conversion is most accurate; a flat 69 mi/deg is wrong for longitude
    lat_fuzz = mi / 69 * 1.2
    long_fuzz = degrees_diff * 1.2
    # the fuzz widths define a bounding box around the point; the 1.2 factor
    # pads it a bit beyond the circle, just in case
    df = df[np.abs(df['lat'] - lat) < lat_fuzz]
    df = df[np.abs(df['long'] - long) < long_fuzz]
    return df
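For example, with a toy frame (made-up rows; the second is nowhere near Texas), the bounding-box pre-filter keeps only the plausible candidate:

toy = pd.DataFrame({'lat': [32.60, 40.71], 'long': [-94.73, -74.01]})
get_near(toy, 32.6756, -94.5709, 10)  # keeps row 0, drops the New York row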
def aggregate(row):
    cert = row['CERT']
    year = row['YEAR']
    lat = row['lat']
    long = row['long']
    # subset to the things near enough for us to care about
    near_set = get_near(all_branches.query('YEAR == @year'), lat, long, DIST)
    # get real distance and subset
    near_set['dist'] = haversine(lat, long, near_set['lat'], near_set['long'])
    mkt_set = near_set.query('dist < @DIST')
    # get set of certs
    cert_set = mkt_set.query('CERT == @cert')
    cert_dep = cert_set['DEPSUMBR'].sum()
    cert_brs = cert_set['BRNUM'].unique().size
    # calculate mkt share percentages
    set_dep = mkt_set['DEPSUMBR'].sum()
    set_brs = len(mkt_set.drop_duplicates())
    set_bk_count = mkt_set['CERT'].unique().size
    return {
        'dep_mktsh': (cert_dep / set_dep) if set_dep != 0 else np.nan,
        'brs_mktsh': cert_brs / set_brs if set_brs != 0 else np.nan,
        'mkt_brs': set_brs,
        'bk_ct': set_bk_count
    }
This is started by the following:
branches.apply(lambda r: pd.Series(aggregate(r)), axis=1)
How can I refactor the apply out of the entry point to these functions?
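For concreteness, the only fully vectorised shape I can picture is a per-year cross join followed by a single haversine call over the pairs, roughly the sketch below (hypothetical, unbenchmarked, and presumably memory-hungry against 1.6 million rows, which is partly why I'm asking):

# sketch only: pair each target branch with every branch in the same YEAR
pairs = branches.merge(all_branches, on='YEAR', suffixes=('', '_nbr'))
pairs['dist'] = haversine(pairs['lat'], pairs['long'],
                          pairs['lat_nbr'], pairs['long_nbr'])
pairs = pairs[pairs['dist'] < DIST]
# deposit market share via groupby instead of a per-row aggregate()
tot = pairs.groupby(['CERT', 'BRNUM', 'YEAR'])['DEPSUMBR_nbr'].sum()
own = (pairs[pairs['CERT_nbr'] == pairs['CERT']]
       .groupby(['CERT', 'BRNUM', 'YEAR'])['DEPSUMBR_nbr'].sum())
dep_mktsh = own / tot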
For some data context: given an observation with CERT: 25272, lat: 32.6756, long: -94.5709, and YEAR: 1998, this row is run against the whole set of branches. I'll subset that list of 1.6 million rows down to the ones that are relevant here:
CERT YEAR BRNUM lat long DEPSUMBR
1507794 3310 1998 0 32.504207 -94.729451 128656
1507795 3310 1998 1 32.546868 -94.732487 5632
1519246 5553 1998 15 32.545481 -94.366318 116412
1519247 5553 1998 16 32.509617 -94.738234 39609
1532146 11140 1998 0 32.504674 -94.579482 17929
1532147 11140 1998 1 32.512571 -94.710928 2395
1539969 15374 1998 7 32.549888 -94.380500 2202
1546345 17027 1998 0 32.800508 -94.719626 26590
1546346 17027 1998 1 32.724252 -94.704926 2338
1546347 17027 1998 2 32.800508 -94.719626 0
1548642 17842 1998 3 32.551573 -94.757148 149396
1548643 17842 1998 4 32.551573 -94.757148 0
1548644 17842 1998 5 32.551573 -94.757148 0
1553923 20707 1998 1 32.515676 -94.768812 3075
1557431 23261 1998 1 32.537950 -94.764513 38840
1557432 23261 1998 2 32.588054 -94.738605 1417
1557743 23420 1998 2 32.504207 -94.729451 9655
1559350 25272 1998 0 32.540713 -94.367709 111082
1559352 25272 1998 2 32.675641 -94.570889 4606
1559378 25294 1998 4 32.542073 -94.740780 7373
1559575 25752 1998 0 32.505406 -94.661607 20828
1562959 27474 1998 71 32.545895 -94.367224 89796
1563070 27474 1998 249 32.543712 -94.747411 89492
1564490 28066 1998 3 32.519644 -94.741130 14465
1564493 28066 1998 6 32.542174 -94.736346 422
1567824 29221 1998 1 32.540258 -94.754324 16969
1567825 29221 1998 2 32.503581 -94.688224 7102
Note that this is the result of the get_near function and the all_branches.query invocation, which have already subset the list of all branches down to those which are reasonably close by coordinates and in the correct year. From there, the rest of the function yields:
dep_mktsh 0.663306
brs_mktsh 0.500000
mkt_brs 2.000000
bk_ct 2.000000