I have a pandas dataframe containing a record of lightning strikes with timestamps and global positions in the following format:
   Index      Date              Time       Lat       Lon  Good fix?
0      1  20160101  00:00:00.9962692   -7.1961  -60.7604          1
1      2  20160101  00:00:01.0646207   -7.0518  -60.6911          1
2      3  20160101  00:00:01.1102066  -25.3913  -57.2922          1
3      4  20160101  00:00:01.2018573   -7.4842  -60.5129          1
4      5  20160101  00:00:01.2942750   -7.3939  -60.4992          1
5      6  20160101  00:00:01.4431493   -9.6386  -62.8448          1
6      8  20160101  00:00:01.5226157  -23.7089  -58.8888          1
7      9  20160101  00:00:01.5932412   -6.3513  -55.6545          1
8     10  20160101  00:00:01.6736350  -23.8019  -58.9382          1
9     11  20160101  00:00:01.6957858  -24.5724  -57.7229          1
The actual dataframe contains millions of rows. I wish to separate out events which happened far away in space and time from all other events, and store them in a new dataframe, isolated_fixes. I have written code to calculate the separation of any two events, which is as follows:
import math as m

def are_strikes_space_close(strike1, strike2, defclose=100, latpos=3, lonpos=4):
    # Uses the haversine formula to calculate the distance between two points.
    # Each strike is an (index, row) tuple as produced by DataFrame.iterrows().
    # Returns a tuple: (Boolean closeness statement, numerical distance in km).
    radlat1 = m.radians(strike1[1].iloc[latpos])
    radlon1 = m.radians(strike1[1].iloc[lonpos])
    radlat2 = m.radians(strike2[1].iloc[latpos])
    radlon2 = m.radians(strike2[1].iloc[lonpos])
    a = (m.sin((radlat1-radlat2)/2)**2) + m.cos(radlat1)*m.cos(radlat2)*(m.sin((radlon1-radlon2)/2)**2)
    c = 2*m.atan2(a**0.5, (1-a)**0.5)
    R = 6371  # earth radius in km
    d = R*c  # distance between points in km
    return (d <= defclose, d)
and for time:
import datetime as dt

def getdatetime(series, timelabel=2, datelabel=1, timeformat="%H:%M:%S.%f", dateformat="%Y%m%d"):
    # series is an (index, row) tuple as produced by DataFrame.iterrows().
    # %f accepts at most 6 fractional digits, so the time string is cut to 15 characters.
    time = dt.datetime.strptime(series[1].iloc[timelabel][:15], timeformat)
    date = dt.datetime.strptime(str(series[1].iloc[datelabel]), dateformat)
    return dt.datetime.combine(date.date(), time.time())

def are_strikes_time_close(strike1, strike2, defclose=dt.timedelta(0, 7200, 0)):
    # Returns a tuple: (True if within defclose, 2 hours by default; absolute time difference).
    dt1 = getdatetime(strike1)
    dt2 = getdatetime(strike2)
    timediff = abs(dt1 - dt2)
    return (timediff <= defclose, timediff)
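With millions of rows, calling these helpers once per pair of rows is likely to be slow. The same two calculations can be applied to whole columns at once. The following is only a sketch: it assumes NumPy is available and that the columns are named "Date", "Time", "Lat" and "Lon" as in the sample above, and the names parse_timestamps and haversine_km are made up for illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0

def parse_timestamps(df):
    """Combine the Date and Time columns into a single datetime Series."""
    # Assumed column names: "Date" (e.g. 20160101) and "Time" (e.g. "00:00:00.9962692").
    combined = df["Date"].astype(str) + " " + df["Time"].astype(str)
    return pd.to_datetime(combined, format="%Y%m%d %H:%M:%S.%f")

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are degrees, scalars or arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    a = np.clip(a, 0.0, 1.0)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
```

Passing one strike's coordinates as scalars and a whole column as arrays gives the distances from that strike to every other strike in a single call.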
The real problem is how to efficiently compare each event with all the other events to determine how many of them are both space_close and time_close.
Note that not every pair of events needs to be checked: the events are ordered by datetime, so if there were a way to check events 'middle out' from each event and stop once the events are no longer close in time, that would save a lot of operations, but I don't know how to do this.
At the moment, my (nonfunctional) attempt looks like this:
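import pandas as pd

def extrisolfixes(data, filtereddata, defisol=4):
    # Brute force: compares every strike with every other strike (O(n^2) comparisons).
    isolated = []
    for strike1 in data.iterrows():
        near_strikes = -1  # -1 to account for self counting once on each loop
        for strike2 in data.iterrows():
            if are_strikes_space_close(strike1, strike2)[0] and are_strikes_time_close(strike1, strike2)[0]:
                near_strikes += 1
        if near_strikes <= defisol:
            isolated.append(strike1[1])  # strike1 is (index, row); keep only the row
    # DataFrame.append was removed in pandas 2.0, so the rows are concatenated instead.
    return pd.concat([filtereddata, pd.DataFrame(isolated)])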
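The "stop when events are no longer close in time" idea could be sketched roughly as below. This is not a tested solution for the full data set, only an illustration under some assumptions: the rows are already sorted by time, the columns are named "Date", "Time", "Lat" and "Lon" as in the sample, and the function name find_isolated and its parameters are invented here. It uses np.searchsorted on the sorted timestamps to find, for each strike, the slice of rows within the time window, and then applies the haversine formula only to that slice.

```python
import numpy as np
import pandas as pd

def find_isolated(df, max_km=100.0, max_seconds=7200, max_neighbours=4):
    """Return rows with at most max_neighbours other strikes close in space and time."""
    # Assumption: df is sorted by time and has columns Date, Time, Lat, Lon.
    stamps = pd.to_datetime(
        df["Date"].astype(str) + " " + df["Time"].astype(str),
        format="%Y%m%d %H:%M:%S.%f",
    )
    # Timestamps as integer nanoseconds since the epoch.
    t = stamps.to_numpy().astype("datetime64[ns]").astype(np.int64)
    lat = np.radians(df["Lat"].to_numpy(dtype=float))
    lon = np.radians(df["Lon"].to_numpy(dtype=float))
    window = int(max_seconds * 1e9)

    # For strike i, rows lo[i]..hi[i]-1 are exactly those within the time window.
    lo = np.searchsorted(t, t - window, side="left")
    hi = np.searchsorted(t, t + window, side="right")

    keep = np.zeros(len(df), dtype=bool)
    for i in range(len(df)):
        j = slice(lo[i], hi[i])
        # Haversine distance from strike i to every strike in its time window.
        a = (np.sin((lat[j] - lat[i]) / 2) ** 2
             + np.cos(lat[i]) * np.cos(lat[j]) * np.sin((lon[j] - lon[i]) / 2) ** 2)
        dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        near = np.count_nonzero(dist_km <= max_km) - 1  # -1 removes the strike itself
        keep[i] = near <= max_neighbours
    return df[keep]
```

The work per strike is proportional to the number of strikes inside its time window rather than to the whole dataframe, so with a 2 hour window and a high strike rate the loop may still be expensive.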
Thanks for any help! Am happy to provide clarification if needed.