I've got a function which returns a pandas Series to 2 columns in a dataframe. Currently my code looks like this:
def firstSite(coords, lat, long, date):
df1 = coords[coord['Date2'] <= date]
df1['distance'] = df1.apply(
lambda row: distance(lat, long, row['lat2'], row['long2'],
axis = 1)
df2 = df1.loc[df1.distance <= 2].nsmallest(1, 'Date2')[['Site Name','distance']]
return pd.Series([b2['Site Name'],b2['distance']])
df[['A','B']] = df.apply(
lambda row: firstSite(coords, row['lat'], row['lng'], row['Date'],
axis = 1)
Currently, it returns a pandas Series with the values from df2. However, when I look at the output outside the function, the output looks like this:
ID Date pc_lat pc_long A B
A 2016 51.5 -1.0 Series([], Name: Site Name, dtype: object) Series([], Name: distance, dtype: float64)
B 2016 51.6 -1.2 Series([], Name: Site Name, dtype: object) Series([], Name: distance, dtype: float64)
C 2016 51.6 -1.2 Series([], Name: Site Name, dtype: object) Series([], Name: distance, dtype: float64)
D 2016 51.6 -1.2 20 Drax Biomass Power Station - Unit 1 Name: Site Name, dtype: object 20 1.921752 Name: distance, dtype: float64
E 2016 51.5 -1.1 Series([], Name: Site Name, dtype: object) Series([], Name: distance, dtype: float64)
I've obviously returned the pandas series, not the pandas series values - however if I changed the code to:
return pd.Series([b2['Site Name'],b2['distance']]).values
I get an error. How can I modify my code to return the 'Site Name' & 'distance' values from b2?
Also I've messed around with column headings a bit here, so some of this won't make sense practically, however I'm simply looking for a solution to my problem, where I can return either an empty list/NaN or the value.
An example of the value in my mock CSV is "Drax Biomass Power Station - Unit 1"" for Site Name & "1.921752" for distance. I don't want all the rest of the info about the Series.
Edit:
Okay, so I'm using a Haversine formula I got linked to here. Here is my distance function:
def distanceBetweenCm(lat1, lon1, lat2, lon2):
"""
https://stackoverflow.com/questions/44910530/
how-to-find-the-distance-between-2-points-in-2-different-dataframes-in-pandas/44910693#44910693
Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula
"""
dLat = math.radians(lat2-lat1)
dLon = math.radians(lon2-lon1)
lat1 = math.radians(lat1)
lat2 = math.radians(lat2)
a = math.sin(dLat/2)**2 + math.sin(dLon/2)**2 * math.cos(lat1) * math.cos(lat2)
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
return c * 6371 #multiply by 100k to get distance in cm
My code attempts finds the first site constructed within a certain radius (2km) for a CSV of transactions. Here is the function firstSite:
def firstSite(biomass, lat, long, date):
#Only if the Date of Operation of Biomass is after transaction date,
b1 = biomass[biomass['Date of Operation'] <= date]
#New distance column which is the distance between the two sets of points
b1['distance'] = b1.apply(
lambda row: distanceBetweenCm(lat, long, row['Lat'], row['Lng']),
axis=1)
#Create new dataframe where the smallest record from biomass within 2km is selected
b2 = b1.loc[b1.distance <= 2].nsmallest(1, 'Date of Operation')[['Site Name','distance']]
if b2.empty:
b2.loc[0] = [np.nan, np.nan]
return pd.Series([b2['Site Name'],b2['distance']])
I have played around with removing the code below since it makes it so much quicker.:
if b2.empty:
b2.loc[0] = [np.nan, np.nan]
I've got another function where I read in a CSV of transactions, read in the CSV full of Biomass sites. I then limit the biomass CSV to sites which were constructed before the transaction (although I may need transactions before & after construction later on) & then I run the function firstSite over the transaction dataframe (df1) & write to output CSV.
def addBioData(csv1, csv2, year):
df1 = pd.read_csv(csv1)
bio = "Biomass\PytAny\BiomassOp.csv"
biomass = pd.read_csv(bio)
print("Input Bio CSV: "+str(bio))
dt = datetime.date(year + 1, 1, 1)
biomass['Date of Operation'] = pd.to_datetime(biomass['Date of Operation'])
biomassyr = biomass[biomass['Date of Operation'] < dt]
df1[['FS2km', 'FS2kmDist']] = df1.apply(
lambda row: firstSite(biomassyr, row['pc_lat'], row['pc_long'], row['Date']),
axis = 1)
print(df1)
df1.to_csv(csv2,index=None,encoding='utf-8')
If there is a quicker way than using .apply, I'd be extremely interested! I will edit in a pastebin with sample csv in one sec.
I've made a mock-up version of what I'd like to finish with. Essentially, I want the Site Name of the first site built (by date) which is within 2km of the transaction coordinates. If there isn't any Biomass site within 2km, the values are "Null", or NaN.