0

I've got a function which returns a pandas Series to 2 columns in a dataframe. Currently my code looks like this:

def firstSite(coords, lat, long, date):
    df1 = coords[coord['Date2'] <= date]
    df1['distance'] = df1.apply(
        lambda row: distance(lat, long, row['lat2'], row['long2'],
        axis = 1)

    df2 = df1.loc[df1.distance <= 2].nsmallest(1, 'Date2')[['Site Name','distance']] 

    return pd.Series([b2['Site Name'],b2['distance']])

df[['A','B']] = df.apply(
    lambda row: firstSite(coords, row['lat'], row['lng'], row['Date'],
    axis = 1)

Currently, it returns a pandas Series with the values from df2. However, when I look at the output outside the function, the output looks like this:

ID Date pc_lat pc_long A                                            B

A  2016 51.5   -1.0    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)
B  2016 51.6   -1.2    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)
C  2016 51.6   -1.2    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)
D  2016 51.6   -1.2    20    Drax Biomass Power Station - Unit 1 Name: Site Name, dtype: object 20    1.921752 Name: distance, dtype: float64
E  2016 51.5   -1.1    Series([], Name: Site Name, dtype: object)   Series([], Name: distance, dtype: float64)

I've obviously returned the pandas series, not the pandas series values - however if I changed the code to:

return pd.Series([b2['Site Name'],b2['distance']]).values

I get an error. How can I modify my code to return the 'Site Name' & 'distance' values from b2?

Also I've messed around with column headings a bit here, so some of this won't make sense practically, however I'm simply looking for a solution to my problem, where I can return either an empty list/NaN or the value.

An example of the value in my mock CSV is "Drax Biomass Power Station - Unit 1"" for Site Name & "1.921752" for distance. I don't want all the rest of the info about the Series.


Edit:

Okay, so I'm using a Haversine formula I got linked to here. Here is my distance function:

def distanceBetweenCm(lat1, lon1, lat2, lon2):
    """
    https://stackoverflow.com/questions/44910530/
    how-to-find-the-distance-between-2-points-in-2-different-dataframes-in-pandas/44910693#44910693
    Haversine Formula: https://en.wikipedia.org/wiki/Haversine_formula
    """
    dLat = math.radians(lat2-lat1)
    dLon = math.radians(lon2-lon1)

    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)

    a = math.sin(dLat/2)**2 + math.sin(dLon/2)**2 * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    return c * 6371 #multiply by 100k to get distance in cm

My code attempts finds the first site constructed within a certain radius (2km) for a CSV of transactions. Here is the function firstSite:

def firstSite(biomass, lat, long, date):

    #Only if the Date of Operation of Biomass is after transaction date, 
    b1 = biomass[biomass['Date of Operation'] <= date]

    #New distance column which is the distance between the two sets of points
    b1['distance'] = b1.apply(
        lambda row: distanceBetweenCm(lat, long, row['Lat'], row['Lng']), 
        axis=1)

    #Create new dataframe where the smallest record from biomass within 2km is selected
    b2 = b1.loc[b1.distance <= 2].nsmallest(1, 'Date of Operation')[['Site Name','distance']]
    if b2.empty:                        
        b2.loc[0] = [np.nan, np.nan]  
    return pd.Series([b2['Site Name'],b2['distance']])

I have played around with removing the code below since it makes it so much quicker.:

    if b2.empty:                        
        b2.loc[0] = [np.nan, np.nan] 

I've got another function where I read in a CSV of transactions, read in the CSV full of Biomass sites. I then limit the biomass CSV to sites which were constructed before the transaction (although I may need transactions before & after construction later on) & then I run the function firstSite over the transaction dataframe (df1) & write to output CSV.

def addBioData(csv1, csv2, year):
    df1 = pd.read_csv(csv1)
    bio = "Biomass\PytAny\BiomassOp.csv"
    biomass = pd.read_csv(bio)
    print("Input Bio CSV: "+str(bio))

    dt = datetime.date(year + 1, 1, 1)
    biomass['Date of Operation']  = pd.to_datetime(biomass['Date of Operation'])
    biomassyr = biomass[biomass['Date of Operation'] < dt]
    df1[['FS2km', 'FS2kmDist']] = df1.apply(
        lambda row: firstSite(biomassyr, row['pc_lat'], row['pc_long'], row['Date']),
        axis = 1)
    print(df1)

    df1.to_csv(csv2,index=None,encoding='utf-8')

If there is a quicker way than using .apply, I'd be extremely interested! I will edit in a pastebin with sample csv in one sec.

biomass CSV

Transaction Price CSV

Output CSV

I've made a mock-up version of what I'd like to finish with. Essentially, I want the Site Name of the first site built (by date) which is within 2km of the transaction coordinates. If there isn't any Biomass site within 2km, the values are "Null", or NaN.

christaylor
  • 361
  • 1
  • 5
  • 14
  • 1
    Try to avoid using `.apply()` and especially `.apply(..., axis=1)` as it's __very__ slow. Can you post a small sample data set and desired data set. If you need to calculate distance - there might be vectorized solutions for that (depending on the algorithm). What `distance` algorithm (haversine, euclidean, etc.) are you using? – MaxU - stand with Ukraine Jul 16 '17 at 11:40
  • 'distance' must be one of your own functions whose definition does not appear here, so can't really test out the code you pasted – cardamom Jul 16 '17 at 11:42
  • I'll edit in a more complete version of the code I'm using + sample datasets in the main post. Check back here in 5 mins & I'll include some more useful info. – christaylor Jul 16 '17 at 11:42
  • Added some edits now – christaylor Jul 16 '17 at 12:11
  • @CTaylor19, i'm trying to understand how your desired data set would look like... It would be much easier if you would add small sample input data sets and a desired one... – MaxU - stand with Ukraine Jul 16 '17 at 12:40
  • Okay, adding now. – christaylor Jul 16 '17 at 12:41
  • Thanks a tonne for the help by the way, I feel kinda useless & pathetic asking on here so much but I need to get it done! – christaylor Jul 16 '17 at 12:47
  • @CTaylor19, what values do you use for the `year` parameter when calling `addBioData()` function? – MaxU - stand with Ukraine Jul 16 '17 at 13:21
  • That is simply a string representing the year. I have CSVs with transactions for every year from 1995 to 2017, so for a lot of stuff I'm doing, I'm looping through a range of strings – christaylor Jul 16 '17 at 13:23

0 Answers0