segmenting or grouping a df based on parameters or differences within columns going down the dataframe rows?

Question

I was trying to figure out if there was a way in which where I had a dataframe with multiple fields and I wanted to segment or group the dataframe into a new dataframe based on if the values of specific columns were within x amount of each other?

   I.D  |      Created_Time            | Home_Longitude | Home_Latitude | Work_Longitude | Home_Latitude
  Faa1      2019-02-23 20:01:13.362           -77.0364            38.8951    -72.0364      38.8951

Above is how the original df looks with multiple rows. I want to create a new dataframe where all rows or I.Ds contain created times that are within x amount of minutes of each other, and using haversine within x miles of one another homes, and x miles within one another work.

So Basically trying to filter this dataframe into a df that only contains rows that are within x minutes of order created time, x miles within one another homes and , x miles within each work column value.

Compute the delta columns then use the logic you state in the last sentence and filter based on those new diffs. — Kyle, Mar 07 '19 at 02:08

edesz · Answer 1 · 2019-03-16T20:14:15.770

I did this by

calculating the distances (in miles) and time relative to the first row
- My logic
  - if n rows are within x minutes/miles of the first row, then those n rows are within x minutes/miles of each other
filter the data using the required distance and time filter conditions

Generate some dummy data

random co-ordinates

# Generate random Lat-Long points
def newpoint():
   return uniform(-180,180), uniform(-90, 90)
home_points = (newpoint() for x in range(289))
work_points = (newpoint() for x in range(289))

df = pd.DataFrame(home_points, columns=['Home_Longitude', 'Home_Latitude'])
df[['Work_Longitude', 'Work_Latitude']] = pd.DataFrame(work_points)

# Insert `ID` column as sequence of integers
df.insert(0, 'ID', range(289))

# Generate random datetimes, separated by 5 minute intervals
# (you can choose your own interval)
times = pd.date_range('2012-10-01', periods=289, freq='5min')
df.insert(1, 'Created_Time', times)
print(df.head())

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude
0   0 2012-10-01 00:00:00      -48.885981     -39.412351      -68.756244      24.739860
1   1 2012-10-01 00:05:00       58.584893      59.851739     -119.978429     -87.687858
2   2 2012-10-01 00:10:00      -18.623484      85.435248      -14.204142      -3.693993
3   3 2012-10-01 00:15:00      -29.721788      71.671103      -69.833253     -12.446204
4   4 2012-10-01 00:20:00      168.257968     -13.247833       60.979050     -18.393925

Create Python helper function with haversine distance formula (vectorized haversine distance formula, in km)

def haversine(lat1, lon1, lat2, lon2, to_radians=False, earth_radius=6371):
    """
    slightly modified version: of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

Calculate distances (relative to first row) in km, using haversine formula. Then, convert km to miles

df['Home_dist_miles'] = \
    haversine(df.Home_Longitude, df.Home_Latitude,
                 df.loc[0, 'Home_Longitude'], df.loc[0, 'Home_Latitude'])*0.621371
df['Work_dist_miles'] = \
    haversine(df.Work_Longitude, df.Work_Latitude,
                 df.loc[0, 'Work_Longitude'], df.loc[0, 'Work_Latitude'])*0.621371

Calculate time differences, in minutes (relative to first row)

for the dummy data here, the time differences will be in multiples of 5 minutes (but in real data, they could be anything)

df['time'] = df['Created_Time'] - df.loc[0, 'Created_Time']
df['time_min'] = (df['time'].dt.days * 24 * 60 * 60 + df['time'].dt.seconds)/60

Apply filters (method 1) and then select any 2 rows that satisfy the conditions stated in the OP

home_filter = df['Home_dist_miles']<=12000 # within 12,000 miles
work_filter = df['Work_dist_miles']<=8000 # within 8,000 miles
time_filter = df['time_min']<=25 # within 25 minutes
df_filtered = df.loc[(home_filter) & (work_filter) & (time_filter)]

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0

Apply filters (method 2) and then select any 2 rows that satisfy the conditions stated in the OP

multi_query = """Home_dist_miles<=12000 & \
                Work_dist_miles<=8000 & \
                time_min<=25"""
df_filtered = df.query(multi_query)

# Select any 2 rows that satisfy required conditions
df_any2rows = df_filtered.sample(n=2)
print(df_any2rows)

   ID        Created_Time  Home_Longitude  Home_Latitude  Work_Longitude  Work_Latitude  Home_dist_miles  Work_dist_miles     time  time_min
0   0 2012-10-01 00:00:00     -168.956448     -42.970705       -6.340945     -12.749469         0.000000         0.000000 00:00:00       0.0
4   4 2012-10-01 00:20:00      -73.120352      13.748187      -36.953587      23.528789      6259.078588      5939.425019 00:20:00      20.0

Hey thanks for the input but im not sure if I follow this logic - all the rows are different in the table and dont rely on the first row - the data for example is from different days. There can be a created time of 3:03pm and 10:03 pm but both can have other orders created within 3 minutes lets say of both of their respective orders? @edesz — Chris90, Mar 07 '19 at 05:41
Right, they don't need to follow the first row. I only chose the first row for convenience but you could choose any other row. — edesz, Mar 07 '19 at 06:30
But I still dont understand - I am trying to extract any 2 rows that satisfy those conditions — Chris90, Mar 07 '19 at 09:22
@Chris90, please see the updated answer. I had added code to [extract any 2 of the rows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html#pandas-dataframe-sample) (another example [here](https://www.geeksforgeeks.org/python-pandas-dataframe-sample/)) from the filtered `DataFrame` that satisfies the conditions you stated in the OP. — edesz, Mar 16 '19 at 20:16

segmenting or grouping a df based on parameters or differences within columns going down the dataframe rows?

1 Answers1

Linked