Selecting first and last row by group in a MySQL table with a nuance in the groups

Question

My research team is scraping micromobility bike/scooter information every minute (with the actual feed being updated every 3-5 minutes) that provides information on the location of the bike/scooter. Every non-duplicate record is being stored into the 'freeBikeStatus' table:

CREATE TABLE `freeBikeStatus` (
  `bike_id` varchar(255) NOT NULL,
  `name` varchar(255) DEFAULT NULL,
  `lon` double DEFAULT NULL,
  `lat` double DEFAULT NULL,
  `is_reserved` bigint(20) DEFAULT NULL,
  `is_disabled` bigint(20) DEFAULT NULL,
  `soc` double DEFAULT NULL,
  `provider` varchar(255) DEFAULT NULL,
  `system_name` varchar(255) NOT NULL,
  `timestamp` bigint(20) NOT NULL,
  `vehicle_type` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`bike_id`,`system_name`,`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

However, there is a ton of redundant information because the bike itself may not move but would still be recorded in multiple rows with different timestamps. This maybe increasing the size of our db by 40-50x, so a query is needed to reduce the size and eliminate redundant information. As an example:

==============================================
| row | bikeid | lat | lon | timestamp | ... |
==============================================
|  1  |   a    |  X  |  Y  |   1:01    | ... |
|  2  |   a    |  X  |  Y  |   1:03    | ... |
|  3  |   a    |  X  |  Y  |   1:05    | ... |
|  4  |   a    |  X  |  Y  |   1:08    | ... |
|  5  |   a    |  Z  |  Y  |   1:12    | ... |
|  6  |   a    |  Z  |  Y  |   1:15    | ... |
|  7  |   a    |  Z  |  Y  |   1:17    | ... |
|  8  |   a    |  Z  |  Y  |   1:19    | ... |
|  9  |   a    |  X  |  Y  |   1:22    | ... |
| 10  |   a    |  X  |  Y  |   1:25    | ... |
| 11  |   a    |  X  |  Y  |   1:27    | ... |
| 12  |   a    |  X  |  Y  |   1:29    | ... |

Since the bike hasn't actually moved from 1:01 to 1:08, and from 1:12 to 1:19 and from 1:22 to 1:29, the inbetween rows are not necessary. So we would like to change the above table to the table below:

==============================================
| row | bikeid | lat | lon | timestamp | ... |
==============================================
|  1  |   a    |  X  |  Y  |   1:01    | ... |
|  4  |   a    |  X  |  Y  |   1:08    | ... |
|  5  |   a    |  Z  |  Y  |   1:12    | ... |
|  8  |   a    |  Z  |  Y  |   1:19    | ... |
|  9  |   a    |  X  |  Y  |   1:22    | ... |
| 12  |   a    |  X  |  Y  |   1:29    | ... |

I came up with the following query based on a similar StackOverFlow question (Selecting first and last values in a group).

WITH
  t1 AS (
    SELECT * AS lon, lat
    FROM freeBikeStatus
  ),
  t2 AS (
    SELECT t1.*,
    FIRST_VALUE(timestamp) OVER (PARTITION BY lat, lon ORDER BY timestamp) AS begin,
    LAST_VALUE(timestamp) OVER (PARTITION BY lat, lon ORDER BY timestamp) AS end
    FROM t1
  )
SELECT * FROM t2
GROUP BY lat, lon
ORDER BY lat, lon

However, it seems that there is a slight difference in the example (resulting in the query above). If the bike returns to the same latitude/longitude, I believe the query will eliminate all time points even if it moved from point A -> B -> A, which would cut out one end point and one starting point. Is there a way to modify this query to take this into account?

Just curious.What are the possible values of is_reserved and is_disabled? — Strawberry, Jul 14 '20 at 19:21

forpas · Answer 1 · 2020-07-14T19:35:32.067

Use LAG() and LEAD() window functions to check the previous and next values of lat and lon of each row:

with cte as (
  select *,
    lag(lat) over (partition by bikeid order by timestamp) prev_lat,
    lead(lat) over (partition by bikeid order by timestamp) next_lat,
    lag(lon) over (partition by bikeid order by timestamp) prev_lon,
    lead(lon) over (partition by bikeid order by timestamp) next_lon
  from freeBikeStatus  
)
select `row`, bikeid, lat, lon, timestamp
from cte
where 
     (lat, lon) <> (prev_lat, prev_lon) 
  or (lat, lon) <> (next_lat, next_lon)
  or coalesce(prev_lat, prev_lon) is null
  or coalesce(next_lat, next_lon) is null
order by `row`

See the demo.
Results:

| row | bikeid | lat | lon | timestamp |
| --- | ------ | --- | --- | --------- |
| 1   | a      | X   | Y   | 1:01      |
| 4   | a      | X   | Y   | 1:08      |
| 5   | a      | Z   | Y   | 1:12      |
| 8   | a      | Z   | Y   | 1:19      |
| 9   | a      | X   | Y   | 1:22      |
| 12  | a      | X   | Y   | 1:29      |

Selecting first and last row by group in a MySQL table with a nuance in the groups

1 Answers1