My research team is scraping micromobility bike/scooter information every minute (with the actual feed being updated every 3-5 minutes) that provides information on the location of the bike/scooter. Every non-duplicate record is being stored into the 'freeBikeStatus' table:
CREATE TABLE `freeBikeStatus` (
`bike_id` varchar(255) NOT NULL,
`name` varchar(255) DEFAULT NULL,
`lon` double DEFAULT NULL,
`lat` double DEFAULT NULL,
`is_reserved` bigint(20) DEFAULT NULL,
`is_disabled` bigint(20) DEFAULT NULL,
`soc` double DEFAULT NULL,
`provider` varchar(255) DEFAULT NULL,
`system_name` varchar(255) NOT NULL,
`timestamp` bigint(20) NOT NULL,
`vehicle_type` varchar(255) DEFAULT NULL,
PRIMARY KEY (`bike_id`,`system_name`,`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
However, there is a ton of redundant information because the bike itself may not move but would still be recorded in multiple rows with different timestamps. This maybe increasing the size of our db by 40-50x, so a query is needed to reduce the size and eliminate redundant information. As an example:
==============================================
| row | bikeid | lat | lon | timestamp | ... |
==============================================
| 1 | a | X | Y | 1:01 | ... |
| 2 | a | X | Y | 1:03 | ... |
| 3 | a | X | Y | 1:05 | ... |
| 4 | a | X | Y | 1:08 | ... |
| 5 | a | Z | Y | 1:12 | ... |
| 6 | a | Z | Y | 1:15 | ... |
| 7 | a | Z | Y | 1:17 | ... |
| 8 | a | Z | Y | 1:19 | ... |
| 9 | a | X | Y | 1:22 | ... |
| 10 | a | X | Y | 1:25 | ... |
| 11 | a | X | Y | 1:27 | ... |
| 12 | a | X | Y | 1:29 | ... |
Since the bike hasn't actually moved from 1:01 to 1:08, and from 1:12 to 1:19 and from 1:22 to 1:29, the inbetween rows are not necessary. So we would like to change the above table to the table below:
==============================================
| row | bikeid | lat | lon | timestamp | ... |
==============================================
| 1 | a | X | Y | 1:01 | ... |
| 4 | a | X | Y | 1:08 | ... |
| 5 | a | Z | Y | 1:12 | ... |
| 8 | a | Z | Y | 1:19 | ... |
| 9 | a | X | Y | 1:22 | ... |
| 12 | a | X | Y | 1:29 | ... |
I came up with the following query based on a similar StackOverFlow question (Selecting first and last values in a group).
WITH
t1 AS (
SELECT * AS lon, lat
FROM freeBikeStatus
),
t2 AS (
SELECT t1.*,
FIRST_VALUE(timestamp) OVER (PARTITION BY lat, lon ORDER BY timestamp) AS begin,
LAST_VALUE(timestamp) OVER (PARTITION BY lat, lon ORDER BY timestamp) AS end
FROM t1
)
SELECT * FROM t2
GROUP BY lat, lon
ORDER BY lat, lon
However, it seems that there is a slight difference in the example (resulting in the query above). If the bike returns to the same latitude/longitude, I believe the query will eliminate all time points even if it moved from point A -> B -> A, which would cut out one end point and one starting point. Is there a way to modify this query to take this into account?