4

I'm doing some trajectory analysis using R and PostgreSQL. In order to form groups of trajectory segments where successive positions are spatio-temporally near, I've created the following table. What I'm still missing is the column group_id, which is what my question is about.

bike_id1    datetime             bike_id2    near     group_id
      1    2016-05-28 11:00:00          2    TRUE            1
      1    2016-05-28 11:00:05          2    TRUE            1
      1    2016-05-28 11:00:10          2    FALSE          NA
[...]
      2    2016-05-28 11:00:05          3    TRUE            1
      2    2016-05-28 11:00:10          3    TRUE            1

This is the result of multiple comparisons between each trajectory with every other (all combinations without repetitions) and an inner join on datetime (sampled always on a multiple of 5 seconds). It shows that for certain positions, bike 1 and 2 were sampled at the same time and are spatially near (some arbitrary threshold).

Now I'd like to give away unique ids for the segments where two bikes are spatio-temporally near (group_id). This is where I'm stuck: I'd want the group_id to respect groups with multiple trajectories. The method for assigning the group_id should realize that if bike 1 and 2 are in a group at 2016-05-28 11:00:05, then 3 belongs to the same group if it is near to 2 at that same timestamp (2016-05-28 11:00:05).

Are there tools within R or PostgreSQL that would help me with this task? Running a loop through the table seems like the wrong way to go about this.

EDIT: As @wildplasser pointed out, this seems to be a gaps-and-islands problem which is traditionally solved using SQL. He has kindly produced some sample data that I have slightly extended and will include in the question.

CREATE TABLE nearness
        -- ( seq SERIAL NOT NULL UNIQUE -- surrogate for conveniance
        ( bike1 INTEGER NOT NULL
        , bike2 INTEGER NOT NULL
        , stamp timestamp NOT NULL
        , near boolean
        , PRIMARY KEY(bike1,bike2,stamp)
        );
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
 (1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE)
,(4,5, '2016-05-28 11:00:00', FALSE)
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
        ;
wildplasser
  • 43,142
  • 8
  • 66
  • 109
Ratnanil
  • 1,641
  • 17
  • 43
  • You need to detect the gaps, so you first have to define what a gap is. After that you can enumerate the non-gaps. – wildplasser May 28 '16 at 12:15
  • you mean using using cumsum along the lines of [this method](http://stackoverflow.com/a/36064324/4139249)? The core of the problem then would still persist: How can you enumerate and still respect the case where a `bike_id`-`datetime` combination already was assigned to a group (especially since the `bike_id` can reside in either column `bike_id1` or `bike_id2`) – Ratnanil May 28 '16 at 12:23
  • The *natural key* for your vicinity/association-table seems to be {bike_id1,bike_id2, datetime} Your "segments" result will have {bike_id1,bike_id2, begintime(->endtime)} as a key. You can enumerate these (after detecting them), even if they overlap ["cumsum": sorry , I don't read R] – wildplasser May 28 '16 at 12:31
  • Are you trying to do some sort of space-time clustering of paths? Is this more a methods question (at the moment) than a programming question? Then throw it over to stats.stackexchange.com – Spacedman May 28 '16 at 12:37
  • @Spacedman: no stats needed (vicinity detection has already been done) It does need some higher level of abstraction (gaps-in-data + enumeration of groups), though, but it can be done in plain SQL, IMO. To the OP: do you have some more data to test on? only two groups is a bit trivial... – wildplasser May 28 '16 at 12:41
  • Yes, I agree. I've already decided on my methodological path, what I need is the programming approach. "gaps-in-data + enumeration of groups" seems to be the keyword here, as a SQL novice I've never heard about this but will dive into it right away. – Ratnanil May 28 '16 at 12:46

1 Answers1

1

UPDATE: [after understanding the real question ;-] Finding the equivalence groups of bikes (set, bike_set) is in fact a relational-division problem. Finding the begin and end of the segments (clust) within a set of bikes is basically the same as in the first attempt.

  • The clusters are stored in arrays: (I trust on the clusters not becoming too large)
  • The array is built by a recursive query: every pair of bikes that has one member in common with the current cluster is merged into it.
  • At the end, the arrays contain all bike_id's that happened to be within reach at the particular time.
  • (plus some intermediate rows which need to be suppressed later by the uniq CTE)
  • The rest is more-or-less standard gap-detection in time-series.

NOTE: the code trusts on (bike2 > bike1). This is needed to keep the array sorted and thus canonical. The actual content is not guaranteed to be canonical because the order of addition in the recursive query cannot be guaranteed. This may need some extra work.


CREATE TABLE nearness
        ( bike1 INTEGER NOT NULL
        , bike2 INTEGER NOT NULL
        , stamp timestamp NOT NULL
        , near boolean
        , PRIMARY KEY(bike1,bike2,stamp)
        );
INSERT INTO nearness( bike1,bike2,stamp,near) VALUES
 (1,2, '2016-05-28 11:00:00', TRUE)
,(1,2, '2016-05-28 11:00:05', TRUE)
,(1,2, '2016-05-28 11:00:10', TRUE)
,(1,2, '2016-05-28 11:00:20', TRUE) -- <<-- gap here
,(1,2, '2016-05-28 11:00:25', TRUE)
,(1,2, '2016-05-28 11:00:30', FALSE)    -- <<-- these False-records serve no pupose
,(4,5, '2016-05-28 11:00:00', FALSE)    -- <<-- result would be the same without them
,(4,5, '2016-05-28 11:00:05', FALSE)
,(4,5, '2016-05-28 11:00:10', TRUE)
,(4,5, '2016-05-28 11:00:15', TRUE)
,(4,5, '2016-05-28 11:00:20', TRUE)
,(2,3, '2016-05-28 11:00:05', TRUE) -- <<-- bike 1, 2, 3 are in one grp @ 11:00:05
,(2,3, '2016-05-28 11:00:10', TRUE) -- <<-- no group here
,(6,7, '2016-05-28 11:00:00', FALSE)
,(6,7, '2016-05-28 11:00:05', FALSE)
        ;


        -- Recursive union-find to glue together sets of bike_ids
        -- ,occuring at the same moment.
        -- Sets are represented as {ordered,unique} arrays here
WITH RECURSIVE wood AS (
        WITH omg AS (
                SELECT bike1 ,bike2,stamp
                , row_number() OVER(ORDER BY bike1,bike2,stamp) AS seq
                , ARRAY[bike1,bike2]::integer[] AS arr
                FROM nearness n WHERE near = True
                )
        -- Find all existing combinations of bikes
        SELECT o1.stamp, o1.seq
                , ARRAY[o1.bike1,o1.bike2]::integer[] AS arr
        FROM omg o1
        UNION ALL
        SELECT o2.stamp, o2.seq -- avoid duplicates inside the array
                , CASE when o2.bike1 = ANY(w.arr) THEN w.arr || o2.bike2
                ELSE  w.arr || o2.bike1 END AS arr
        FROM omg o2
        JOIN wood w
                ON o2.stamp = w.stamp AND o2.seq > w.seq
                AND (o2.bike1 = ANY(w.arr) OR o2.bike2 = ANY(w.arr))
                AND NOT (o2.bike1 = ANY(w.arr) AND o2.bike2 = ANY(w.arr))
        )
, uniq  AS (    -- suppress partial sets caused by the recursive union-find buildup
        SELECT * FROM wood w
        WHERE NOT EXISTS (SELECT * FROM wood nx
                WHERE nx.stamp = w.stamp
                AND nx.arr @> w.arr AND nx.arr <> w.arr -- contains but not equal 
                )
        )
, xsets AS (    -- make unique sets of bikes
        SELECT DISTINCT arr
        -- , MIN(seq) AS grp
        FROM uniq
        GROUP BY arr
        )
, sets AS (     -- enumerate the sets of bikes
        SELECT arr
        , row_number() OVER () AS setnum
        FROM xsets
        )
, drag AS (             -- Detect beginning and end of segments of consecutive observations
        SELECT u.*      -- within a constant set of bike_ids
        -- Edge-detection begin of group
        , NOT EXISTS (SELECT * FROM uniq nx
                WHERE nx.arr = u.arr
                AND nx.stamp < u.stamp
                AND nx.stamp >= u.stamp - '5 sec'::interval
                ) AS is_first
        -- Edge-detection end of group
        , NOT EXISTS (SELECT * FROM uniq nx
                WHERE nx.arr = u.arr
                AND nx.stamp > u.stamp
                AND nx.stamp <= u.stamp + '5 sec'::interval
                ) AS is_last
        , row_number() OVER(ORDER BY arr,stamp) AS nseq
        FROM uniq u
        )
, top AS ( -- id and groupnum for the start of a group
        SELECT nseq
        , row_number() OVER () AS clust
        FROM drag
        WHERE is_first
        )
, bot AS ( -- id and groupnum for the end of a group
        SELECT nseq
        , row_number() OVER () AS clust
        FROM drag
        WHERE is_last
        )
SELECT w.seq as orgseq  -- results, please ...
        , w.stamp
        , g0.clust AS clust
        , row_number() OVER(www) AS rn
        , s.setnum, s.arr AS bike_set
        FROM drag w
        JOIN sets s ON s.arr = w.arr
        JOIN top g0 ON g0.nseq <= w.seq
        JOIN bot g1 ON g1.nseq >= w.seq AND g1.clust = g0.clust
        WINDOW www AS (PARTITION BY g1.clust ORDER BY w.stamp)
        ORDER BY g1.clust, w.stamp
        ;

Result:


 orgseq |        stamp        | clust | rn | setnum | bike_set 
--------+---------------------+-------+----+--------+----------
      1 | 2016-05-28 11:00:00 |     1 |  1 |      1 | {1,2}
      4 | 2016-05-28 11:00:20 |     3 |  1 |      1 | {1,2}
      5 | 2016-05-28 11:00:25 |     3 |  2 |      1 | {1,2}
      6 | 2016-05-28 11:00:05 |     4 |  1 |      3 | {1,2,3}
      7 | 2016-05-28 11:00:10 |     4 |  2 |      3 | {1,2,3}
      8 | 2016-05-28 11:00:10 |     4 |  3 |      2 | {4,5}
(6 rows)
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • First of all: Thank you very much!! Second: Wow.. this goes way over my SQL skills and it will take a while for me to wrap my head around your code. Looking at the result though, I have to say that a key issue doesn't seem to work yet: `bike 2` has two different groups assigned at 11:00:05 (`grp 1` with `bike 1`, `grp 3` with `bike 3`). This shouldn't happen, since `bike 1`, `2` and `3` should be assigned to the same group at 11:00:05. But I have the feeling this requirement would exponentially complicate things.. or am I wrong? – Ratnanil May 29 '16 at 08:20
  • You should have defined *vicinity* in your question. If you want to cluster neigbors_of_neighbors, it becomes a different problem. (think: triangle inequality) – wildplasser May 29 '16 at 09:02
  • sorry for the misunderstanding! I was definitely trying to formulate the issue as clear as I can, but I'm was lacking some necessary keywords. Thanks for this solution! – Ratnanil May 30 '16 at 20:12