
I will probably need to change the title, but I lack the language to express the problem concisely.

I have a big data frame with the following structure:

class(df)
[1] "SpatialPointsDataFrame"

df

           coordinates    ID group_id  value1       time    zone
1  (9.09077, 45.47737) 24739        1      70 1505480403      62
2  (9.24016, 45.50221) 62965        1      35 1505480403      19
3  (9.22375, 45.50891) 93104        1      86 1505480403      16
4  (9.24392, 45.51171) 35646        1      80 1505480403      17
5  (9.24326, 45.51349) 24747        1      80 1505480403      17
6  (9.12738, 45.46933) 44274        1      53 1505480403      60
7  (9.25434, 45.45346) 53579        1      99 1505480403      30
8  (9.20306, 45.42118) 74565        1      37 1505480403      38
9   (9.1183, 45.45458) 74521        1      79 1505480403      53
10 (9.14386, 45.44674) 53584        1      62 1505480403      49
11 (9.14961, 45.51675) 44283        1      80 1505480403      81
12 (9.20478, 45.50888) 62936        1      50 1505480403      13
13   (9.20352, 45.421) 87311        1      80 1505480403      38
14 (9.11081, 45.49871) 43467        1      80 1505480403      65
15 (9.25292, 45.51276) 93101        1      41 1505480403      17
16 (9.08882, 45.45794) 53507        1      26 1505480403      55
17  (9.2071, 45.50587) 53522        1      80 1505480403      13
18 (9.22379, 45.50888) 87372        1      80 1505480403      16
19 (9.24388, 45.50456) 87332        1      21 1505480403      19
20 (9.20917, 45.51816) 47710        1      69 1505480403      15
21 (9.12292, 45.50486) 06213        1      83 1505480403      72
22 (9.21751, 45.51149) 35633        1      80 1505480403      16
23 (8.85098, 45.60636) 35632        1      70 1505480403     N/A
24 (9.08967, 45.46126) 79065        1      29 1505480403      55
25  (9.11043, 45.4958) 72813        1      80 1505480403      65
26 (9.14425, 45.44787) 06221        1      53 1505480403      49
27 (9.21566, 45.51105) 06286        1      81 1505480403      16
28 (9.15909, 45.52463) 53547        1      22 1505480403      82
29  (9.1357, 45.45472) 74513        1      85 1505480403      52
30 (8.85068, 45.60646) 62949        1      80 1505480403     N/A
31 (9.20874, 45.42597) 06296        1      20 1505480403      38
32  (9.1992, 45.51189) 62972        1      84 1505480403      14
33 (9.15462, 45.51798) 43496        1      52 1505480403      81
34 (9.08816, 45.47548) 87351        1      21 1505480403      62
35 (9.13154, 45.43608) 93106        1      50 1505480403      48
36 (9.14139, 45.44461) 14879        1      29 1505480403      48
37 (9.19553, 45.44263) 43464        1      78 1505480403      37
38 (9.14361, 45.48266) 87371        1      80 1505480403      58
39 (9.13611, 45.46989) 36099        1      91 1505480403      57
40 (9.22385, 45.50939) 53590        1      94 1505480403      16
41 (9.08562, 45.45992) 75494        1      40 1505480403      55
42 (9.20702, 45.52272) 24709        1      44 1505480403      14
43 (9.23615, 45.50845) 53577        1      31 1505480403      16
44 (9.08801, 45.47528) 24766        1      64 1505480403      62
45 (9.24426, 45.51104) 62975        1      80 1505480403      17
46  (9.2403, 45.50226) 62965        1      35 1505476803      19
47 (9.22376, 45.50896) 93104        1      86 1505476803      16
48 (9.09076, 45.47738) 24739        1      70 1505476803      62
49 (9.24405, 45.51173) 35646        1      80 1505476803      17
50 (9.24327, 45.51344) 24747        1      80 1505476803      17
106 (9.22914, 45.45137) 30907       1      89 1505376003      29
112 (9.22917, 45.45136) 30907       1      89 1505372403      29
118 (9.2291, 45.45126) 30907        1      89 1505368803      29
124 (9.22917, 45.45137) 30907       1      89 1505365203      29
130 (9.22918, 45.4514) 30907        1      89 1505361603      29

As you can see in the last rows, it frequently happens that observations occur at the same point (or in its immediate vicinity), from the same sensor, in consecutive probes. Is there a way to transform the dataset so that each such group has a start_time (the time of the first observation at that point) and an end_time (the time of the last observation at that point + 60 min)?

The dataset has over 675,000 observations, but many of them are "groupable". These are GPS sensors on moving vehicles; I gather info from them every hour, and the vehicles often keep the same position across multiple observations. What I want to do is, for every group of observations (same ID, same position within a few metres, contiguous times), generate a new data frame where: if the observation is "single", it has start_time = time and end_time = time + 60 min; if, on the other hand, there are several observations that can be grouped (as in rows 106-130 of the example above), it should have start_time = time of the earliest observation (1505361603 in the example above) and end_time = time of the latest observation + 60 min (1505376003 + 60 min), as sketched below.
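
For example, rows 106-130 above should collapse into a single row along these lines (my sketch of the desired output, with coordinates omitted since they would be one of the nearly identical points of the group; end_time = 1505376003 + 3600 = 1505379603):

   ID group_id value1 start_time   end_time zone
30905        1     89 1505361603 1505379603   29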

EDIT2: Using a subset with only one "ID", I was able to identify whether time intervals are consecutive (I still can't grasp why the difference is in seconds and not in hours):

c(NA, diff(round(as_datetime(df$time), units="hours")) == -3600)
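
One way to sidestep the units question: diff() on date-times returns a difftime whose units are picked automatically from the data, so the same expression can come back in seconds, minutes or hours. A sketch that forces the units explicitly (assuming lubridate is loaded for as_datetime and the subset is still sorted by descending time, as above):

d <- as.POSIXct(round(as_datetime(df$time), units = "hours"))
# difftime(time1, time2) is time1 - time2; units = "hours" avoids the automatic secs/mins/hours choice
consecutive <- c(NA, as.numeric(difftime(d[-1], d[-length(d)], units = "hours")) == -1)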

I tried looking through various sources to no avail.

Thank you

  • I think you have a few too many moving parts in this question. I tried asking the question in terms of time groupings here: https://stackoverflow.com/questions/46249312/grouping-by-buckets-of-time (something I couldn't personally solve, but surely someone else can). We can generalise it to the time + coordinate groupings afterwards. – lebelinoz Sep 16 '17 at 01:49
  • By the way, I'm not familiar with that time format. What is it? – lebelinoz Sep 16 '17 at 01:50
  • It's a unix timestamp – Federico Cupellini Sep 16 '17 at 09:22
  • [this](https://stackoverflow.com/questions/31375959/how-to-group-sequential-event-time-sequences-with-breaks-between-events-to-fin) seems useful – Federico Cupellini Sep 16 '17 at 15:53

1 Answer

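The packages aren't stated explicitly; presumably something like the following (plyr and dplyr share several function names, so the load order matters):

library(plyr)      # ddply(), summarize
library(dplyr)     # arrange(), first(), last()
library(lubridate) # as_datetime()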

arrange the DF by time

tmp <- arrange(df, time)

create an index to identify the "time clusters" for each ID (I convert the unix timestamp to a datetime and round it to the closest full hour)

# diff() on the numeric unix time is in seconds: consecutive hourly probes differ by 3600
run <- ddply(tmp, .(ID), transform, run=cumsum(c(1, diff(as.numeric(round(as_datetime(time), units="hours"))) != 3600)))
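
To see what the cumsum(...) trick does, here is a toy illustration with made-up hourly timestamps (three hourly probes, a gap, then two more); every break in the hourly sequence starts a new run:

t <- c(0, 3600, 7200, 21600, 25200)
cumsum(c(1, diff(t) != 3600))
# [1] 1 1 1 2 2   -> the first three rows form run 1, the last two form run 2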

I wrap it up, assigning to the first and last occurrences of the group the values that could have changed (you can choose to use the mean, median or other functions)

final <- ddply(run, .(ID, run), summarize, group_id=first(group_id), longs=first(longs), lats=first(lats), zone=first(zone), value1_start=first(value1), value1_end=last(value1), start_event=first(time), end_event=last(time))

The result is nice and neat

          coordinates    ID run group_id    zone  value1_start  value1_end start_event  end_event
1 (9.22888, 45.52752) 30907   1        1     N/A            54          54  1496872804 1496872804
2  (9.2289, 45.52752) 30907   2        1     N/A            52          52  1496959205 1496959205
3 (9.20471, 45.47467) 30907   3        1       3            91          91  1497006004 1497006004
4 (9.20394, 45.46005) 30907   4        1       4            85          85  1497013203 1497016805
5 (9.22661, 45.48286) 30907   5        1      22            71          69  1497369603 1497387604
6 (9.22701, 45.47922) 30907   6        1      22            91          91  1497412804 1497412804
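
The question also asks for end_time = time of the last observation + 60 min; since time is a unix timestamp in seconds, one possible final touch (not part of the steps above) is:

final$end_event <- final$end_event + 3600   # extend the last observation by 60 minutes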