I have a datatable with many rows and I would like to conditionally group two columns, namely Begin and End. These columns stand for a certain month in which the associated person was doing something. Here is some sample data (you can use R to read in, or find the pure tables below if you dont use R):
# base:
test <- read.table(
text = "
1 A mnb USA prim 4 12
2 A mnb USA x 13 15
3 A mnb USA un 16 25
4 A mnb USA fdfds 1 2
5 B ghf CAN sdg 3 27
6 B ghf CAN hgh 28 29
7 B ghf CAN y 24 31
8 B ghf CAN ghf 38 42
",header=F)
library(data.table)
setDT(test)
names(test) <- c("row","Person","Name","Country","add info","Begin","End")
out <- read.table(
text = "
1 A mnb USA fdfds 1 2
2 A mnb USA - 4 25
3 B ghf CAN - 3 31
4 B ghf CAN ghf 38 42
",header=F)
setDT(out)
names(out) <- c("row","Person","Name","Country","add info","Begin","End")
The grouping should be done as follows: If person A did hiking from month 4 to month 15 and travelling from month 16 to month 24, I would group the consecutive (i.e. without break) activity from month 4 to month 24. If afterwards person A did surfing from month 25 to month 28, I would also add this, and the whole group activity would last from 4 to 28. Now problematic are cases were there are overlapping periods, for example person A might also do fishing from 11 to 31, so the whole thing would become 4 to 31. However, if person A did something from 1 to 2, that would be a separate activity (as compared to 1 to 3, which would also have to be added, because 3 is connected to 4). I hope that was clear, if not you can find more examples in the above code. I am using datatable, because my dataset is quite large. I have started with sqldf so far, but it's problematic if you have so many activities per person (let's say 8 or more). Can this be done in datatable, or plyr, or sqldf? Please note: I am also looking for an answer in SQL because I could use that directly in sqldf or try to convert it to another language. sqldf supports (1) the SQLite backend database (by default), (2) the H2 java database, (3) the PostgreSQL database and (4) sqldf 0.4-0 onwards also supports MySQL.
Edit: Here are the 'pure' tables:
In:
Person Name Country add info Begin End
A mnb USA prim 4 12
A mnb USA x 13 15
A mnb USA un 16 25
A mnb USA fdfds 1 2
B ghf CAN sdg 3 27
B ghf CAN hgh 28 29
B ghf CAN y 24 31
B ghf CAN ghf 38 42
Out:
A mnb USA fdfds 1 2
A mnb USA - 4 25
B ghf CAN - 3 31
B ghf CAN ghf 38 42