I’m trying to reorganize my data: I’ve got hundreds of thousand of lines in a single data.table, and I need to merge some of these lines into one line “events”.
My data is structured in three columns: Vid (“Value identifiers”), Date and Value.
There are four “Value identifiers”, thus four type of lines: T for temperature, X for weight, O for offset, and R for a nine digit code to identify an individual.
Usually there are many O lines (and T lines from time to times), until an “event” is present and composed as follows: one to three R lines and then a variable number of X lines. The event finish either by a series of other O lines (see. Example1), or another detection of an individual (which may be the same, or another): one to three R lines (see. Example2).
Example1:
Vid Date Value
O 08.12.19 02:52 355098
T 08.12.19 02:52 2790
O 08.12.19 02:52 354840
O 08.12.19 02:52 3550
R 08.12.19 03:27 700D98F47
R 08.12.19 03:27 700D98F47
R 08.12.19 03:27 700D98F47
X 08.12.19 03:27 753120
X 08.12.19 03:27 738112
X 08.12.19 03:27 743529
X 08.12.19 03:27 747500
X 08.12.19 03:27 743649
X 08.12.19 03:27 748238
X 08.12.19 03:27 744651
X 08.12.19 03:27 749100
X 08.12.19 03:27 750620
X 08.12.19 03:27 749487
X 08.12.19 03:27 762122
X 08.12.19 03:27 759209
X 08.12.19 03:27 756425
X 08.12.19 03:27 756655
X 08.12.19 03:27 757576
X 08.12.19 03:27 754173
X 08.12.19 03:27 754731
X 08.12.19 03:27 752270
X 08.12.19 03:27 749258
O 08.12.19 03:31 371176
O 08.12.19 03:31 371511
O 08.12.19 03:31 370918
Example2 :
Vid Date Value
O 08.12.19 02:52 355098
O 08.12.19 02:52 354840
T 08.12.19 02:52 2790
O 08.12.19 02:52 3550
R 08.12.19 03:27 700D98F47
R 08.12.19 03:27 700D98F47
X 08.12.19 03:27 753120
X 08.12.19 03:27 738112
X 08.12.19 03:27 743529
X 08.12.19 03:27 747500
X 08.12.19 03:27 743649
X 08.12.19 03:27 748238
X 08.12.19 03:27 744651
X 08.12.19 03:27 749100
X 08.12.19 03:27 750620
X 08.12.19 03:27 749487
X 08.12.19 03:27 762122
X 08.12.19 03:27 759209
X 08.12.19 03:27 756425
X 08.12.19 03:27 756655
R 08.12.19 03:27 600D56F20
X 08.12.19 03:27 754731
X 08.12.19 03:27 752270
X 08.12.19 03:27 749258
X 08.12.19 03:27 749257
O 08.12.19 03:31 371176
O 08.12.19 03:31 371511
O 08.12.19 03:31 370918
I’m trying to merge the data of each "event" into one line, like this:
Date R R R X1 X2 X3
08.12.19 03:27 700D98F47 700D98F47 700D98F47 753120 738112 743529 …
Even with the Date repeating beneath each value would be ok
Moreover, Including the offset before and after the event would be very useful.
Date O1 R R R X1 X2 X3 O2
08.12.19 03:27 355085 700D98F47 700D98F47 700D98F47 753120 738112 743529 … 371176
And for Example 2:
Date O1 R R R X1 X2 X3 O2
08.12.19 03:27 355085 700D98F47 700D98F47 NA 753120 738112 743529 … 371176
08.12.19 03:27 355085 600D56F20 NA NA 754731 752270 749258 … 371176
I’ve seached and only found solutions for similar questions using sed or awk commands on terminals.
awk: https://www.theunixschool.com/2012/05/awk-join-or-merge-lines-on-finding.html
sed: sed: conditional merge of multiple lines
I would prefer to keep all the data cleaning on R, therefore I tried using the package Kmisc to use awk (https://www.rdocumentation.org/packages/Kmisc/versions/0.5.0/topics/awk)
awk("/R/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}", tempfileExample)
And using the command system(paste("sed ...’”) to use sed.
system(paste("sed ':a;$!N;s/\1//;ta;P;D'" ,tempfileExample))
but I can't figure out how to change the part of the codes that indicate how the merging is done: e.g. “a;$!N;s/\1//“
Is it possible to do that using sed or awk, on R? Are there any other better ways do to that on R?
Thanks for your help and time