Conditional merge of multiple lines in R

Question

I’m trying to reorganize my data: I’ve got hundreds of thousand of lines in a single data.table, and I need to merge some of these lines into one line “events”.

My data is structured in three columns: Vid (“Value identifiers”), Date and Value.

There are four “Value identifiers”, thus four type of lines: T for temperature, X for weight, O for offset, and R for a nine digit code to identify an individual.

Usually there are many O lines (and T lines from time to times), until an “event” is present and composed as follows: one to three R lines and then a variable number of X lines. The event finish either by a series of other O lines (see. Example1), or another detection of an individual (which may be the same, or another): one to three R lines (see. Example2).

Example1:

Vid Date            Value
O   08.12.19 02:52  355098
T   08.12.19 02:52  2790
O   08.12.19 02:52  354840
O   08.12.19 02:52  3550
R   08.12.19 03:27  700D98F47
R   08.12.19 03:27  700D98F47
R   08.12.19 03:27  700D98F47
X   08.12.19 03:27  753120
X   08.12.19 03:27  738112
X   08.12.19 03:27  743529
X   08.12.19 03:27  747500
X   08.12.19 03:27  743649
X   08.12.19 03:27  748238
X   08.12.19 03:27  744651
X   08.12.19 03:27  749100
X   08.12.19 03:27  750620
X   08.12.19 03:27  749487
X   08.12.19 03:27  762122
X   08.12.19 03:27  759209
X   08.12.19 03:27  756425
X   08.12.19 03:27  756655
X   08.12.19 03:27  757576
X   08.12.19 03:27  754173
X   08.12.19 03:27  754731
X   08.12.19 03:27  752270
X   08.12.19 03:27  749258
O   08.12.19 03:31  371176
O   08.12.19 03:31  371511
O   08.12.19 03:31  370918

Example2 :

Vid Date            Value
O   08.12.19 02:52  355098
O   08.12.19 02:52  354840
T   08.12.19 02:52  2790
O   08.12.19 02:52  3550
R   08.12.19 03:27  700D98F47
R   08.12.19 03:27  700D98F47
X   08.12.19 03:27  753120
X   08.12.19 03:27  738112
X   08.12.19 03:27  743529
X   08.12.19 03:27  747500
X   08.12.19 03:27  743649
X   08.12.19 03:27  748238
X   08.12.19 03:27  744651
X   08.12.19 03:27  749100
X   08.12.19 03:27  750620
X   08.12.19 03:27  749487
X   08.12.19 03:27  762122
X   08.12.19 03:27  759209
X   08.12.19 03:27  756425
X   08.12.19 03:27  756655
R   08.12.19 03:27  600D56F20
X   08.12.19 03:27  754731
X   08.12.19 03:27  752270
X   08.12.19 03:27  749258
X   08.12.19 03:27  749257
O   08.12.19 03:31  371176
O   08.12.19 03:31  371511
O   08.12.19 03:31  370918

I’m trying to merge the data of each "event" into one line, like this:

Date                R           R           R           X1      X2      X3  
08.12.19 03:27      700D98F47   700D98F47   700D98F47   753120  738112  743529  …

Even with the Date repeating beneath each value would be ok

Moreover, Including the offset before and after the event would be very useful.

Date                O1      R           R           R           X1      X2      X3           O2 
08.12.19 03:27      355085  700D98F47   700D98F47   700D98F47   753120  738112  743529  …    371176

And for Example 2:

Date                O1      R           R           R           X1      X2      X3           O2 
08.12.19 03:27      355085  700D98F47   700D98F47   NA          753120  738112  743529  …    371176
08.12.19 03:27      355085  600D56F20   NA          NA          754731  752270  749258  …    371176

I’ve seached and only found solutions for similar questions using sed or awk commands on terminals.

awk: https://www.theunixschool.com/2012/05/awk-join-or-merge-lines-on-finding.html

sed: sed: conditional merge of multiple lines

I would prefer to keep all the data cleaning on R, therefore I tried using the package Kmisc to use awk (https://www.rdocumentation.org/packages/Kmisc/versions/0.5.0/topics/awk)

awk("/R/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}", tempfileExample)

And using the command system(paste("sed ...’”) to use sed.

system(paste("sed ':a;$!N;s/\1//;ta;P;D'" ,tempfileExample))

but I can't figure out how to change the part of the codes that indicate how the merging is done: e.g. “a;$!N;s/\1//“

Is it possible to do that using sed or awk, on R? Are there any other better ways do to that on R?

Thanks for your help and time

Have you tried solutions from here - https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format — Ronak Shah, Oct 20 '20 at 10:11
this doesn't seems to fit as I need to merge lines from different values. The answer from @potong works well in the terminal. Any idea how to integrate it into R? If I use `system(paste("sed code ,dt))` I get `Error: '\s' is an unrecognized escape in character string starting ""sed -E '/^R/{s/^R\s"`and if I escape those "\" with "\\" I get `"error in running command"`. — NSi, Oct 22 '20 at 12:18

score 1 · Accepted Answer · answered Oct 20 '20 at 13:28

1

This might give you a start (GNU sed):

sed -E '/^R/{s/^R\s+//;:a;N;s/\nR.*\s+/ /;ta;bc;:b;N;:c;s/\nX.*\s/ /;tb;P;D};d' file

Gather up any R records followed by any X records, removing the record identifiers for all records and the date and times for all but the first of the collection.

Delete all other records.

answered Oct 20 '20 at 13:28

potong

55,640
6
51
83

Thanks, it's just missing the NA values when R are only 2 or 1 (see desired output for example 2), but it's already very helpful. `08.12.19 03:27 700D98F47 700D98F47 753120 738112 743529 ... 08.12.19 03:27 600D56F20 754731 752270 749258 749257 ...` – NSi Oct 22 '20 at 12:04
Sorry I only noticed now I didn't mark the question as solved (I had given a point but it seems like I don't have enough points to give them). Even if I still haven't figured out how to modify this code to get the NA values (when R lines are less than 3), your solution helped me a lot in the cleaning of my data! Thank you! – NSi Jan 28 '21 at 21:53

Conditional merge of multiple lines in R

1 Answers1