0

I’m trying to reorganize my data: I’ve got hundreds of thousand of lines in a single data.table, and I need to merge some of these lines into one line “events”.

My data is structured in three columns: Vid (“Value identifiers”), Date and Value.

There are four “Value identifiers”, thus four type of lines: T for temperature, X for weight, O for offset, and R for a nine digit code to identify an individual.

Usually there are many O lines (and T lines from time to times), until an “event” is present and composed as follows: one to three R lines and then a variable number of X lines. The event finish either by a series of other O lines (see. Example1), or another detection of an individual (which may be the same, or another): one to three R lines (see. Example2).

Example1:

Vid Date            Value
O   08.12.19 02:52  355098
T   08.12.19 02:52  2790
O   08.12.19 02:52  354840
O   08.12.19 02:52  3550
R   08.12.19 03:27  700D98F47
R   08.12.19 03:27  700D98F47
R   08.12.19 03:27  700D98F47
X   08.12.19 03:27  753120
X   08.12.19 03:27  738112
X   08.12.19 03:27  743529
X   08.12.19 03:27  747500
X   08.12.19 03:27  743649
X   08.12.19 03:27  748238
X   08.12.19 03:27  744651
X   08.12.19 03:27  749100
X   08.12.19 03:27  750620
X   08.12.19 03:27  749487
X   08.12.19 03:27  762122
X   08.12.19 03:27  759209
X   08.12.19 03:27  756425
X   08.12.19 03:27  756655
X   08.12.19 03:27  757576
X   08.12.19 03:27  754173
X   08.12.19 03:27  754731
X   08.12.19 03:27  752270
X   08.12.19 03:27  749258
O   08.12.19 03:31  371176
O   08.12.19 03:31  371511
O   08.12.19 03:31  370918

Example2 :

Vid Date            Value
O   08.12.19 02:52  355098
O   08.12.19 02:52  354840
T   08.12.19 02:52  2790
O   08.12.19 02:52  3550
R   08.12.19 03:27  700D98F47
R   08.12.19 03:27  700D98F47
X   08.12.19 03:27  753120
X   08.12.19 03:27  738112
X   08.12.19 03:27  743529
X   08.12.19 03:27  747500
X   08.12.19 03:27  743649
X   08.12.19 03:27  748238
X   08.12.19 03:27  744651
X   08.12.19 03:27  749100
X   08.12.19 03:27  750620
X   08.12.19 03:27  749487
X   08.12.19 03:27  762122
X   08.12.19 03:27  759209
X   08.12.19 03:27  756425
X   08.12.19 03:27  756655
R   08.12.19 03:27  600D56F20
X   08.12.19 03:27  754731
X   08.12.19 03:27  752270
X   08.12.19 03:27  749258
X   08.12.19 03:27  749257
O   08.12.19 03:31  371176
O   08.12.19 03:31  371511
O   08.12.19 03:31  370918

I’m trying to merge the data of each "event" into one line, like this:

Date                R           R           R           X1      X2      X3  
08.12.19 03:27      700D98F47   700D98F47   700D98F47   753120  738112  743529  …

Even with the Date repeating beneath each value would be ok

Moreover, Including the offset before and after the event would be very useful.

Date                O1      R           R           R           X1      X2      X3           O2 
08.12.19 03:27      355085  700D98F47   700D98F47   700D98F47   753120  738112  743529  …    371176

And for Example 2:

Date                O1      R           R           R           X1      X2      X3           O2 
08.12.19 03:27      355085  700D98F47   700D98F47   NA          753120  738112  743529  …    371176
08.12.19 03:27      355085  600D56F20   NA          NA          754731  752270  749258  …    371176

I’ve seached and only found solutions for similar questions using sed or awk commands on terminals.

awk: https://www.theunixschool.com/2012/05/awk-join-or-merge-lines-on-finding.html

sed: sed: conditional merge of multiple lines

I would prefer to keep all the data cleaning on R, therefore I tried using the package Kmisc to use awk (https://www.rdocumentation.org/packages/Kmisc/versions/0.5.0/topics/awk)

awk("/R/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}", tempfileExample)

And using the command system(paste("sed ...’”) to use sed.

system(paste("sed ':a;$!N;s/\1//;ta;P;D'" ,tempfileExample))

but I can't figure out how to change the part of the codes that indicate how the merging is done: e.g. “a;$!N;s/\1//“

Is it possible to do that using sed or awk, on R? Are there any other better ways do to that on R?

Thanks for your help and time

NSi
  • 15
  • 3
  • Have you tried solutions from here - https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format – Ronak Shah Oct 20 '20 at 10:11
  • this doesn't seems to fit as I need to merge lines from different values. The answer from @potong works well in the terminal. Any idea how to integrate it into R? If I use `system(paste("sed code ,dt))` I get `Error: '\s' is an unrecognized escape in character string starting ""sed -E '/^R/{s/^R\s"`and if I escape those "\" with "\\" I get `"error in running command"`. – NSi Oct 22 '20 at 12:18

1 Answers1

1

This might give you a start (GNU sed):

sed -E '/^R/{s/^R\s+//;:a;N;s/\nR.*\s+/ /;ta;bc;:b;N;:c;s/\nX.*\s/ /;tb;P;D};d' file

Gather up any R records followed by any X records, removing the record identifiers for all records and the date and times for all but the first of the collection.

Delete all other records.

potong
  • 55,640
  • 6
  • 51
  • 83
  • Thanks, it's just missing the NA values when R are only 2 or 1 (see desired output for example 2), but it's already very helpful. `08.12.19 03:27 700D98F47 700D98F47 753120 738112 743529 ... 08.12.19 03:27 600D56F20 754731 752270 749258 749257 ...` – NSi Oct 22 '20 at 12:04
  • Sorry I only noticed now I didn't mark the question as solved (I had given a point but it seems like I don't have enough points to give them). Even if I still haven't figured out how to modify this code to get the NA values (when R lines are less than 3), your solution helped me a lot in the cleaning of my data! Thank you! – NSi Jan 28 '21 at 21:53