
I have files containing data sampled at 20 Hz. Appended to each line are data packets from an IMU that are not synchronised with the 20 Hz data. The IMU data packets have a start marker (255,90) and an end marker (51). I am using the term "packet" for brevity; they are just comma-separated values. Packet 1 is not the same length as packet 2, and so on.

"2019-12-08 21:29:11.90",3390323,..[ CSV data ]..,1,"1 1025.357 ",[ incomplete packet from line above ],51,255,90,[ packet1 ],51,255,90,[ packet2 ],51,255,90,[ packet3 ],51,255,90,[ First part of packet4 ]

"2019-12-08 21:29:11.95",3390324,.............,1,"1 1025.367 ",[ Second part of packet4 ],51,255,90,[ packet5 ],51,255,90,[ packet6 ],51,255,90,[ packet7 ],51,255,90,[ First part of packet8 ]

I would like to parse the file so that I extract the timestamp together with the IMU packets from the first start marker onwards, then take the partial packet from the start of the next line and append it to the end of the current line, so that the output is in the form:

"2019-12-08 21:29:11.90",255,90,[ packet1 ],51,255,90,[ packet2 ],51,255,90,[ packet3 ],51,255,90,[ First part of packet4 ][ Second part of packet4 ],51

"2019-12-08 21:29:11.95",255,90,[ packet5 ],51,255,90,[ packet6 ],51,255,90,[ packet7 ],51,255,90,[ First part of packet8 ][ Second part of packet8 ],51

As requested, I have included my real-world example. This is just five lines. The last line would be deleted, as it would remain incomplete.

"2019-08-28 10:43:46.2",10802890,32,22.1991,-64,"1 1015.400 ",0,0,0,0,67,149,115,57,11,0,63,24,51,255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0

"2019-08-28 10:43:46.25",10802891,32,22.1991,-64,"1 1015.400 ",0,0,0,0,118,76,101,57,11,0,32,249,51,255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0

"2019-08-28 10:43:46.3",10802892,32,22.1991,-64,"1 1015.400 ",0,0,0,0,151,113,95,57,11,0,72,194,51,255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0

"2019-08-28 10:43:46.35",10802893,32,22.1991,-64,"1 1015.400 ",0,0,0,0,110,51,95,57,11,0,9,37,51,255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0

"2019-08-28 10:43:46.4",10802894,32,22.1991,-64,"1 1015.400 ",0,0,0,0,173,103,97,57,11,0,185,229,51,255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0

I would like to parse the data into the following format:

"2019-08-28 10:43:46.2",255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0,0,0,0,0,118,76,101,57,11,0,32,249,51

"2019-08-28 10:43:46.25",255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0,0,0,0,0,151,113,95,57,11,0,72,194,51

"2019-08-28 10:43:46.3",255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0,0,0,0,0,110,51,95,57,11,0,9,37,51

"2019-08-28 10:43:46.35",255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0,0,0,0,0,173,103,97,57,11,0,185,229,51

"2019-08-28 10:43:46.4",255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0

This last line would remain incomplete as there is no next line.

2 Answers


The following command pipes your_input_file into a sed command (GNU sed 4.8) that accomplishes the task. At least, it works for me with the files you provided (as they are at the time of writing, empty lines included).

cat your_input_file | sed '
s/,51,\(255,90,.*,51\),255,90,/,51\n,\1,255,90,/
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
$!N
H
$!d
${
x
s/^[^"]*//
s/\n\n\([^\n]*\)/,\1\n/g
}'

Clearly you can save the sed script in a file (named, for instance, myscript.sed):

#!/usr/bin/sed -f
# break each data line after the first ",51", i.e. after the tail that
# completes the previous record's partial packet
s/,51,\(255,90,.*,51\),255,90,/,51\n,\1,255,90,/
# move that tail in front of the timestamp and drop the non-IMU CSV
# fields; the pattern space becomes: tail,51\n"timestamp",255,90,...
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
# pull in the empty separator line, append everything to the hold
# space, and print nothing until the last line is reached
$!N
H
$!d
${
x
# drop the first tail: it completes a record from before the file began
s/^[^"]*//
# join each record with the tail taken from the record that follows it
s/\n\n\([^\n]*\)/,\1\n/g
}

and use it like this: ./myscript.sed your_input_file (after making it executable with chmod +x myscript.sed).

Note that if the first ,51, on each line were guaranteed to be followed by 255,90, (something which your fourth example violates: ",0,0,0,0,110,51,95,), then the first substitution command would reduce to s/,51,/,51\n,/.
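Under that assumption, the whole script would shrink to the following (an untested sketch, given that the assumption does not hold for your data):

s/,51,/,51\n,/
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
$!N
H
$!d
${
x
s/^[^"]*//
s/\n\n\([^\n]*\)/,\1\n/g
}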

Please test it and let me know whether I have correctly interpreted your question. I have not explained how the script works, for the simple reason that writing down an explanation would take me considerable time (I tend to be fairly meticulous when walking through a script, as you can see here, where I created another analogous sed script), and I want to be sure first that it does represent a solution for you.

Maybe shorter solutions are possible (even with sed itself). I'm not sure awk would allow a shorter solution; it would certainly offer far more readability than sed, but (I think) at the price of length. Indeed, as you can see from the other answer, the awk script is more readable but longer (369 bytes vs the sed script's 160 bytes).

Actually, even in the world of sed scripts, the one above is fairly inefficient, I guess, as it basically preprocesses each line and keeps appending each one to all the preceding ones, then does some processing on the resulting long multi-line pattern and prints it to screen.
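(The accumulate-then-process mechanism it relies on is the standard hold-space idiom, shown here in miniature on an unrelated task, purely as an illustration: line 1 is copied to the hold space, every following line is appended to it, and on the last line the accumulated block is swapped back into the pattern space, its newlines replaced with commas, and printed.)

sed -n '1h; 1!H; ${x; s/\n/,/g; p}' file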

Enlico
  • Which version are you using (`sed --version`)? Are you on MacOS? Check [this](https://stackoverflow.com/q/28072190/5825294) out. – Enlico Jan 23 '20 at 21:15
  • Yeah - I am using Mac OSX. I have added my real data to the question. – lethgowerboy Jan 23 '20 at 21:42
  • Based on your original example and on the description, after the processing each line should begin with `"date time",255,90`, which is not the case for the output relative to the long example, where all lines start with `"date time",1080289`. Please clarify and update your question accordingly. Besides, I wrote _real-**like**_ to suggest a reduced version of the real data, so as to avoid cluttering the page with too much data. – Enlico Jan 23 '20 at 21:48
  • I have edited the long example to be more consistent with my original question. I was afraid too much editing of my real **like** data would introduce errors. – lethgowerboy Jan 23 '20 at 23:57
  • The first of the five inputs has first packet `255,90,12,110,51`, but this sublist does not immediately follow the date and time in the corresponding output. If your expected output is unclear, it's not easy to help you. – Enlico Jan 24 '20 at 00:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/206588/discussion-between-lethgowerboy-and-enrico-maria-de-angelis). – lethgowerboy Jan 24 '20 at 13:57

When you are dealing with fields, you should be thinking awk. In this case, awk provides a simple solution -- so long as your record format does not change. Generally that wouldn't matter, but here it does...

Why? Because your wanted output does not match your problem description.

Why? Because in all records other than the fourth, the first 51 ending the data to append to the previous line is located in field 19 (with ',' as the field separator), while in the fourth record it is found in field 12.

So normally you would just scan forward through your fields to find the first 51, eliminating the need to know which field the first 51 is found in -- but using that method with your data does not produce your wanted results. (The third output line would get a short remainder from the fourth input line, reducing its length and instead pushing the additional packet data to the fourth line of output.)

However, sacrificing that flexibility and considering fields 7-19 to be packets belonging with the previous line allows your wanted output to be matched exactly. (It also simplifies the script, but at the cost of flexibility in record format.)
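For reference, the scan-forward method just described might look something like the sketch below (hypothetical code, shown only for comparison; it locates the first field equal to 51 instead of assuming field 19, and therefore splits the fourth record at field 12, producing the shifted output described above):

#!/usr/bin/awk -f

BEGIN { FS=","; dtfield=""; packets="" }

NF > 1 {
    pkend = 7                           # scan forward for the first 51
    while (pkend < NF && $pkend != 51)
        pkend++
    if (length(packets) > 0) {          # handle 1st part of next line
        for (i=7; i<=pkend; i++)        # append tail through the first 51
            packets=packets "," $i
        print dtfield packets "\n"      # output the date and packet data
        packets=""                      # reset packet data empty
    }
    dtfield=$1                          # for every line, store date field
    for (i=pkend+1; i<=NF; i++)         # save the remainder for the next cycle
        packets=packets "," $i
}

END {
    print dtfield packets "\n"          # output final line
}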

The short awk script that does reproduce your wanted output, taking the file to process as its first argument, can be written as follows:

#!/usr/bin/awk -f

BEGIN { FS=","; dtfield=""; packets=""; pkbeg=7; pkend=19 }

NF > 1 {
    if (length(packets) > 0) {          # handle 1st part of next line
    for (i=pkbeg; i<=pkend; i++)    # append packet data through field 19
            packets=packets "," $i
        print dtfield packets "\n"      # output the date and packet data
        packets=""                      # reset packet data empty
    }
    dtfield=$1                          # for every line, store date field
    for (i=pkend+1; i<=NF; i++)         # loop from field 20 to end, saving data
        packets=packets "," $i
}

END {
    print dtfield packets "\n"          # output final line
}

Don't forget to chmod +x scriptname to make the script executable.

Example Use/Output

(shown without fixed-width formatting due to the output line length -- as was done in the question)

$ ./imupackets.awk imu

"2019-08-28 10:43:46.2",255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0,0,0,0,0,118,76,101,57,11,0,32,249,51

"2019-08-28 10:43:46.25",255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0,0,0,0,0,151,113,95,57,11,0,72,194,51

"2019-08-28 10:43:46.3",255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0,0,0,0,0,110,51,95,57,11,0,9,37,51

"2019-08-28 10:43:46.35",255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0,0,0,0,0,173,103,97,57,11,0,185,229,51

"2019-08-28 10:43:46.4",255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0

Look things over and let me know if you have questions.

David C. Rankin