How can I merge multiple lines to create exactly two records based on field separators?

Question

I need help writing a Unix script loop to process the following data:

200250|Wk50|200212|January|20024|Quarter4|2002|2002
|2003-01-12
|2003-01-18
|2003-01-05
|2003-02-01
|2002-11-03
|2003-02-01|
|2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002
|2002-10-27
|2002-11-02
|2002-10-06
|2002-11-02
|2002-08-04
|2002-11-02|
|2003-02-01|||||||

I have data in the above format in a text file. I need to remove the newline at the end of every line whose following line starts with |. The output I need is:

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

I need some help to achieve this. These shell commands are giving me nightmares!

Sorry, I updated the post: I need newlines kept only where there is no '|' at the beginning of the next line. — yugesh, Sep 12 '14 at 05:20

michael · Answer 1 · 2014-09-19T03:10:33.367

The 'sed' approach:

sed ':a;N;$!ba;s/\n|/|/g' input.txt

Though, awk would be faster & easier to understand/maintain. I just had that example handy (a common solution for removing trailing newlines w/ sed).

EDIT:

To clarify the difference between this answer (option #1) and the alternative solution by @potong (which I actually prefer: sed ':a;N;s/\n|/|/;ta;P;D' file), which I'll call option #2:

Note that these are two of many possible options with sed. I actually prefer non-sed solutions, since in general they run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 entirely in memory, and option #2 as a stream. (Note: below, when I say "buffer", technically I mean "pattern space".)

Option #1 reads the whole file into memory:
- :a is just a label; N says append the next line to the buffer; if end-of-file ($) is not (!) reached, then branch (b) back to label :a ...
- then, after the whole file is read into memory, process the buffer with the substitution command (s), replacing all occurrences of "\n|" (newline followed by "|") with just a "|", on the entire (g) buffer

Option #2 processes just a couple of lines at a time:
- reads / appends the next line (N) into the buffer and processes it (s/\n|/|/); branches (t) back to label :a only if the substitution was successful; otherwise prints (P) and deletes (D) the buffer up to the first embedded newline ... and the stream continues.

Option #1 takes a lot more memory to run: in general, as much as the size of your file. Option #2 requires minimal memory; so little that I didn't bother to see what it correlates to (I'm guessing the length of a line).

Option #1 runs faster: in general, about twice as fast as option #2, but obviously it depends on the file and what is being done.
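To make the two processing styles concrete, here is a minimal sketch that runs both commands on a tiny sample and compares the results. The sample records and the temporary file are made up for illustration, and the one-line script syntax (labels followed by `;`) assumes GNU sed.

```shell
# Build a small sample file in the question's format (contents are hypothetical).
sample=$(mktemp)
printf '%s\n' 'A|1' '|2' '|3|' 'B|4' '|5' > "$sample"

# Option #1: slurp the whole file into the pattern space, then substitute once.
slurp_out=$(sed ':a;N;$!ba;s/\n|/|/g' "$sample")

# Option #2: stream; keep joining while the appended line starts with "|",
# otherwise print the finished record and carry on with the remainder.
stream_out=$(sed ':a;N;s/\n|/|/;ta;P;D' "$sample")

printf '%s\n' "$slurp_out"
[ "$slurp_out" = "$stream_out" ] && echo "outputs match"

rm -f "$sample"
```

Both commands should produce the same two merged records; they differ only in how much of the file sits in the pattern space at any one time.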

On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s):

$ du -h /tmp/foobar.txt
544M    /tmp/foobar.txt

$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real    0m1.564s
user    0m1.390s
sys     0m0.171s

$ time sed  ':a;N;s/\n|/|/;ta;P;D'  /tmp/foobar.txt  > /dev/null 
real    0m3.418s
user    0m3.239s
sys     0m0.163s

At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4197 11001 99 172427 558888 1 19:22 pts/10   00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 558188 (545M)

And option #2:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4401 11001 99  3468   864   3 19:22 pts/10   00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 236 (0M)

In summary (with commentary):

- If you have files of unknown size, streaming without buffering is the better decision.
- If every second matters, then buffering the entire file and processing it at once may be fine, but YMMV.
- My personal experience with tuning shell scripts is that awk or perl (or tr, but it's the least portable) or even bash may be preferable to using sed.
- Still, sed is a very flexible and powerful tool that gets a job done quickly and can be tuned later.

See also http://stackoverflow.com/questions/1251999/sed-how-can-i-replace-a-newline-n — michael, Sep 12 '14 at 05:17
This simply answers the question -- I agree it's not the best solution. No need for the down-votes: I agree that this has caveats. Please follow the link to the (extended) discussion on this solution & alternatives... The main point is that the OP didn't know (specifically wrt `sed`) to search for the question "how to remove newlines"; the rest is just details. — michael, Sep 15 '14 at 06:11
note: I just added a detailed comparison of this `sed` example and @potong's `sed` example. — michael, Sep 19 '14 at 03:13

John1024 · Accepted Answer · 2014-09-12T05:41:07.037

Here is an awk solution:

$ awk 'substr($0,1,1)=="|"{printf $0;next} {printf "\n"$0} END{print""}' data

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

Explanation:

Awk implicitly loops through every line in the file.

substr($0,1,1)=="|"{printf $0;next}

If this line begins with a vertical bar, then print it (without a final newline) and then skip to the next line. We are using printf here, as opposed to the more common print, so that newlines are not printed unless we explicitly ask for them.

{printf "\n"$0}

If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).

END{print""}

At the end of the file, print a newline.

Refinement

The above prints out an extra newline at the beginning of the file. If that is a problem, then it can be eliminated with just a minor change:

$ awk 'substr($0,1,1)=="|"{printf $0;next} {printf new $0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
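One caveat with the commands above: awk's printf treats its first argument as a format string, so a record containing a literal % could be misread. The sample data in the question has none, but a slightly more defensive sketch passes an explicit "%s" format. The sample file and the variable name `sep` below are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical sample: the first record contains a literal "%".
data=$(mktemp)
printf '%s\n' 'rec1|50%' '|x' '|y|' 'rec2|a' '|b' > "$data"

merged=$(awk '
  # Continuation line: append it without a newline.
  substr($0, 1, 1) == "|" { printf "%s", $0; next }
  # New record: end the previous record first (sep is empty for the first one).
  { printf "%s%s", sep, $0; sep = "\n" }
  # Terminate the last record, if there was any input at all.
  END { if (NR) print "" }
' "$data")

printf '%s\n' "$merged"
rm -f "$data"
```

The logic is the same as in the refinement; only the printf calls change, so the data is always printed as data rather than interpreted as a format.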

score 3 · Answer 3 · answered Sep 12 '14 at 06:28

This might work for you (GNU sed):

sed ':a;N;s/\n|/|/;ta;P;D' file

This processes the file a line at a time, as an alternative to @michael_n's answer, which slurps the whole file into memory before processing.

answered Sep 12 '14 at 06:28

potong

I am really confused about how these sed commands work. Now suppose I have to remove everything in each line of a file up to the third occurrence of a character. E.g. from a123a234a456a232323a2323 I need 456a232323a2323. What do I do with sed? – yugesh Sep 12 '14 at 07:54
@yugesh if in your browser you hover over a `sed` tag and choose `info`, you will find links there to help you. As to your other problem, it might be best to ask another question so as to help others who have a similar one (hint: think regexp `{m,n}` and the substitute command). – potong Sep 12 '14 at 08:39
Really, we probably don't need to post every variant of 'sed' that can do this. See the comment on my answer that covers this topic in depth, with alternatives, pros and cons. Here it is again: http://stackoverflow.com/questions/1251999/sed-how-can-i-replace-a-newline-n – michael Sep 15 '14 at 06:06
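For the side question in the comments, the `{m,n}` hint could be applied roughly as follows. This is only a sketch of one possible reading of the hint, not something posted in the thread; it assumes a sed that supports extended regular expressions via -E (GNU and BSD sed both do).

```shell
# Delete everything up to and including the third "a" on each line:
# "[^a]*a" matches one run of non-"a" characters followed by an "a",
# and {3} repeats that group three times, anchored at the start of the line.
trimmed=$(printf '%s\n' 'a123a234a456a232323a2323' | sed -E 's/^([^a]*a){3}//')
printf '%s\n' "$trimmed"
```

With the example string from the comment, this leaves 456a232323a2323.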

score 2 · Answer 4 · answered Sep 12 '14 at 06:00

You could do this simply with perl:

$ perl -0777pe 's/\n(?=\|)//g' file
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

score 1 · Answer 5 · answered Sep 12 '14 at 05:43

awk -f test.awk input.txt

test.awk

{
    if($0 ~ /^\|/)
    {
            array[i++] = $0
    }
    else
    {
            for(j=0;j<i;j++)
            {
                    line = line array[j];
            }
            i=0;
            print line
            line = $0;
    }
}

score 0 · Answer 6 · answered Sep 12 '14 at 11:21

awk -f inp.awk input | sed '/^$/d'

inp.awk

{
    if ($0 !~ /^\|/)
    {
        print line;
        line = $0;
    }
    else
    {
        line = line $0;
    }
}

answered Sep 12 '14 at 11:21

Sadhun

@cppcoder : In both my answer and your answer, last line is missed out in the output. Logic needs to be changed a little. – Sadhun Sep 12 '14 at 12:35
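One way the logic might be changed, as the comment above suggests, is to flush the pending record in an END block. The following is a sketch under that assumption, not the fix either author posted; the sample file is made up for illustration.

```shell
# Hypothetical sample input in the question's format.
input=$(mktemp)
printf '%s\n' 'A|1' '|2' 'B|3' '|4|' > "$input"

fixed=$(awk '
  # Continuation line: append it to the pending record.
  /^\|/ { line = line $0; next }
  # A new record starts: flush the previous one (nothing to flush on line 1).
  NR > 1 { print line }
  # Start accumulating the new record.
  { line = $0 }
  # Flush the final record, which both scripts above drop.
  END { if (NR) print line }
' "$input")

printf '%s\n' "$fixed"
rm -f "$input"
```

Because nothing is printed before the first record is complete, this version also avoids the leading empty line, so the extra `sed '/^$/d'` step is no longer needed.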
