How to join lines not starting with specific pattern to the previous line in UNIX?

Question

Please take a look at the sample file and the desired output below to understand what I am looking for.

It can be done with loops in a shell script, but I am struggling to come up with an awk/sed one-liner.

SampleFile.txt

These are leaves.
These are branches.
These are greenery which gives
oxygen, provides control over temperature
and maintains cleans the air.
These are tigers
These are bears
and deer and squirrels and other animals.
These are something you want to kill
Which will see you killed in the end.
These are things you must to think to save your tomorrow.

Desired output

These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.
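
For comparison, the loop-based approach mentioned in the question might look like the following sketch. The asker's actual script is not shown, so the function name join_these_loop and every detail here are assumptions. It reads standard input, and a shell read loop like this is likely to be slow on very large files.

```shell
# join_these_loop is a hypothetical name; reads lines from standard input.
join_these_loop() {
    buf=''
    first=1
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            These*)
                # A new logical line starts: flush the previous one, if any.
                [ "$first" -eq 1 ] || printf '%s\n' "$buf"
                buf=$line ;;
            *)
                # Continuation line: append it with a single space.
                if [ "$first" -eq 1 ]; then buf=$line; else buf="$buf $line"; fi ;;
        esac
        first=0
    done
    # Flush the last logical line unless the input was empty.
    [ "$first" -eq 1 ] || printf '%s\n' "$buf"
}

printf '%s\n' 'These are bears' 'and deer' 'These are tigers' | join_these_loop
```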

And what's the pattern? No interpunction and next line starts with lowercase character? Or just "not These"? — Benjamin W., Jun 21 '16 at 16:55
If you can do it in a shell script, call that shell script. That's your one-liner. — Kusalananda, Jun 21 '16 at 17:02
I'm serious. If you have a solution, why do you need another? Is your shell looping solution too slow because you have massive amounts of input, or is there some other issues with it? — Kusalananda, Jun 21 '16 at 17:24
@Kusalananda: The file I am trying to deal with is larger than 4GB. Apart from the time factor (a major reason), as you rightly mentioned, it's curiosity! — instinct246, Jun 21 '16 at 18:02
@BenjaminW. : I want all the lines to start with 'These' and the following lines to be concatenated until the next line appears starting with 'These'. — instinct246, Jun 21 '16 at 18:04
@MarkPlotnick: A one-liner would be good. However, I am curious to hear your thoughts/solutions. — instinct246, Jun 21 '16 at 18:05

Benjamin W. · Answer 1 · 2019-04-08T19:05:40.543

With sed:

sed ':a;N;/\nThese/!s/\n/ /;ta;P;D' infile

resulting in

These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.

Here is how it works:

sed '
:a                   # Label to jump to
N                    # Append next line to pattern space
/\nThese/!s/\n/ /    # If the newline is NOT followed by "These", append
                     # the line by replacing the newline with a space
ta                   # If we changed something, jump to label
P                    # Print part until newline
D                    # Delete part until newline
' infile

The N;P;D is the idiomatic way of keeping multiple lines in the pattern space; the conditional branching part takes care of the situation where we append more than one line.
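
To see the loop in action, here is a minimal sketch on a three-line input. It assumes GNU sed, and the wrapper name join_these_sed is made up for illustration.

```shell
# join_these_sed is a hypothetical wrapper name; assumes GNU sed.
join_these_sed() {
    sed ':a;N;/\nThese/!s/\n/ /;ta;P;D' "$@"
}

# Three physical lines that form two logical lines.
printf '%s\n' 'These are bears' 'and deer' 'These are tigers' | join_these_sed
# prints:
#   These are bears and deer
#   These are tigers
```

GNU sed prints the pattern space when N finds no further input. Strictly POSIX seds may quit without printing at that point, so the last line can be lost there; guarding the command as $!N is a common workaround.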

This works with GNU sed. For other seds, such as the one on macOS, the one-liner has to be split up so that the label and the branch are in separate commands; the newlines may have to be escaped, and an extra semicolon is needed:

sed -e ':a' -e 'N;/'$'\n''These/!s/'$'\n''/ /;ta' -e 'P;D;' infile

(This last command is untested; see this answer for differences between different seds and how to handle them.)

Another alternative is to enter the newlines literally:

sed -e ':a' -e 'N;/\
These/!s/\
/ /;ta' -e 'P;D;' infile

But then, by definition, it's no longer a one-liner.

Thanks Benjamin! As you have rightly mentioned this works fine in GNU but gives the error below in Solaris: "Label too long: :a;N;/\nThese/!s/\n/ /;ta;P;D" (for the first command). "sed: command garbled: N;/ " (For the second command). Nevertheless, this is very useful with the explanations you have provided. I will check further as my script runs on solaris. — instinct246, Jun 22 '16 at 07:17
@instinct246 it might work with literal newlines; see the addition to the answer. — Benjamin W., Jun 22 '16 at 13:32

GMichael · Accepted Answer · 2016-06-22T09:50:48.780

Please try the following:

awk 'BEGIN {accum_line = "";} /^These/{if(length(accum_line)){print accum_line; accum_line = "";}} {accum_line = (length(accum_line) ? accum_line " " : "") $0;} END {if(length(accum_line)){print accum_line; }}' < data.txt

The code consists of three parts:

The block marked BEGIN is executed before any input is read. It is useful for global initialization.
The block marked END is executed after the last line has been processed. It is good for wrapping things up, here printing the last collected line, which no later line starting with These would otherwise flush.
The rest is the code run for each input line. First, the pattern is tested and, on a match, the collected line is printed and reset. Second, the current line is appended to the collected data regardless of its contents.
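
The three parts can be laid out as a commented script. This is a sketch of the same structure rather than the answer's exact text: the function name join_these_awk and the variable name acc are made up, and the accumulation step adds a space only when something has already been collected, so that output lines carry no leading blank.

```shell
# join_these_awk is a hypothetical name; reads files given as arguments, or stdin.
join_these_awk() {
    awk '
    BEGIN    { acc = "" }                       # runs once, before any input
    /^These/ {                                  # a new logical line starts here
        if (length(acc)) print acc              # flush the previous logical line
        acc = ""
    }
    { acc = (length(acc) ? acc " " : "") $0 }   # runs for every input line
    END      { if (length(acc)) print acc }     # flush the last logical line
    ' "$@"
}

printf '%s\n' 'These are bears' 'and deer' 'These are tigers' | join_these_awk
```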

Thanks - this works but omits the last line. It would be great if you could explain how the code works while I am trying to increase my literacy in awk. — instinct246, Jun 21 '16 at 18:06

score 2 · Answer 3 · answered Jun 21 '16 at 17:03

awk '$1=="These"{if(NR>1)print row;row=$0}$1!="These"{row=row " " $0}END{print row}'

You can take it from there: blank lines, separators, and other unspecified behavior are not handled (untested).

tomc

Thanks tomc! I will check on this. – instinct246 Jun 21 '16 at 18:08
@Kusalananda is doing a much better job of expanding and explaining – tomc Jun 21 '16 at 18:32

score 2 · Answer 4 · answered Jun 21 '16 at 18:48

Another awk, if your awk supports a multi-character RS (gawk does):

$ awk -v RS="These" 'NR>1{$1=$1; print RS, $0}' file

These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.

Explanation: set the record separator to "These" and skip the first (empty) record. Reassigning a field forces awk to rebuild the record; then print the record separator followed by the rest of the record.

That will behave undesirably if `These` appears in the middle of a line. The OP said he's interested in "lines starting with...". Maybe you meant to use `RS='(^|\n)These'` or similar. Also it will compress all chains of whitespace to single blank chars. Maybe you meant to add `-F'\n'`. — Ed Morton, Jun 21 '16 at 19:16
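
The whitespace point in this comment can be checked with any POSIX awk. The sketch below uses two made-up helper names: rebuild_record shows what the $1=$1 reassignment does to spacing, and join_keep_spacing joins lines with printf (the technique of Answer 5) so that spacing inside each line is untouched. The multi-character RS itself is left out, since it needs gawk or a similar awk.

```shell
# Hypothetical helper names, for illustration only.
rebuild_record() {
    # Assigning to a field makes awk rebuild $0 with single OFS separators.
    awk '{ $1 = $1; print }'
}

join_keep_spacing() {
    # Print a newline before lines starting with "These", a space before others.
    awk '{ printf "%s%s", (NR>1 ? (/^These/ ? ORS : OFS) : ""), $0 } END { print "" }'
}

printf 'These  are   spaced\nout words\n' | rebuild_record
# prints:
#   These are spaced
#   out words

printf 'These  are   spaced\nout words\n' | join_keep_spacing
# prints:
#   These  are   spaced out words
```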

score 2 · Answer 5 · answered Jun 21 '16 at 19:06

$ awk '{printf "%s%s", (NR>1 ? (/^These/?ORS:OFS) : ""), $0} END{print ""}' file
These are leaves.
These are branches.
These are greenery which gives oxygen, provides control over temperature and maintains cleans the air.
These are tigers
These are bears and deer and squirrels and other animals.
These are something you want to kill Which will see you killed in the end.
These are things you must to think to save your tomorrow.

Kusalananda · Answer 6 · 2016-06-21T18:39:18.373

Not a one-liner (but see end of answer!), but an awk-script:

#!/usr/bin/awk -f

NR == 1     { line = $0; next }
/^These/    { print line; line = $0 }
! /^These/  { line = line " " $0 }
END         { print line }

Explanation:

I'm accumulating lines: each line that starts with "These" is built up with the following lines that do not start with "These", and the completed line is output whenever the next line with "These" at the beginning is found.

Store the first line (the first "record") and skip to the next one, so that it is not printed twice.
If the line starts with "These", print the accumulated (previous, now complete) line and replace whatever we have collected so far with the current line.
If it doesn't start with "These", accumulate the line (i.e. concatenate it with the previously read incomplete lines, with a space in between).
When there's no more input, print the last accumulated (now complete) line.

Run like this:

$ ./script.awk data.in

As a one-liner:

$ awk 'NR==1{c=$0;next} /^These/{print c;c=$0} !/^These/{c=c" "$0} END{print c}' data.in

... but why you would want to run anything like that on the command line is beyond me.

EDIT: I saw that it is the specific string "These" (/^These/) that should be matched. Previously my code looked for uppercase letters at the start of the line (/^[A-Z]/).

Fantastic! This works and moreover I am able to understand thoroughly how it works (from your detailed explanation.) Thanks! — instinct246, Jun 22 '16 at 05:26

score 0 · Answer 7 · answered Feb 15 '20 at 16:47

Here is a sed program which avoids branches. I tested it with the --posix option. The trick is to use an "anchor" (a string which does not occur in the file):

 sed --posix -n '/^These/!{;s/^/DOES_NOT_OCCUR/;};H;${;x;s/^\n//;s/\nDOES_NOT_OCCUR/ /g;p;}'

Explanation:

Write DOES_NOT_OCCUR at the beginning of lines not starting with "These":

/^These/!{;s/^/DOES_NOT_OCCUR/;};

Append the pattern space to the hold space:

H;

If the last line is read, exchange pattern space and hold space:

${;x;

Remove the newline at the beginning of the pattern space, which the H command added when it appended the first line to the hold space:

s/^\n//;

Replace all newlines followed by DOES_NOT_OCCUR with blanks and print the result:

s/\nDOES_NOT_OCCUR/ /g;p;}

Note that the whole file is read in sed's process memory, but with only 4GB this should not be a problem.
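
The approach depends on the anchor really being absent from the input. Here is a sketch that checks this with grep before running the same sed program. The function name join_with_anchor is hypothetical, the anchor is assumed to contain no regex metacharacters or slashes, and --posix is left out so that the sketch does not depend on that GNU option.

```shell
# join_with_anchor is a hypothetical name; $1 is the input file.
join_with_anchor() {
    # Placeholder string; assumed free of regex metacharacters and slashes.
    anchor='DOES_NOT_OCCUR'
    # Refuse to run if the anchor already appears somewhere in the input.
    if grep -F -q -- "$anchor" "$1"; then
        echo "anchor string occurs in input; pick another one" >&2
        return 1
    fi
    # Mark continuation lines, collect everything in the hold space,
    # then turn "newline + anchor" into a single space at the end.
    sed -n '/^These/!{;s/^/'"$anchor"'/;};H;${;x;s/^\n//;s/\n'"$anchor"'/ /g;p;}' "$1"
}
```

It would be called as join_with_anchor SampleFile.txt, and it returns a non-zero status without printing any joined output when the anchor is found in the file.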
