Delete newline character in text file if next line is less than a certain length

Question

I'd like to create a script with any combination of bash, sed, awk, or perl that deletes the newline character of a line if the next line is less than a certain length. Let's say we want to delete the newline character if the next line is less than 5 characters. If we have this source text file:

hi hi hi hi hi
bye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pants
belt
paper paper paper

Here's the desired output:

hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

Here's a script that identifies all the lines that are less than 5 characters:

cat source.txt | awk 'length($0) < 5 { print NR }'

It returns this.

2
7

Here's a script that gets rid of the newlines (it's the line numbers from the previous script minus one):

perl -pe 'chomp if $.==1||$.==6' source.txt

How do I combine these two scripts? Or is there a better way to solve this?

Update

There were multiple correct answers (some didn't work on my Mac, but I think they'd work on other machines). Here's how long the correct answers took on my machine with a 769,811 line CSV file (40,000 lines had the newline character removed).

Ed Morton's awk solution: 23.7 seconds
wolfrevokcats perl with slurp: 4.5 seconds
John1024's solution didn't work on my Mac (but I think it works on other OSes)
ikegami's perl without slurp: Killed the task after 7 minutes

Please specify what should happen if the input is created with `printf '%s\n' foobar a b c d > source.txt`. That is, should the output be the equivalent of `printf "%s\n" foobara bc d`, or `printf "%s\n" foobarabcd`, or what? — agc, Mar 09 '18 at 14:54

Ed Morton · Answer 1 · 2018-03-08T23:48:41.973

4

As in life, in software it's much easier to do things based on what has happened rather than what will happen. Don't think of any problem as needing to do X if the NEXT line contains Y; think of it as needing to do Z if the CURRENT line contains Y, and then the solution is usually simple and obvious, e.g.:

$ cat tst.awk
NR>1{ printf "%s%s", prev, (length() < 5 ? "" : ORS) }
{ prev = $0 }
END{ print prev }

$ awk -f tst.awk file
hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

In the above we print a newline if the CURRENT line length is 5 or more. It's clear and simple and will work with any awk in any shell on any UNIX box.
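The same idea can be wrapped in a small shell function so the threshold is not hard-coded. This is only a sketch: the function name `join_short` and the variable `n` are made up for illustration, and the `if (NR)` guard (which prints nothing for empty input) is an addition that is not in the script above.

```shell
# join_short N [FILE...]: append every line shorter than N characters
# to the line before it. Reads standard input when no FILE is given.
join_short() {
    n=$1; shift
    awk -v n="$n" '
        # Once the current line is known, finish the previous one:
        # no separator if the current line is short, a newline otherwise.
        NR > 1 { printf "%s%s", prev, (length($0) < n ? "" : ORS) }
               { prev = $0 }
        # Flush the last buffered line, unless the input was empty.
        END    { if (NR) print prev }
    ' "$@"
}
```

With this approach, consecutive short lines are all appended to the last long line, so `printf '%s\n' foobar a b c d | join_short 5` prints `foobarabcd`.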

edited Mar 08 '18 at 23:48

answered Mar 08 '18 at 22:18


1

I completely agree with your intro, but I wonder if it is better captured with `{printf "%s%s", (length >= 5 ? sep : ""), $0; sep = ORS}END{printf "%s",sep}` :) Or even `length >= 5 { printf "%s", sep } { printf "%s", $0; sep = ORS; } END { printf "%s", sep }` – rici Mar 08 '18 at 22:42

wolfrevokcats · Accepted Answer · 2018-03-08T23:44:17.767

perl -p0777e "s{\r?\n(?=.{0,5}$)}{}mg" test.txt

output

hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

[ Well it took me 2 minutes to write the one-liner and about an hour to explain. ]

Here's the explanation:

Switches:

-p - read every line of the input files, run the code specified by -e for each line, and print the variable $_ (which is modified by the -e code)

-0[octal number] - input line separator; if we specify 0777 the whole file will be considered a line and read at once

-l - strip the trailing \n from input lines and set the output line separator equal to the input line separator. (I removed it, because it's actually not needed here)

Now the regular expression:

s{\r?\n(?=.{0,5}$)}{}mg

s{pattern}{replacement} - search for pattern in variable $_ and replace it with replacement

pattern parts:

\r?\n - match every newline symbol. For Unix \n would be enough, \r? - optional match of CR that may be necessary for old perl versions under Windows. Actually I think \r? can be removed too.

(?=pattern) - a positive look-ahead match of pattern, a zero width match, that is it does not consume the characters.

.{0,5}$ - match from zero to five characters followed by the end of a line ($)

s{}{} operator modifiers: m - multiline matching, makes $ match just before every \n in the text, not only at the end of the string. g - global matching, replace every occurrence in the text.

Finally, how it all works:

Perl slurps the whole file (-0777) and, because of -p, runs the substitution on it once and prints the result. The substitution searches for every occurrence of \r?\n that is followed by no more than 5 non-newline characters and then the end of a line: (?=.{0,5}$).
Every occurrence is replaced by the empty string {}.

I think I've been clear enough.

Additional information can be obtained from: perldoc perlre, perldoc perlop, perldoc perlrun.

You have it backwards. `\r?\n` is only useful on unix (to handle both unix and Windows files). `\n` will handle both unix and Windows files on Windows. — ikegami, Mar 11 '18 at 01:31
When I was writing about old versions of perl for Windows I actually meant cygwin perl, I just didn't recall that at once. I used cygwin perl quite a long time ago, so that's where it came from. And I've just checked: cygwin based perl still has that quirk. — wolfrevokcats, Mar 11 '18 at 02:06

ikegami · Answer 3 · 2018-03-09T06:05:53.867

If you want to avoid slurping and you want to look ahead, the general solution is to buffer as many lines as you want to look ahead. One in this case.

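perl -ne'
   chomp;
   if (defined $buf) {
      # The next line is now known, so the buffered line can be finished.
      if (length >= 5) {
         print "$buf\n";
      } else {
         print $buf;
      }
   }

   $buf = $_;   # Buffer only the current line.

   END { print "$buf\n" if defined $buf; }
'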

In this particular case, you can make do with the following:

perl -pe'chomp; print "\n" if length >= 5 && $. > 1; END { print "\n" if $. }'

Both of these solutions handle inputs that don't have a line feed on the last line.

See Specifying file to process to Perl one-liner for usage.

John1024 · Answer 4 · 2018-03-09T18:58:13.950

0

sed is also good for simple substitutions such as this:

$ sed -E ':a; N; s/\n(.{,4})$/\1/; ba' source
hi hi hi hi hibye
fun fun fun fun fun
batman
shirt shirt shirt
pants pants pantsbelt
paper paper paper

How it works:

:a

This defines a label a.

N

This reads in the next line and appends it (with a newline character) to the current contents of the pattern space.

s/\n(.{,4})$/\1/

If the line that was just appended is at most 4 characters long, this removes the newline in front of it.

ba

This jumps back to label a unconditionally, so the loop keeps appending lines until the input is exhausted.

BSD/macOS

The above was tested with GNU sed. For BSD/macOS sed, try:

sed -E -e :a -e N -e 's/\n(.{0,4})$/\1/' -e ba source

edited Mar 09 '18 at 18:58

answered Mar 08 '18 at 22:28


This doesn't actually work on my machine (I'm on a Mac). It just prints out the same text file that was input without removing any newlines. – Powers Mar 08 '18 at 22:45
@Powers Sorry about that. It works for me on Linux. I just added to the end of the answer code that I expect will work on a Mac. – John1024 Mar 08 '18 at 22:48
The OP says ` delete the newline character if the next line is *less* than 5 characters`. – potong Mar 09 '18 at 10:42
@potong Good eye! I just updated the answer for _less_ than 5 characters. – John1024 Mar 09 '18 at 18:58
By swapping the `ba` for `ta` and adding `P;D` the sed script could change from slurping the file to reading no more than 2 lines at a time. – potong Mar 10 '18 at 00:34
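potong's suggestion can be sketched as follows. The wrapper name `join_short_sed` and the variable `max` are made up for illustration. The `$!N` guard and the explicit `{0,N-1}` interval are assumptions added for portability between GNU and BSD sed, and they are not part of the comment.

```shell
# join_short_sed N [FILE...]: join short lines while holding at most
# two input lines in the pattern space at any time.
join_short_sed() {
    n=$1; shift
    max=$(( n - 1 ))   # "fewer than N" means at most N-1 characters
    # :a   loop label
    # $!N  append the next line, except on the last line of input
    # s    drop the newline if the appended line is short
    # ta   after a successful join, go back and look at the following line
    # P;D  print and delete the first line, keep the rest for the next cycle
    sed -E -e ':a' -e '$!N' -e 's/\n(.{0,'"$max"'})$/\1/' -e 'ta' -e 'P' -e 'D' "$@"
}
```

Called as `join_short_sed 5 source`, this should produce the same output as the slurping version on the sample file.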

score 0 · Answer 5 · answered Mar 09 '18 at 17:12

0

You can try this sed command (tested OK on OpenBSD):

sed -e '$b' -e 'N;/\n...../{P;D' -e '};y/\n/ /;s/ \([^ ]*$\)/\1/' infile

answered Mar 09 '18 at 17:12

ctac_

