3

I have a large text file I would like to put on my ebook reader, but the formatting comes out all wrong: every line is hard-wrapped at or before column 80 with CR/LF, and paragraphs and headers are not marked differently; they too end in just a single CR/LF.

What I would like is to replace all CR/LF's after column 75 with a space. That would make most paragraphs continuous. (Not a perfect solution, but a lot better to read.)

Is it possible to do this with a regex? Preferably a (Linux) Perl or sed one-liner, alternatively a Notepad++ regex.

Olav

4 Answers

2
perl -p -e 's/\s+$//; $_ .= length() <= 75 ? qq{\n} : q{ }' book.txt

Perl's -p option means: for each input line, process and print. The processing code is supplied with the -e option. In this case: remove trailing whitespace and then attach either a newline or a space, depending on line length.
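For readers who would rather not use Perl, the same per-line logic can be sketched in Python. This is only an illustration of the idea above, not part of the original answer: the function name `join_long_lines` and the `threshold` parameter are made up for this example, and 75 is simply the column the question asks about.

```python
def join_long_lines(text: str, threshold: int = 75) -> str:
    """Mirror the one-liner: strip trailing whitespace from each line,
    then end it with a newline if it is short, or a space if it is long."""
    pieces = []
    # splitlines() accepts both CR/LF and LF line endings
    for line in text.splitlines():
        line = line.rstrip()
        # Lines of at most `threshold` characters keep their line break;
        # longer lines are assumed to be hard-wrapped and get a space.
        pieces.append(line + ("\n" if len(line) <= threshold else " "))
    return "".join(pieces)
```

To use it on a file, read the whole text (for example with `open("book.txt", newline="").read()`), pass it to the function, and write out the result.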

FMc
1

This seems to get pretty close:

sed '/^$/! {:a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta}' ebook.txt

It doesn't get the last line of a paragraph if it's shorter than 75 characters.

Edit:

This version should do it all:

sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g' ebook.txt

Edit 2:

If you want to re-wrap at word/sentence boundaries at a different width (here 65, but choose any value) to prevent words from being broken at the margin (or long lines from being truncated):

sed 's/^.\{0,74\}$/&\n/' ebook.txt | fmt -w 65 | sed '/^$/{N;s/\n//}'
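
As a rough illustration of the re-wrapping step only, here is a Python sketch using the standard `textwrap` module. It assumes each paragraph has already been joined into one long line; the function name `rewrap` is hypothetical, and the result may differ from `fmt` in details such as how hyphenated words are split.

```python
import textwrap


def rewrap(text: str, width: int = 65) -> str:
    """Re-wrap every line of `text` at word boundaries to at most `width` columns."""
    wrapped = []
    for line in text.split("\n"):
        # break_long_words=False keeps over-long words intact instead of
        # cutting them at the margin; an empty line stays empty.
        wrapped.append(textwrap.fill(line, width=width, break_long_words=False))
    return "\n".join(wrapped)
```

Because long words are never split, a single word longer than `width` will still produce a line that exceeds the margin.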

To change from DOS to Unix line endings, just add dos2unix to the beginning of any of the pipes above:

dos2unix < ebook.txt | sed '/^.\{0,74\}$/ b; :a;N;s/\(.\{75\}[^\n]*\)\n\(.\{75\}\)/\1 \2/;ta; s/\n/ /g'
Dennis Williamson
  • Working fine, but compared to the perl solution, didn't remove the DOS line endings (which I of course can remove with 'tr'), and it took a very long time, 10.2 seconds compared to 0.08 for perl. – Olav May 16 '10 at 17:19
1

Not really answering your question, but you can achieve this result in vim using this global join command. The v expands tabs into whitespace when determining line length, a feature that might be useful depending on your source text.

:g/\%>74v$\n/j
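
To make the tab-expansion point concrete, here is a small Python sketch of measuring a line's on-screen width rather than its character count. It is not a translation of the vim command; the name `display_width` is invented for this example, and a tab stop of 8 is assumed (vim's default `tabstop`).

```python
def display_width(line: str, tabstop: int = 8) -> int:
    """Return the width of `line` after expanding tabs to the next tab stop."""
    # expandtabs() replaces each tab with enough spaces to reach the
    # next multiple of `tabstop`, which is how the line appears on screen.
    return len(line.expandtabs(tabstop))
```

A line such as `"a\tb"` has three characters but occupies nine columns, so a length test based on characters alone would treat it as much shorter than it looks.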
blissapp
0

The less fancy option would be to replace the CR/LFs that appear by themselves on a line with a single LF or CR, then remove all the remaining CR/LFs. No need for fancy/complicated stuff.

regex 1: ^\r\n$ finds lone cr/lf's. It is then trivial to replace the remaining ones. See this question for help finding cr/lf's in np++.

zdav
  • Ah, but there are almost no CR/LFs by themselves. Many paragraphs are just short lines, where I want to keep the EOL. I chose column 75 because that will catch most multi-line wrapped paragraphs. I'll probably have to adjust the column number from file to file to get the optimal result. – Olav May 16 '10 at 17:29