How do I replace a certain amount of whitespace using regex?

Question

I'm using Calibre to convert a PDF to MOBI, but it has trouble interpreting space-indented code blocks. The blocks contain a lot of leading spaces, in widely varying amounts; some lines are indented by as many as 31 spaces.

Calibre allows for 3 regexes to do search and replace in the book before it's converted.

This is what I've tried.

\n( *) ( *)([a-zA-Z{};\*\/\(\)&#0-9])

Replace with:

\n\1&nbsp;\2\3

The problem is that it only replaces one of the spaces. I want them all replaced with the same number of &nbsp;.

I've also tried lazy versions of the first group etc.

Is this one of the cases where regular expressions are insufficient? I think the regex engine is Python's standard one.

score 2 · Accepted Answer · answered Mar 01 '12 at 02:39

If this were Perl you could replace `(\G|\n) ` with `$1&nbsp;`, and if it were a regex engine that allowed limited-width lookbehinds (instead of fixed-width lookbehinds like Python's) you could replace `(?<=\n {0,30}) ` with `&nbsp;` (note the trailing space in each pattern); but as it is, the only way I can think of is to replace something like `((?<=\n)|(?<=\n )|(?<=\n {2})|(?<=\n {3})|(?<=\n {4})|(?<=\n {5})|...|(?<=\n {30})) ` with `&nbsp;` . . . and I suspect that at that point you'll reach a limit on how long Calibre allows the input regex to be. :-/
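The long alternation does not have to be typed by hand. Here is a minimal Python sketch that builds it, assuming the text is processed in a standalone script rather than in Calibre's search-and-replace dialog. The names `MAX_INDENT` and `indent_to_nbsp` are made up for illustration.

```python
import re

MAX_INDENT = 31  # deepest indentation mentioned in the question

# One fixed-width lookbehind per possible number of spaces already behind
# the current one: (?<=\n), (?<=\n ), (?<=\n  ), ...
lookbehinds = "|".join("(?<=\\n" + " " * k + ")" for k in range(MAX_INDENT))
LEADING_SPACE = re.compile("(?:" + lookbehinds + ") ")

def indent_to_nbsp(text):
    """Replace every leading space that follows a newline with &nbsp;."""
    return LEADING_SPACE.sub("&nbsp;", text)

sample = "\n    if (x) {\n        y = 1;\n    }"
print(indent_to_nbsp(sample))
```

Spaces inside a line are left alone, and so is the indentation of a first line that has no newline in front of it.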

Another option is to take a completely different approach, and replace `  ` (two spaces) with `&nbsp; ` (non-breaking space + regular space), without bothering to restrict it to the beginning of a line. I'm guessing that that will satisfy your needs?
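To see what the two-space replacement does to runs of different lengths, here is a small Python sketch. The function name `pair_spaces_to_nbsp` is hypothetical, and plain `str.replace` stands in for Calibre's search and replace.

```python
def pair_spaces_to_nbsp(text):
    # Each non-overlapping pair of spaces, scanning left to right,
    # becomes a non-breaking space followed by a regular space.
    return text.replace("  ", "&nbsp; ")

print(pair_spaces_to_nbsp("    return x;"))  # run of four spaces
print(pair_spaces_to_nbsp("   return x;"))   # run of three spaces
```

A run of odd length leaves one unpaired space at the end, so two regular spaces end up next to each other and may be collapsed on rendering. That is probably one reason the result is readable rather than perfect.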

It looks better now. I'm never going to get a perfect result, but it's definitely readable, thanks :-) — Steinbitglis, Mar 01 '12 at 02:52

score 1 · Answer 2 · answered Mar 01 '12 at 02:31

\s{31} would match exactly 31 whitespace characters; \s{14,31} would match 14 to 31.

And what would I replace that with? – Steinbitglis Mar 01 '12 at 02:32

score 1 · Answer 3 · answered Mar 01 '12 at 02:53

Any reason not to just replace ALL spaces by non-breaking spaces? (r/ /&nbsp;/.)

It won't change the appearance of normal English text (except where the source had extraneous double spaces) and your code blocks will render correctly.

For fun, my attempt in Python:

>>> import re
>>> eight_spaces = "        hello world!"
>>> re.sub(r"^(|(?:&nbsp;)*)\s",r"\1&nbsp;",eight_spaces)
'&nbsp;       hello world!'

The idea is to replace one space at a time. It doesn't work because the re engine doesn't go back to the start of the line after a match - it consumes the string working left to right.

Note the alternation of (?:&nbsp;)* with the empty string, (|(?:&nbsp;)*), so that the capture group \1 always captures something (even the empty string.)
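Outside Calibre, plain Python can sidestep the one-space-at-a-time problem, because `re.sub` accepts a function as the replacement. The sketch below assumes a standalone script and uses a made-up function name, `nbsp_indent`. It matches the whole run of leading spaces at once and emits the same number of `&nbsp;`.

```python
import re

def nbsp_indent(text):
    # (?m) lets ^ match at the start of every line; " +" takes the whole
    # run of leading spaces, and the lambda returns one &nbsp; per space.
    return re.sub(r"(?m)^ +", lambda m: "&nbsp;" * len(m.group()), text)

eight_spaces = "        hello world!"
print(nbsp_indent(eight_spaces))
```

This does not help in a search-and-replace box that only takes a replacement string, but it shows that the limitation lies in the replacement syntax rather than in the matching.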

Li-aung Yip


I have not tested yet, but I suspect that the text would not flow like it should. It's non-breakable space after all. – Steinbitglis Mar 01 '12 at 02:55
If I recall correctly, non-breaking space will still flow across lines - the difference is that adjacent `&nbsp;` won't be compacted into one space. Correct me if I'm wrong. – Li-aung Yip Mar 01 '12 at 02:58
@Steinbitglis Ah, you're right - non-breaking space is indeed "non-line-breaking". ;) – Li-aung Yip Mar 01 '12 at 03:00
What may work well here is to add an additional step to this: after replacing all spaces with `&nbsp;`, replace all `&nbsp;` that are not preceded by a newline or `&nbsp;` with a space, something like the following: `s/(?<!\n)(?<!&nbsp;)&nbsp;/ /g`. This would preserve all of the indentation, while still having normal spaces between text (although consecutive spaces would become one space followed by some number of `&nbsp;`). – Andrew Clark Mar 01 '12 at 03:18
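The two-step idea from the last comment can be sketched in Python as follows. The function name `indent_only_nbsp` is hypothetical. The two negative lookbehinds are written one after the other, so that both conditions must hold before a `&nbsp;` is turned back into a space.

```python
import re

def indent_only_nbsp(text):
    # Step 1: turn every space into a non-breaking space.
    all_nbsp = text.replace(" ", "&nbsp;")
    # Step 2: turn a &nbsp; back into a regular space unless it directly
    # follows a newline or another &nbsp;.
    return re.sub(r"(?<!\n)(?<!&nbsp;)&nbsp;", " ", all_nbsp)

print(indent_only_nbsp("\n  foo bar  baz"))
```

Because the lookbehinds are checked against the text as it stood before step 2, indentation after a newline survives intact. A first line with no newline before it would have its first `&nbsp;` turned back into a regular space.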
