4

I'm using Calibre to convert a PDF to MOBI, but it has trouble interpreting space-indented code blocks. The blocks are indented by many different numbers of spaces; some lines are indented by as many as 31.

Calibre allows for 3 regexes to do search and replace in the book before it's converted.

This is what I've tried.

\n( *) ( *)([a-zA-Z{};\*\/\(\)&#0-9])

Replace with:

\n\1&nbsp;\2\3

The problem is that it only replaces one of the spaces. I want them all replaced with the same number of &nbsp; entities.

I've also tried lazy versions of the first group etc.

Is this one of the cases where regular expressions are insufficient? I believe Calibre uses Python's standard regex engine.

Steinbitglis

3 Answers

2

If this were Perl, you could replace (\G|\n) followed by a single space with $1&nbsp;. If it were a regex engine that allowed bounded-width lookbehinds (instead of only fixed-width lookbehinds, like Python's), you could replace (?<=\n {0,30}) followed by a single space with &nbsp;. As it is, the only way I can think of is to replace something like ((?<=\n)|(?<=\n )|(?<=\n {2})|(?<=\n {3})|(?<=\n {4})|(?<=\n {5})|...|(?<=\n {30})) followed by a single space with &nbsp;, and I suspect that at that point you'll reach a limit on how long Calibre allows the input regex to be. :-/
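The long alternation does not have to be typed by hand. Here is a minimal Python sketch that generates it, assuming a maximum indent of 31 spaces (the figure from the question). The names `build_indent_pattern` and `MAX_INDENT` are made up for illustration, and whether Calibre accepts a pattern of this length is untested.

```python
import re

MAX_INDENT = 31  # assumption: deepest indentation mentioned in the question


def build_indent_pattern(max_indent: int = MAX_INDENT) -> str:
    """Match one space that is preceded only by a newline and 0..max_indent-1 spaces."""
    # Each lookbehind has a fixed width on its own, which Python's re accepts.
    lookbehinds = "|".join(
        r"(?<=\n" + " " * k + ")" for k in range(max_indent)
    )
    # The trailing space is the character that actually gets replaced.
    return "(?:" + lookbehinds + ") "


pattern = build_indent_pattern()
sample = "int main() {\n    return 0;\n}"
print(re.sub(pattern, "&nbsp;", sample))
```

Only spaces in the leading run of a line are touched; the space inside "return 0" is left alone because it is not preceded by a newline plus spaces. The generated string could then be pasted into one of Calibre's search fields, if it fits.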

Another option is to take a completely different approach: replace every pair of consecutive spaces with &nbsp; followed by a regular space, without bothering to restrict it to the beginning of a line. I'm guessing that that will satisfy your needs?
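A quick way to see what the pair replacement produces is to try it on a sample line in Python. This is only an illustration of the idea; `replace_space_pairs` is a hypothetical helper name, not anything Calibre provides.

```python
def replace_space_pairs(text: str) -> str:
    # Every pair of consecutive spaces becomes "&nbsp;" plus one regular space.
    # Single spaces between words are left untouched.
    return text.replace("  ", "&nbsp; ")


print(replace_space_pairs("    x = 1"))   # four-space indent
print(replace_space_pairs("     y = 2"))  # five-space indent
```

With an odd number of spaces, the leftover space ends up next to another regular space, so under normal HTML whitespace collapsing an odd indent may render one column narrower than the source.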

ruakh
1

\s{31} matches exactly 31 whitespace characters; \s{14,31} matches anywhere from 14 to 31. Note that \s also matches tabs and newlines, not just spaces.

1

Any reason not to just replace ALL spaces by non-breaking spaces? (r/ /&nbsp;/.)

It won't change the appearance of normal English text (except where the source had extraneous double spaces), and your code blocks will render correctly.


For fun, my attempt in Python:

>>> import re
>>> eight_spaces = "        hello world!"
>>> re.sub(r"^(|(?:&nbsp;)*)\s",r"\1&nbsp;",eight_spaces)
'&nbsp;       hello world!'

The idea is to replace one space at a time. It doesn't work: only the first space is replaced, because the re engine doesn't go back to the start of the line after a match. It consumes the string working left to right, so ^ can never match again.

Note the alternation of (?:&nbsp;)* with the empty string, (|(?:&nbsp;)*), so that the capture group \1 always captures something (even the empty string.)
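For comparison, if the text can be preprocessed in a Python script rather than in Calibre's three search-and-replace slots (an assumption, since the question is limited to those slots), a replacement function sidesteps the problem. A minimal sketch, with `indent_to_nbsp` as a hypothetical helper name:

```python
import re


def indent_to_nbsp(text: str) -> str:
    # (?m) lets ^ match at the start of every line; " +" grabs the whole
    # run of leading spaces in one match.
    # The lambda emits one "&nbsp;" per matched space.
    return re.sub(r"(?m)^ +", lambda m: "&nbsp;" * len(m.group()), text)


eight_spaces = "        hello world!"
print(indent_to_nbsp(eight_spaces))
```

Because the whole indent is matched at once and the replacement is computed from its length, there is no need to revisit the start of the line.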

Li-aung Yip
  • I have not tested yet, but I suspect that the text would not flow as it should. It's a non-breaking space, after all. – Steinbitglis Mar 01 '12 at 02:55
  • If I recall correctly, non-breaking space will still flow across lines - the difference is that adjacent `&nbsp;` won't be compacted into one space. Correct me if I'm wrong. – Li-aung Yip Mar 01 '12 at 02:58
  • @Steinbitglis Ah, you're right - non-breaking space is indeed "non-line-breaking". ;) – Li-aung Yip Mar 01 '12 at 03:00
  • What may work well here is to add an additional step to this: after replacing all spaces with `&nbsp;`, replace all `&nbsp;` that are not preceded by a newline or `&nbsp;` with a space, something like the following: `s/((?<!\n)|(?<!&nbsp;))&nbsp;/ /g`. This would preserve all of the indentation, while still having normal spaces between text (although consecutive spaces would become one space followed by some number of `&nbsp;`). – Andrew Clark Mar 01 '12 at 03:18
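
The two-step idea from the last comment can be tried out in Python. This is a rough sketch, not a tested Calibre recipe: `two_step` is a hypothetical name, and the two negative lookbehinds are chained so that both conditions must hold (an alternation of two negative lookbehinds would succeed at every position).

```python
import re


def two_step(text: str) -> str:
    # Step 1: every space becomes a non-breaking space entity.
    all_nbsp = text.replace(" ", "&nbsp;")
    # Step 2: turn an entity back into a plain space unless it directly
    # follows a newline or another entity.
    return re.sub(r"(?<!\n)(?<!&nbsp;)&nbsp;", " ", all_nbsp)


sample = "if x:\n    return a  b"
print(two_step(sample))
```

Indentation after a newline survives as entities, single spaces between words come back as plain spaces, and a run of spaces inside a line becomes one plain space followed by entities. One caveat of this sketch: an indent on the very first line has no newline before it, so its first entity is turned back into a plain space.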