
Some kinds of wiki formatting make it easy for users to avoid HTML: **bold** or //italic//, for example. What I am looking for is an efficient way to replace such formatting codes with HTML while preserving anything that is masked by ''. Example:

Replace **this** but do ''not touch **this**''

Doing this in multiple steps would be quite easy:

if (preg_match("/(''|\\*\\*)(.*?)\\1/", $s, $match)) {
  // $match[1] holds the delimiter, $match[2] the enclosed text
  if ($match[1] === "''") {
    // Do not touch, further replacements will follow
  } else {
    // Replace by HTML
  }
}
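
One possible way to make the step-by-step idea above concrete is a single preg_replace_callback() pass that decides per match whether to unmask or to convert. This is only a sketch: the function name wiki_callback(), the tag map, and the recursion used for nested codes are assumptions for illustration, and the per-match callback may well cost more than a plain preg_replace() call.

```php
<?php
// Hypothetical helper: converts **, __ and // and leaves ''...'' content untouched.
function wiki_callback(string $s): string
{
    // Assumed mapping from formatting code to HTML tag name.
    $tags = array('**' => 'strong', '__' => 'u', '//' => 'i');

    return preg_replace_callback(
        // Group 1 is the delimiter, group 2 the enclosed text.
        "~(''|\\*\\*|__|//)(.*?)\\1~",
        function (array $m) use ($tags): string {
            if ($m[1] === "''") {
                // Masked section: drop the mask, keep the content verbatim.
                return $m[2];
            }
            $tag = $tags[$m[1]];
            // Recurse so that nested codes (and nested masks) are handled too.
            return "<$tag>" . wiki_callback($m[2]) . "</$tag>";
        },
        $s
    );
}

echo wiki_callback("Replace **this** but do ''not touch **this**''"), "\n";
```

Because the mask is matched by the same pattern as the formatting codes, whichever delimiter opens first wins, so a mask nested inside a code is resolved by the recursive call.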

The PHP preg_replace() function is quite efficient at replacing multiple patterns: when passing arrays for pattern and replacement, I call it only once and avoid the calling overhead. Example:

preg_replace(
  array(
    '/\\*\\*(.*?)\\*\\*/',
    '/__(.*?)__/',
    '/\\/\\/(.*?)\\/\\/'
  ),
  array(
    '<strong>\\1</strong>',
    '<u>\\1</u>',
    '<i>\\1</i>'
  ),
  $s
);
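
To illustrate why this array form alone is not enough, here is a small runnable sketch that applies the same three patterns to the sample line from the beginning. The wrapper name wiki_naive() is hypothetical; the patterns and replacements are the ones from the call above.

```php
<?php
// Hypothetical wrapper around the array form of preg_replace() shown above.
function wiki_naive(string $s): string
{
    return preg_replace(
        array('/\*\*(.*?)\*\*/', '/__(.*?)__/', '/\/\/(.*?)\/\//'),
        array('<strong>\1</strong>', '<u>\1</u>', '<i>\1</i>'),
        $s
    );
}

echo wiki_naive("Replace **this** but do ''not touch **this**''"), "\n";
// Replace <strong>this</strong> but do ''not touch <strong>this</strong>''
```

The second **this** is converted as well, because none of the patterns knows about the '' mask.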

Btw.: This function will be called about 100 to 1000 times each time a dynamic page is created, hence my need for some efficiency.

So my question is: Is there a way to encode the masking in a regular expression + replacement that I can use with preg_replace() like in the latter example? Of course, nested formatting should remain possible.

What I found here is a way to remove stuff (Condition inside regex pattern), but I cannot apply this to my problem, because the replacement naturally leaves unwanted void tag-pairs:

preg_replace(
  array(
    '/(\'\'(.*?)\'\')|(__(.*?)__)/',
    '/(\'\'(.*?)\'\')|(\\*\\*(.*?)\\*\\*)/',
    '/\'\'(.*?)\'\'/'
  ),
  array(
    '\\1<u>\\4</u>',
    '\\1<strong>\\4</strong>',
    '\\1'
  ),
  $s
);

// Leaves a void <u></u> and <strong></strong> for each masked section

Note: The '' must survive each replacement except the last one, or sections would be unmasked too early. Hence the \1 in the replacement.
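
A possible way to keep the mask intact until the last pattern without producing empty tags is to use PCRE's backtracking control verbs: each pattern first matches a masked section and discards it with (*SKIP)(*F), so only unmasked codes ever reach the replacement. The following is a minimal sketch, assuming the PCRE version bundled with PHP supports these verbs; wiki_to_html() is a hypothetical name.

```php
<?php
// Hypothetical helper: one preg_replace() call with array patterns.
function wiki_to_html(string $s): string
{
    // Matches a masked ''...'' section, then forces a failure and skips past it,
    // so the alternative after "|" is only tried on unmasked text.
    $mask = "''.*?''(*SKIP)(*F)|";

    return preg_replace(
        array(
            "~{$mask}\\*\\*(.*?)\\*\\*~",
            "~{$mask}__(.*?)__~",
            "~{$mask}//(.*?)//~",
            "~''(.*?)''~", // last step: remove the mask itself
        ),
        array(
            '<strong>$1</strong>',
            '<u>$1</u>',
            '<i>$1</i>',
            '$1',
        ),
        $s
    );
}

echo wiki_to_html("make //italic but ''**not bold**''//"), "\n";
// make <i>italic but **not bold**</i>
```

One limitation of this sketch: if a masked section contains the closing code of a surrounding format (for example // inside '' inside //...//), the outer match still ends at that inner code.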

Of course, I could strip all the empty tags at the end, but this seems rather stupid to me. And I am quite sure I am just missing the obvious...

Thanks for any suggestions!

BurninLeo
  • How about detecting the position of `''` using `strpos`, taking the substring from 0 to that position, running the regex replacement on it, and then concatenating the strings? – Leri Dec 20 '12 at 14:22
  • That gets rather nasty if you can have multiple ''-masks with unmasked content before, between, and after them. It also causes trouble as soon as '' is nested within another code, e.g. ``make //italic but ''**not bold**''//`` - here I would have to do the strpos() searching multiple times. Therefore my search for a regex solution. – BurninLeo Dec 20 '12 at 14:35
  • PS: Another solution I thought about but have discarded is to collect all masked content, store it in an array, do the replacements, and restore the masked content afterwards. – BurninLeo Dec 20 '12 at 14:38
  • This sounds a bit like [parsing HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) and [converting wiki markup to HTML](http://stackoverflow.com/questions/45991/whats-the-easiest-way-to-convert-wiki-markup-to-html). Consider using something like a Markdown library that can parse the wiki markup and return the HTML for you. Using regular expressions could work for a defined subset of wiki markup, but if you want to support arbitrary wiki markup, then you'll run into problems. – Anson Dec 20 '12 at 16:36
  • Hi! Thanks for the links. I am fully aware that parsing is no job for regex, and I would never try it. My subset, however, is sufficiently small: **, __, // and --, that's it. As long as I don't need to allow these formatters inside masked sections, regex would be perfect (see examples above). Regex would also be quite helpful if I had the CPU time to run it in loops (see examples), but efficiency is the point here, which is why I'd like to find a single regular expression solution. – BurninLeo Dec 20 '12 at 19:22
  • Have you considered caching the processed HTML? Maybe processing it when the info is saved (so add an additional field in your database for the processed HTML, for example). This means you only process the string when it is saved, and not on every request. – Gareth Cornish Jan 26 '13 at 12:42
  • Hmm - that actually is a very good idea. I did not think about this and it may require some modifications, but it is really efficient in my case. Thank you! – BurninLeo Jan 30 '13 at 20:55
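
The placeholder idea mentioned in the comments above (collect the masked content, do the replacements, restore the content afterwards) could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, and the control character U+001A is assumed never to occur in the input, so it can serve as a placeholder delimiter.

```php
<?php
// Hypothetical helper: protects ''...'' sections with numbered placeholders.
function wiki_to_html_with_placeholders(string $s): string
{
    $masked = array();

    // Step 1: move every masked section into $masked, leave "\x1A<index>\x1A" behind.
    $s = preg_replace_callback(
        "~''(.*?)''~",
        function (array $m) use (&$masked): string {
            $masked[] = $m[1];
            return "\x1A" . (count($masked) - 1) . "\x1A";
        },
        $s
    );

    // Step 2: ordinary replacements; the placeholders contain no formatting codes.
    $s = preg_replace(
        array('~\*\*(.*?)\*\*~', '~__(.*?)__~', '~//(.*?)//~'),
        array('<strong>$1</strong>', '<u>$1</u>', '<i>$1</i>'),
        $s
    );

    // Step 3: put the masked content back, verbatim.
    return preg_replace_callback(
        '~\x1A(\d+)\x1A~',
        function (array $m) use ($masked): string {
            return $masked[(int) $m[1]];
        },
        $s
    );
}

echo wiki_to_html_with_placeholders("make //italic but ''**not bold**''//"), "\n";
// make <i>italic but **not bold**</i>
```

This needs three passes instead of one, but a masked section can then contain any formatting code without disturbing the surrounding format.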

0 Answers