3

I'm writing an application for my client that uses a WYSIWYG editor to let employees modify a letter template. The template contains certain variables that get replaced with information about the customer the letter is written for.

The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.

Here's my issue. The PDF generation class can translate the b, u, and i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be a regex that takes the contents of each blockquote HTML block and indents each line within the block by five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting and whatnot).

But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.

Here are the gotchas:

  • String may or may not contain a blockquote block
  • String could contain multiple blockquotes
  • String could contain potentially any level of nesting of blockquotes blocks
  • We can rely on the HTML being properly formed

A sample input string would look something like this:

Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.

To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.

  • Seems more practical to modify the PDF creator than mickey-mouse around the problem with RegEx. Is this a closed-source library you're using, or something modifiable? – Brad Christie Jan 31 '11 at 16:45
  • @Brad It's this class -> http://www.ros.co.nz/pdf/ <- I originally used it to generate mailing labels for customers but am extending it to generating welcome letters for customers as well. – WhiskeyTangoFoxtrot Jan 31 '11 at 16:48
  • You do realize "blockquote indent" !== "indent 5 spaces at the beginning of the line", right? – timdream Jan 31 '11 at 16:54
  • Bah http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454.textContent – Hello71 Jan 31 '11 at 16:59
  • While I realize it's not as easy as "5 spaces", this is perhaps the simplest way to simulate a blockquote in the PDF document as it prints out each space (unlike how HTML handles spaces). The amount of spaces I am going to use to simulate the effect is probably going to be tweaked or ultimately turned into using \t characters. I'm still in an experimenting phase and just need to get it "to work" – WhiskeyTangoFoxtrot Jan 31 '11 at 17:02
  • @Foxtrot: funny, I downloaded that library and tried it, and even their "readme.php", which is supposed to be a straightforward implementation of the library, bombs out with "undefined index" errors. – Brad Christie Jan 31 '11 at 17:43

3 Answers

4

You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
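As a rough illustration of the DOM route, here is a minimal sketch. Note the swap: it uses PHP's built-in DOM extension (DOMDocument) rather than PHP Simple HTML DOM Parser, purely because it ships with PHP. The function names renderNode and htmlToIndentedText are made up for this example, and it assumes ASCII input, no attributes worth keeping, and that breaks outside blockquotes should stay as `<br>`.

```php
<?php
// Walk a DOM subtree and emit text, indenting by five spaces per
// blockquote nesting level. Hypothetical helper, not part of any library.
function renderNode(DOMNode $node, $depth) {
    $indent = str_repeat('     ', $depth);
    $text = '';
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Note: entities arrive decoded here (e.g. &amp; becomes &).
            $text .= $child->nodeValue;
        } elseif ($child instanceof DOMElement) {
            if ($child->nodeName === 'br') {
                // Inside a blockquote: newline plus indent. Outside: keep the tag.
                $text .= ($depth > 0) ? "\n" . $indent : '<br>';
            } elseif ($child->nodeName === 'blockquote') {
                $text .= "\n" . str_repeat('     ', $depth + 1)
                       . renderNode($child, $depth + 1)
                       . "\n" . $indent;
            } else {
                // b, u, i and anything else: keep the tag, process the contents.
                $text .= '<' . $child->nodeName . '>'
                       . renderNode($child, $depth)
                       . '</' . $child->nodeName . '>';
            }
        }
    }
    return $text;
}

function htmlToIndentedText($html) {
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // silence warnings about the fragment
    // Wrap the fragment in a div so it has a single, easy-to-find root.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    $root = $doc->getElementsByTagName('div')->item(0);
    return renderNode($root, 0);
}

echo htmlToIndentedText('Login:<blockquote>Username: u<br>Password: p</blockquote>Thank you.');
```

Nesting falls out of the recursion for free: each nested blockquote simply calls renderNode with a larger depth.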

Rosh Oxymoron
4
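~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~

You will need to run this regex recursively using preg_replace_callback:

// Group 1 captures the contents of one blockquote, nested blockquotes included.
// Other tags such as <br> are allowed inside; only blockquote tags are special.
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~';

// doIndent() is not defined here; it stands for whatever indents the given text.
function blockquoteCallback($matches) {
    // Process nested blockquotes first, then indent the whole block.
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}

$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);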

My regex assumes that there won't be any attributes on the blockquote or anywhere else.

(PS: I'll leave the "Use a DOM parser" comment to someone else.)
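For anyone who wants to try this end to end, here is a self-contained sketch of the same idea. Two things in it are assumptions rather than part of the answer: doIndent() is a hypothetical helper that splits on `<br>` and prefixes each line with five spaces, and the pattern is a recursive one of the same shape, written so that tags other than blockquote (such as `<br>`) may appear inside the block.

```php
<?php
// Recursive pattern: group 1 is everything between a matching
// <blockquote> ... </blockquote> pair, nested blockquotes included.
define('BLOCKQUOTE_PATTERN', '~<blockquote>((?:[^<]++|<(?!/?blockquote>)|(?R))*+)</blockquote>~');

// Hypothetical helper: put the block on its own lines and prefix every
// line with five spaces. Lines that were already indented by an inner
// blockquote get five more, which produces the double indent.
function doIndent($content) {
    $lines = preg_split('~<br\s*/?>|\n~i', $content);
    return "\n     " . implode("\n     ", $lines) . "\n";
}

function blockquoteCallback($matches) {
    // Handle inner blockquotes first, then indent the whole block.
    $inner = preg_replace_callback(BLOCKQUOTE_PATTERN, __FUNCTION__, $matches[1]);
    return doIndent($inner);
}

$input  = 'Login:<blockquote>Username: u<br>Password: p</blockquote>Thank you.';
$output = preg_replace_callback(BLOCKQUOTE_PATTERN, 'blockquoteCallback', $input);
echo $output;
```

Breaks outside a blockquote are left untouched, since the callback only ever sees blockquote contents.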

NikiC
1

Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type-2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable. A simple way to explain this is to say that regular expressions can't keep a count, i.e. they can't count the nesting level.

What you need is a limited CFG (the paren-counting type). You need to somehow keep a count, maybe with a stack or a tree.
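To make the "keep a count" idea concrete, here is a minimal sketch of a scanner with a depth counter. It is only one possible reading of the suggestion: the function name indentBlockquotes and the $unit parameter are invented for the example, and it assumes well-formed input with no attributes on the tags.

```php
<?php
// Split the input into tags of interest and plain text, then walk the
// tokens once while tracking how deep inside blockquotes we are.
function indentBlockquotes($html, $unit = '     ') {
    $tokens = preg_split(
        '~(</?blockquote>|<br\s*/?>)~i',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    $depth = 0;   // current blockquote nesting level
    $out = '';
    foreach ($tokens as $token) {
        $tag = strtolower($token);
        if ($tag === '<blockquote>') {
            $depth++;
            $out .= "\n" . str_repeat($unit, $depth);
        } elseif ($tag === '</blockquote>') {
            $depth--;
            $out .= "\n" . str_repeat($unit, $depth);
        } elseif ($depth > 0 && preg_match('~^<br\s*/?>$~i', $token)) {
            // A break inside a blockquote: new line at the current indent.
            $out .= "\n" . str_repeat($unit, $depth);
        } else {
            // Plain text, or a <br> outside any blockquote: copy as-is.
            $out .= $token;
        }
    }
    return $out;
}

echo indentBlockquotes('Login:<blockquote>Username: u<br>Password: p</blockquote>Thank you.');
```

A single integer is enough here because only one kind of bracket is being matched; a stack would only be needed if different block types had to be told apart.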

vrdhn
  • The regex flavor PHP uses (PCRE) allows recursion in multiple ways: either with subroutine calls to capturing groups, `(?n)`, or with whole-pattern recursion, `(?R)`. – NikiC Jan 31 '11 at 16:56
  • @nikic By 'Type-2.5' I had PCRE in mind; I didn't know PHP had switched to it. However, the regexp looks hairy. Writing a function that recurses, tokenizing with <blockquote>, </blockquote>, and \n as separators and doing a simple scan, may be a more readable solution! – vrdhn Jan 31 '11 at 17:13