PHP Recognize paragraphs in Rich Text

Question

I've a rich text editor for news messages. The frontend shows one paragraph and the user can read the full message once the user clicks "read more".

However, this recognition is now done by <div></div> tags, while the editor works with <br> tags (two for a paragraph).

My current regex is:

"/<div>([^`]*?)<\/div>/is"

How can I extend this to also recognize two <br> tags right after each other? (Note: those br tags might contain attributes.)

I need to recognize `<div>...</div>` as a paragraph but also `...<br><br>...` as a paragraph. My current code is this: `preg_match("/<div>([^`]*?)<\/div>/is", $getContent, $matches);` so only when content is placed in a div tag is it recognized as a paragraph. I want to extend this to also recognize when two (or more) breaks are used. — IMarks, Feb 03 '17 at 10:27
Please see: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — iam-decoder, Feb 03 '17 at 10:32
I'm not looking to edit it; each element in my string wrapped by div tags gets placed in the matches array. I want the same thing to happen when two <br>'s are detected. — IMarks, Feb 03 '17 at 10:41
...And what about `
...
`? And what about `
...
`? What if someone uses `
`? Do you qualify lists (e.g. `
`) as "paragraphs"? How do you treat titles/subtitles, like `
...
`, `
...
`, `
...
`, etc? What about **nested** `
`s, potentially combined with any of the above? Think carefully about your requirements, and re-consider [this post](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — Tom Lord, Feb 03 '17 at 10:49
Good question,
and
are filtered tags, so they cannot pass through the system. Titles are not considered paragraphs; they're part of the paragraph. For
they are simply not used by the RTE, so there are two cases left: the dummy user, who uses the RTE style, and the expert user, who writes the markup by hand with the div tag. I'm aware of the many tiny differences in HTML, but that is handled by procedure: there will be two instruction sets, one for expert and one for dummy users. Any in-between user (who thinks "I can do it the expert way") is counted as a tiny risk we have to accept. — IMarks, Feb 03 '17 at 10:56
In addition, as said, the function is for a news module, so the users working with it are always instructed and supervised during their first uses. It's not publicly accessible. — IMarks, Feb 03 '17 at 10:57
What about `
`, `
`, `
`, `
`, `
`, `
`, `
`, `
`, `
`, ``, `
`, `
`, `
`, `` and `` tags? And you're not currently handling "nested paragraphs" at all - for example, what happens if someone writes `

` when *inside* a `
`? What about all the HTML variants, like `
`? — Tom Lord, Feb 03 '17 at 11:34
Those tags will not be used; they are page-markup elements, not content-markup. On the nested part: as far as I know, regex supports "the biggest collection" vs "the smallest collection" parameters. For me the smallest must count, but that must be tested. I'm able to edit a regex; writing one is just a bit too advanced for me. — IMarks, Feb 03 '17 at 11:38
tl;dr: Detecting "paragraphs" in HTML is tricky, especially with regex. If the requirements really are as simple as you make out above, then it can be done (and I can knock up a quick answer), but beware: Here be dragons! — Tom Lord, Feb 03 '17 at 11:39
I'm aware of that, but as mentioned, the assumption in this project is that the users are still guided, so the variations are minimal. — IMarks, Feb 03 '17 at 11:43

score 2 · Accepted Answer · edited May 23 '17 at 12:17

As discussed above, beware that using regex to parse HTML, especially for "complex" problems, is generally a bad idea. The following is not a perfect solution, but may be good enough for the simple requirements you've given above:

/(?<=<div>).*?(?=<\/div>)|(?<=<br>\s*<br>).*?(?=<div>|<br>\s*<br>)/is

The (?<=...) and (?=...) are lookbehinds and lookaheads, i.e. they assert that those sections of the pattern are present, but they are not included in the match result.
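To illustrate the difference, here is a small sketch comparing the capture-group style from the question with the lookaround style, using only the div half of the pattern. The input string and variable names are made up for the example.

```php
<?php
// Hypothetical input: two <div>-wrapped paragraphs.
$html = '<div>First paragraph</div><div>Second paragraph</div>';

// Capture-group version (as in the question):
// index 0 holds the full match with tags, index 1 holds the group without them.
preg_match_all('/<div>(.*?)<\/div>/is', $html, $withGroup);

// Lookaround version: the full match (index 0) already excludes the tags,
// because the lookbehind and lookahead only assert, they do not consume.
preg_match_all('/(?<=<div>).*?(?=<\/div>)/is', $html, $withLookaround);

print_r($withGroup[0]);      // ['<div>First paragraph</div>', '<div>Second paragraph</div>']
print_r($withLookaround[0]); // ['First paragraph', 'Second paragraph']
```

Either style gets at the same text; the lookaround form just saves you from indexing into a capture group.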

I have also used \s* to help catch scenarios where the user types something like:

<br>  <br>

Or:

<br>
<br>
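One caveat: as far as I know, PCRE only accepts lookbehinds of bounded length, so the `\s*` inside `(?<=<br>\s*<br>)` may be rejected when PHP compiles the pattern. A possible workaround for the br case is to split on the double break instead of looking behind it. The sketch below assumes a hypothetical helper named `splitParagraphs`; it also allows attributes and a self-closing slash on the br tags, which the question asked about. It does not handle the div case.

```php
<?php
// Hypothetical helper, not part of the original answer.
function splitParagraphs(string $html): array
{
    // Two or more <br> tags in a row, each optionally carrying attributes
    // or a self-closing slash, with optional whitespace after each one.
    $doubleBreak = '/(?:<br\b[^>]*>\s*){2,}/i';

    $parts = preg_split($doubleBreak, $html);

    // Trim each chunk and drop the empty ones.
    $parts = array_map('trim', $parts);
    $parts = array_filter($parts, static function (string $part): bool {
        return $part !== '';
    });

    return array_values($parts);
}

$content = "Intro text<br>\n<br class=\"x\" />Second block<br/><br/><br/>Third block";
print_r(splitParagraphs($content)); // ['Intro text', 'Second block', 'Third block']
```

A single `<br>` is left inside its chunk, so ordinary line breaks within a paragraph survive.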

...But as I say, this is still not a perfect solution. If you find the pattern gets too complex, then seriously consider using an XML parser instead. (Or, how about just letting the user enter new lines, and converting these into paragraphs for them? ... Or even, just use an existing WYSIHTML5 library, or a markdown library?)
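For the parser route, a minimal sketch using PHP's bundled DOMDocument class (this assumes the DOM extension is enabled, and the input string is made up) could look like this:

```php
<?php
// Hypothetical input; inline markup inside a paragraph is fine for a parser.
$html = '<div>First <b>paragraph</b></div><div>Second paragraph</div>';

$doc = new DOMDocument();

// Collect parser warnings internally instead of emitting them,
// since real-world editor output is rarely perfectly valid HTML.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$paragraphs = [];
// Note: getElementsByTagName also returns nested <div>s,
// so nested input would need extra filtering.
foreach ($doc->getElementsByTagName('div') as $div) {
    $paragraphs[] = trim($div->textContent);
}

print_r($paragraphs); // ['First paragraph', 'Second paragraph']
```

This only covers the div case and returns plain text rather than markup. Treating double br tags as paragraph boundaries would need additional tree walking, so it is a starting point rather than a drop-in replacement for the regex.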
