For data-cleaning purposes, I need to move punctuation (commas and periods) that occurs right before certain closing tags (a, b, i, strong, em) to the other side of those closing tags.

For example, this bit of text:

<p>Lorem ipsum dolor sit <i>amet,</i> consectetur adipiscing elit.</p>

Should be transformed into this:

<p>Lorem ipsum dolor sit <i>amet</i>, consectetur adipiscing elit.</p>

If possible, the regex should also move spaces that occur just before the closing tag, though I imagine this could be accomplished by running preg_replace twice: once for spaces and again for punctuation. For instance, the first line below should become the second:

<p>Lorem ipsum dolor sit <i>amet, </i>consectetur adipiscing elit.</p>
<p>Lorem ipsum dolor sit <i>amet</i>, consectetur adipiscing elit.</p>
Illya Moskvin
  • Asking us to write it for you is not how to use this site. Good luck, and god-speed! I recommend you start by writing a regex that puts the characters you want to move in a capture group, and then moves them to the other side in the replacement string. – 4castle Sep 06 '16 at 06:07
  • Should it also handle a case such as ` amet,go, `? – RomanPerekhrest Sep 06 '16 at 06:07
  • @RomanPerekhrest: Good point. I think not – this question is primarily concerned with cleaning punctuation near the closing tag, so figuring out whether punctuation inside the tag should be followed by a space, or figuring out whether a space after the opening tag should be moved to the other side of that opening tag, seems to fall outside the current scope. – Illya Moskvin Sep 06 '16 at 06:12
  • Should the question title be edited to clarify this point? I tried to come up with a more accurate title, but everything I tried seemed too verbose. If someone has a better idea, please feel free to edit. – Illya Moskvin Sep 06 '16 at 06:15
  • @4castle: FWIW, I'm planning to answer this question myself :) – Illya Moskvin Sep 06 '16 at 06:16
  • @IllyaMoskvin Please mark one of the answers below as the accepted answer, so this question is deemed resolved. – mickmackusa Mar 29 '17 at 08:32

2 Answers

This method uses two capture groups: one captures the comma or period followed by zero or more spaces, the second captures the closing tag. preg_replace is used to reverse their order.

$string = '<p>Lorem ipsum dolor sit <i>amet, </i>consectetur adipiscing elit.</p>';
$pattern = '/([,.] *)(<\/(?:a|b|em|i|strong)>)/'; // no "g" flag: PCRE in PHP does not accept it, and preg_replace() already replaces every match
$replacement = '$2$1';

$result = preg_replace( $pattern, $replacement, $string );

Here is an online demo.
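
To make the swap concrete, here is a minimal sketch that wraps the same idea in a function and runs it on both inputs from the question. The function name `move_trailing_punctuation` is made up for this illustration, and the pattern carries no trailing flag because `preg_replace()` replaces every match by default.

```php
<?php
// Hypothetical helper (not from the original answer): moves a comma or period,
// plus any spaces that follow it, from just inside a closing tag to just outside.
function move_trailing_punctuation(string $html): string
{
    // Group 1: the punctuation and optional spaces. Group 2: the closing tag.
    $pattern = '/([,.] *)(<\/(?:a|b|em|i|strong)>)/';

    // Emit the closing tag first, then the punctuation.
    return preg_replace($pattern, '$2$1', $html);
}

$inputs = [
    '<p>Lorem ipsum dolor sit <i>amet,</i> consectetur adipiscing elit.</p>',
    '<p>Lorem ipsum dolor sit <i>amet, </i>consectetur adipiscing elit.</p>',
];

foreach ($inputs as $input) {
    echo move_trailing_punctuation($input), PHP_EOL;
}
// Both lines print:
// <p>Lorem ipsum dolor sit <i>amet</i>, consectetur adipiscing elit.</p>
```

A space that is not preceded by a comma or period stays where it is, because the character class requires punctuation first.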

Illya Moskvin
  • Fixed. Feel free to edit the answer if it is unsatisfactory. Thanks for the tip re: `\0`, that's neat! – Illya Moskvin Sep 06 '16 at 06:45
  • Looking good now :) It's always better to use a non-capturing group when possible because it executes faster, and doesn't mess with the capture groups in the match. – 4castle Sep 06 '16 at 06:51
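
The effect of a non-capturing group on group numbering can be seen with `preg_match()`. This is a small sketch with a sample string chosen for illustration; it only demonstrates the numbering, not any speed difference.

```php
<?php
$sample = '<i>amet,</i>';

// Capturing alternation: the tag name becomes an extra group 3.
preg_match('/([,.] *)(<\/(a|b|em|i|strong)>)/', $sample, $capturing);

// Non-capturing alternation: only groups 1 and 2 exist.
preg_match('/([,.] *)(<\/(?:a|b|em|i|strong)>)/', $sample, $nonCapturing);

print_r($capturing);    // [0] => ,</i>  [1] => ,  [2] => </i>  [3] => i
print_r($nonCapturing); // [0] => ,</i>  [1] => ,  [2] => </i>
```

Since the replacement string only needs `$1` and `$2`, the third group in the capturing version is never used.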
Ignoring all the issues about the horrors awaiting the regex parsing of HTML, this works for me:

$re = "/([\\W]+)(<\\/(a|b|em|i|strong)>)/"; 
$str = "<p>Lorem ipsum dolor sit <i>amet, </i>consectetur adipiscing elit.</p>"; 
$subst = "$2$1"; 

$result = preg_replace($re, $subst, $str);

You can check it out online here.
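
One thing to keep in mind with `\W+` is that it matches any run of non-word characters, not just commas, periods, and spaces. The sketch below uses made-up sample strings to show the question's example working, and a nested-tag input where the `>` of an inner closing tag gets moved as well.

```php
<?php
$re = '/(\W+)(<\/(a|b|em|i|strong)>)/';

// The example from the question works as intended.
echo preg_replace($re, '$2$1', '<i>amet, </i>consectetur'), PHP_EOL;
// <i>amet</i>, consectetur

// With nested tags, \W+ matches the ">" that ends "</i>" and moves it
// past "</b>", which breaks the markup.
echo preg_replace($re, '$2$1', '<b><i>amet</i></b>'), PHP_EOL;
// <b><i>amet</i</b>>

// A narrower class such as [,.] followed by spaces leaves that input alone.
echo preg_replace('/([,.] *)(<\/(?:a|b|em|i|strong)>)/', '$2$1', '<b><i>amet</i></b>'), PHP_EOL;
// <b><i>amet</i></b>
```

If only commas, periods, and spaces should move, listing them explicitly in the character class avoids this.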

Ken Y-N
  • No need to double escape everything: `([\W]+)(\<\/\b(a|b|em|i|strong)\b\>)`. Additionally, if you use another delimiter (e.g. `~`), your regex becomes even clearer: [**`~(\W+)(</\b(a|b|em|i|strong)\b>)~`**](https://regex101.com/r/hR2wY6/2) – Jan Sep 06 '16 at 06:39
  • The `\b` isn't needed, because you have character literals on both sides of those words which aren't words. – 4castle Sep 06 '16 at 06:47
  • The double escapes are coming from regex101's code generator. I've got rid of the `\b`s and a couple of other unnecessary escape characters however. – Ken Y-N Sep 06 '16 at 06:52
  • If you use single quotes for a regex in PHP, it will never have the issue of mistakenly using string escape sequences. Yeah, I'm not sure why the code generator does that :/ – 4castle Sep 06 '16 at 06:53
  • AFAIK it double-escapes backslashes within a double-quoted string so that a sequence such as `\3` is not read as the octal escape `\003` (the `End of Text` character) instead of a back-reference to the third capturing group. @4castle – revo Sep 06 '16 at 07:44
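
The quoting difference discussed in these comments can be checked directly. This is a small sketch of PHP's string-escape behavior and does not depend on any particular regex.

```php
<?php
// In double quotes, "\3" is an octal escape: one byte with value 3 (End of Text).
var_dump(strlen("\3"), ord("\3"));   // int(1) int(3)

// In single quotes, '\3' stays as two characters: a backslash and "3".
var_dump(strlen('\3'));              // int(2)

// Doubling the backslash in double quotes gives the same two characters.
var_dump("\\3" === '\3');            // bool(true)

// "\W" is not a recognized string escape, so the backslash survives either way.
var_dump("\W" === '\W');             // bool(true)
```

This suggests single-quoted patterns and replacement strings as the safer default, since only `\\` and `\'` are treated specially there.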