
I am trying to use a regex to match HTML entities in the range &#1488; through &#1517;. I want to surround any series of those characters, including any whitespace between them, with a

 <div>(match)</div>

So far I have

 (\&\#[1][5|4][0-9][0-9]\;\s*)

But this returns multiple match groups, which means each character will have a <div> around it. I want the entire group to have one div before and a close div after.

How can this be done with regex?

Andy Lester
danielb
  • Never, ever use regex to parse HTML. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Bobby Russell Nov 10 '15 at 21:49
  • Change to [4-5][0-9]{2}; and manually validate the number. Easy peasy lemon squeezy. – Darth Egregious Nov 10 '15 at 21:50
  • @Fuser97381 that does not work. I need it to select all in one group... – danielb Nov 10 '15 at 21:58
  • What alternative to regex is there for coloring hebrew text red in a long string with alternating hebrew and english, if the hebrew is encoded in html entities (not in my control)? – danielb Nov 10 '15 at 22:25
  • @danielb If that is your use case, you should consider using `span` tags instead of `div`. – Mike Brant Nov 10 '15 at 22:42
  • I'm voting to close this question as off-topic because [use regex to parse HTML really is a bad idea.](http://stackoverflow.com/a/1732454/5299236) – Remi Guan Nov 11 '15 at 00:14
  • @BobbyRussell it doesn't seem like you read past the 8th word here. I am not using regex to parse HTML. I am using it to match certain characters which all match the pattern &#xxxx; where xxxx is a number range -- regex seems to be perfectly suited for this task. Y'all are so quick to jump to close/downvote, I really don't think many of you even bothered to read the whole question... :/ – danielb Nov 11 '15 at 09:08
  • @KevinGuan it doesn't seem like you read past the 8th word here. I am not using regex to parse HTML. I am using it to match certain characters which all match the pattern &#xxxx; where xxxx is a number range -- regex seems to be perfectly suited for this task. Y'all are so quick to jump to close/downvote, I really don't think many of you even bothered to read the whole question... :/ – danielb Nov 11 '15 at 09:09
  • @BobbyRussell and others: don't mindlessly parrot back this "don't parse html using regex" without thinking. This question is not asking to parse HTML, it's asking to extract HTML entities from a block of text. This is actually a perfect use case for regex. Think first. – Darth Egregious Nov 11 '15 at 15:07
  • Sorry, jumped too quickly to answer this one – Bobby Russell Nov 11 '15 at 16:39

3 Answers

1

If you want to match a group of these entities separated by optional whitespace:

&#(?:148[8-9]|149\d|150\d|151[0-7]);(?:\s*&#(?:148[8-9]|149\d|150\d|151[0-7]);)*

Result for Hello &#1488;&#1489; World

 **  Grp 0 -  ( pos 6 , len 14 ) 
&#1488;&#1489;  

Formatted:

 &\#  
 (?:
      148 [8-9] 
   |  149 \d 
   |  150 \d 
   |  151 [0-7] 
 )
 ;
 (?:
      \s* 
      &\#  
      (?:
           148 [8-9] 
        |  149 \d 
        |  150 \d 
        |  151 [0-7] 
      )
      ;
 )*
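
To tie this back to the question, here is a minimal PHP sketch of how the pattern above could be used to wrap each run in a single div. The variable names and the sample input are my own, and `$0` in the replacement stands for the whole match, so no capture group is needed.

```php
<?php
// One alternation for the numeric part of &#1488; through &#1517;.
$entity = '&#(?:148[89]|149\d|150\d|151[0-7]);';

// A run is one entity followed by any number of (optional whitespace + entity).
$pattern = '/' . $entity . '(?:\s*' . $entity . ')*/';

// Sample input (an assumption, not from the question).
$input  = 'Hello &#1488;&#1489; &#1490; World';

// $0 is the entire match, i.e. the whole run of entities.
$output = preg_replace($pattern, '<div>$0</div>', $input);

echo $output, PHP_EOL;
// Hello <div>&#1488;&#1489; &#1490;</div> World
```

Because the whitespace is only consumed when another entity follows it, the space before "World" stays outside the div.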
0

A regex to match that range (with unlimited trailing whitespace) might look like:

/(&#1(48[8-9]|49[0-9]|50[0-9]|51[0-7]);\s*)/g

Or the shorter (but not as easily read IMO):

/(&#1(48[8-9]|(49|50)[0-9]|51[0-7]);\s*)/g


In PHP (flagged as your language), you would find all matches with preg_match_all() rather than with a g pattern modifier, which PHP does not support. For a replacement, PHP's preg_replace() automatically operates in global mode, up to the number of replacements given by its 4th parameter ($limit), if specified.
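
As a rough illustration of those two functions (the sample string and variable names are made up for this sketch), something like the following shows that no g modifier is involved and how the optional limit argument of preg_replace() behaves:

```php
<?php
// Sample input (an assumption) with two entities from the range.
$subject = 'a &#1488; b &#1489; c';
$regex   = '/&#1(?:48[89]|49[0-9]|50[0-9]|51[0-7]);/';

// preg_match_all() collects every match; no "g" modifier is needed.
$count = preg_match_all($regex, $subject, $matches);
// $count is 2 and $matches[0] is ['&#1488;', '&#1489;']

// preg_replace() replaces every match by default ...
$all = preg_replace($regex, 'X', $subject);
// $all is 'a X b X c'

// ... unless the optional fourth argument ($limit) caps the count.
$first_only = preg_replace($regex, 'X', $subject, 1);
// $first_only is 'a X b &#1489; c'
```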

So code for regex replacement in PHP might look like:

$string = 'Hello &#1488;&#1489; World';
$regex = '/(&#1(48[8-9]|49[0-9]|50[0-9]|51[0-7]);\s*)/';
$replacement = '<div>$1</div>';
$string_with_divs = preg_replace($regex, $replacement, $string);

Edit: To match one or more consecutive occurrences of this pattern and to put a single div wrapper around them all, you would just need to modify the pattern as follows:

$regex = '/((&#1(48[8-9]|49[0-9]|50[0-9]|51[0-7]);\s*)+)/';
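
For comparison, a small sketch of what the two patterns produce on the sample string. The variable names are mine, and the inner groups are written as non-capturing purely for brevity.

```php
<?php
$string = 'Hello &#1488;&#1489; World';

// One entity from the range plus any trailing whitespace.
$entity = '&#1(?:48[89]|49[0-9]|50[0-9]|51[0-7]);\s*';

// Original pattern: every entity is its own match, so each gets a div.
$per_entity = preg_replace('/(' . $entity . ')/', '<div>$1</div>', $string);
// Hello <div>&#1488;</div><div>&#1489; </div>World

// Edited pattern: "+" makes the whole run a single match with one div.
$per_run = preg_replace('/((?:' . $entity . ')+)/', '<div>$1</div>', $string);
// Hello <div>&#1488;&#1489; </div>World

echo $per_entity, PHP_EOL, $per_run, PHP_EOL;
```

Note that the trailing `\s*` pulls the whitespace after the last entity inside the wrapper. That may or may not matter for your markup.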
Mike Brant
  • Doesn't work. Try it with Hello אב World – danielb Nov 10 '15 at 22:11
  • @danielb Didn't know your use case was a global search use case. I have modified the regex in my answer above. – Mike Brant Nov 10 '15 at 22:14
  • But this returns multiple match groups, which means each character will have a <div> around it. I want the entire group to have one div before and a close div after. – danielb Nov 11 '15 at 09:14
  • @danielb So you want the expression to match one or more instances of that pattern and to put a single div wrapper around all consecutive instances? I have added an edit to my answer to show the pattern to use in this case. – Mike Brant Nov 11 '15 at 14:46
0

If anyone runs into this, here's what I used:

$string = 'Hello &#1488;&#1489; World';
$regex = '/((?:&#1[4-5]\d\d\;\s*)+)/';
$replacement = "<span style='color:red'>$1</span>";
$str = preg_replace($regex, $replacement, $string);

https://regex101.com/r/lS4gK0/1
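
One caveat worth sketching: `[4-5]\d\d` accepts every entity from &#1400; to &#1599;, which is wider than the range in the question. If only &#1488; through &#1517; should be colored, a stricter variant could look roughly like this (the sample input and variable names are assumptions for illustration):

```php
<?php
// Pattern from the answer above: matches &#1400; through &#1599;.
$loose  = '/((?:&#1[4-5]\d\d;\s*)+)/';

// Stricter alternative limited to &#1488; through &#1517;.
$strict = '/((?:&#1(?:48[89]|49\d|50\d|51[0-7]);\s*)+)/';

$replacement = '<span style="color:red">$1</span>';

// &#1400; lies outside the requested range, &#1488; inside it.
$input = 'x &#1400; y &#1488; z';

$loose_out  = preg_replace($loose, $replacement, $input);
// x <span style="color:red">&#1400; </span>y <span style="color:red">&#1488; </span>z

$strict_out = preg_replace($strict, $replacement, $input);
// x &#1400; y <span style="color:red">&#1488; </span>z
```

Whether the wider match is a problem depends on what else can appear in the text, so the looser pattern may be perfectly fine in practice.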

danielb