6

I have a standard email from which I want to extract certain details.

The email contains lines like this:

<strong>Name:</strong> John Smith

So to simulate this I have the following JavaScript:

var str = "<br><strong>Name:</strong> John Smith<br>";
var re = /\<strong>Name\s*:\<\/strong>\s*([^\<]*)/g
match = re.exec(str);
while (match != null) {
    console.log(match[0]);
    match = re.exec(str);
}

This outputs only one result:

<strong>Name:</strong> John Smith

I was hoping to get just the contents of the capture group ([^\<]*), which in this example would be John Smith.

What am I missing here?

npinti
Graham
  • 1
    [Obligatory link](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454). – T.J. Crowder Aug 12 '19 at 12:10
  • 1
    I already found that "duplicate" answer and that's where I got my test script from – Graham Aug 12 '19 at 12:14
  • 2
    You needed to read down a bit further in the answer, where he says (hidden in a comment!): "capturing group n: match[n]". If I hadn't already answered this before realizing there had to be a dupetarget, I'd've added a comment for clarity, that's too hidden IMHO. Happy coding! – T.J. Crowder Aug 12 '19 at 12:19

2 Answers

5

In the array returned by exec, index 0 always holds the entire substring that was matched. Capture groups are numbered from 1 onwards, so to fix your issue simply replace match[0] with match[1].
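To make the indexing concrete, here is a minimal sketch using the string and pattern from the question. The variable names fullMatch and capturedName are only illustrative, and the g flag is dropped because a single exec call is enough to show the layout of the result.

```javascript
// Same sample string and pattern as in the question (without the g flag).
const str = "<br><strong>Name:</strong> John Smith<br>";
const re = /<strong>Name\s*:<\/strong>\s*([^<]*)/;

const match = re.exec(str);

// Index 0 is the whole matched substring; index 1 is the first capture group.
const fullMatch = match[0];
const capturedName = match[1];

console.log(fullMatch);    // <strong>Name:</strong> John Smith
console.log(capturedName); // John Smith
```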

That being said, since you are using JavaScript, it would be better to process the DOM itself and extract the text from there, as opposed to processing HTML with regular expressions.

npinti
4

Capture groups are provided in the match array starting at index 1:

var str = "<br><strong>Name:</strong> John Smith<br>";
var re = /<strong>Name\s*:<\/strong>\s*([^<]*)/g;
var match = re.exec(str); // declared with var so this also runs in strict mode
while (match != null) {
    console.log(match[1]); // <==== capture group 1
    match = re.exec(str);
}

Index 0 contains the whole match.

On modern JavaScript engines (ES2018 and later), you could also use named capture groups, written (?<theName>...), which you can access via match.groups.theName:

var str = "<br><strong>Name:</strong> John Smith<br>";
var re = /\<strong>Name\s*:\<\/strong>\s*(?<name>[^\<]*)/g;
// ---------------------------------------^^^^^^^
var match = re.exec(str); // declared with var so this also runs in strict mode
while (match != null) {
    console.log(match.groups.name); // <====
    match = re.exec(str);
}
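If the email has several such lines, the same idea might be extended along these lines. This is only a sketch: the Email line in the sample string, the group names label and value, and the fields object are assumptions made for illustration, not part of the question. It uses String.prototype.matchAll (ES2020) in place of the exec loop, which requires the g flag on the pattern.

```javascript
// Hypothetical sample: the "Email" line is an assumption added for illustration.
const email =
    "<br><strong>Name:</strong> John Smith<br>" +
    "<strong>Email:</strong> john@example.com<br>";

// "label" captures the field name, "value" captures the text up to the next tag.
const fieldRe = /<strong>(?<label>[^<:]+?)\s*:<\/strong>\s*(?<value>[^<]*)/g;

// Collect every label/value pair into a plain object.
const fields = {};
for (const m of email.matchAll(fieldRe)) {
    fields[m.groups.label] = m.groups.value.trim();
}

console.log(fields.Name);  // John Smith
console.log(fields.Email); // john@example.com
```

Each item yielded by matchAll has the same shape as an exec result, so m[0] is still the whole match and m[1], m[2] are the groups by position.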
T.J. Crowder