
I need to extract the number 5 in the brackets in this HTML code:

<td class="th-clr-cel th-clr-td th-clr-pad th-clr-cel-dis" style="width:226px; text-align:left; ">
<span class="th-tx th-tx-value th-tx-nowrap"  style="width:100&#x25;; "  title="Social&#x20;Insurance&#x20;Number&#x20;&#x28;SIN&#x29;" id="C29_W120_V121_builidnumber_table[5].type_text" f2="C;40">
    Social&#x20;Insurance&#x20;Number&#x20;&#x28;SIN&#x29;
</span>

This is just an extract of the whole HTML code and there is much more actual code before and after this sample. But one thing is for sure, the word "Insurance" only appears in this sample.

I managed to match whatever is between the 2 instances of "Social Insurance Number" with this regex:

((?<=Social&#x20;Insurance&#x20;Number)(.*)(?=Social&#x20;Insurance&#x20;Number))

Now I need to combine that and extract the number 5 within the square brackets.

Please note: the content of the brackets could be multiple characters (e.g. 15), but it will always be numeric.

Thank you

EDIT: The reason I want to use regex to parse HTML is because this is part of a JMeter script to run mass performance tests on a system with hundreds of concurrent users. Performance is a major factor here and an XML parser will consume more resources than regex.

Vincent L
  • What's the specific problem? The regex for a number contained within other characters? You sure this wouldn't be easier by parsing the HTML? – Dave Newton Sep 08 '21 at 16:24
  • Does `.*\[(\d+)\].*` work for you or am I missing something? – AaronJ Sep 08 '21 at 16:25
  • Like I said, there will be tons of code before and after the sample that I posted here. So there will be tons of other brackets and numbers. I need to extract the one that occurs between the 2 instances of the word "Insurance" – Vincent L Sep 08 '21 at 16:32
  • I feel like it's been said a million times, but don't use Regex to parse HTML – Mako212 Sep 08 '21 at 16:37
  • https://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not – Mako212 Sep 08 '21 at 16:37
  • The reason I want to use regex to parse HTML is because this is part of a JMeter script to run mass performance tests on a system with hundreds of concurrent users. Performance is a major factor here and an XML parser will consume more resources than regex. – Vincent L Sep 08 '21 at 16:47

3 Answers


This captures only the digits inside the square brackets that sit between the two occurrences of "Insurance":

Insurance(?:[\s\S]*)\[(\d+)\](?:[\s\S]*)Insurance

Demo: https://regex101.com/r/hwFB0Y/3
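
A quick way to sanity-check the pattern outside JMeter is a short Python script. This is only a minimal sketch: the `html` string is the sample from the question (with the `style` attribute dropped for brevity), and the two decoy cells `other[99]` and `later[7]` are invented here to show that brackets outside the "Insurance" pair are ignored. Python's `re` is not the engine JMeter uses, so treat it as a check of the pattern's logic rather than of JMeter's behaviour.

```python
import re

# Sample from the question, wrapped in two made-up decoy cells (assumption:
# "other[99]" and "later[7]" are not in the original HTML).
html = '''<td id="other[99]"></td>
<span class="th-tx th-tx-value th-tx-nowrap" title="Social&#x20;Insurance&#x20;Number&#x20;&#x28;SIN&#x29;" id="C29_W120_V121_builidnumber_table[5].type_text" f2="C;40">
    Social&#x20;Insurance&#x20;Number&#x20;&#x28;SIN&#x29;
</span>
<td id="later[7]"></td>'''

# Same pattern as above; group 1 holds the digits inside the brackets.
pattern = re.compile(r'Insurance(?:[\s\S]*)\[(\d+)\](?:[\s\S]*)Insurance')

match = pattern.search(html)
number = match.group(1) if match else None
print(number)
```

In a JMeter Regular Expression Extractor, the captured digits would typically be referenced through the template `$1$`.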

hitesh bedre
  • Using `(?:.|\n)*` is a bit inefficient due to the alternation. You can use `[\s\S]*` instead or if supported use an inline modifier `(?s)` to have the dot match a newline. Javascript supports `[^]*` – The fourth bird Sep 08 '21 at 17:26
  • Thank you @Thefourthbird!! Found this platform best only because of people like you. Learning++ – hitesh bedre Sep 08 '21 at 17:35
  • In the character class, the `|` does not mean OR, it means matching a pipe char. But `[\s|\S]*` is unnecessary as `\s` and `\S` already match everything including a pipe :-) – The fourth bird Sep 08 '21 at 17:38
  • Thank you once again!! You cleared some of my ongoing doubts. – hitesh bedre Sep 08 '21 at 17:45
  • Thank you both! Another good solution. I would have marked it as the answer, but the other one was posted before this one. – Vincent L Sep 08 '21 at 17:53
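
The points raised in the comments above can be checked with a small script. This is just an illustrative sketch in Python (the test string `"a\nb"` is made up, and other engines may differ in which of these constructs they support):

```python
import re

text = "a\nb"  # two letters separated by a newline

# Three ways to let "any character" include a newline.
alternation = re.fullmatch(r'(?:.|\n)*', text)  # works, but tries two branches per character
char_class = re.fullmatch(r'[\s\S]*', text)     # whitespace or non-whitespace covers everything
dotall = re.fullmatch(r'(?s).*', text)          # inline flag: the dot also matches a newline

# A plain dot stops at the newline, so it cannot cover the whole string.
plain_dot = re.fullmatch(r'.*', text)

# Inside a character class, | is a literal pipe, not alternation.
pipe_only = re.fullmatch(r'[|]', '|')

print(bool(alternation), bool(char_class), bool(dotall))
print(plain_dot is None, bool(pipe_only))
```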

Is that what you're looking for?

((?<=Social&#x20;Insurance&#x20;Number.*\[)(\d+)(?=\].*Social&#x20;Insurance&#x20;Number))

Gonnen Daube

Try this:

Insurance.*\[(\d+)\]

Or, if you want to match it only between the two "Insurance" words:

Insurance.*\[(\d+)\][\s\S]+?Insurance

Where

  • Insurance - Match the starting word "Insurance"
  • .* - Match any run of characters, greedily (by default the dot does not match newlines)
  • \[ - Match the opening bracket
  • (\d+) - Capture the numerical value inside brackets
  • \] - Match the closing bracket
  • [\s\S]+? - Match any character (including newlines) in a non-greedy way so that it wouldn't span across multiple "Insurance" words
  • Insurance - Match the ending word "Insurance"
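
The breakdown above can be written out as a commented pattern and tried on the sample. This is a rough sketch in Python, assuming a shortened copy of the span from the question as input (some attributes are dropped for brevity); it checks the pattern's logic only and is not a JMeter configuration.

```python
import re

# Shortened copy of the sample from the question (assumption: attributes
# other than title, id and f2 are omitted here).
html = '''<span title="Social&#x20;Insurance&#x20;Number" id="C29_W120_V121_builidnumber_table[5].type_text" f2="C;40">
    Social&#x20;Insurance&#x20;Number&#x20;&#x28;SIN&#x29;
</span>'''

# Second pattern, one piece per line, mirroring the list above.
between = re.compile(r'''
    Insurance     # starting word
    .*            # any characters on the same line, greedy
    \[            # opening bracket
    (\d+)         # capture the digits
    \]            # closing bracket
    [\s\S]+?      # lazily match anything, newlines included
    Insurance     # ending word
''', re.VERBOSE)

# First, shorter pattern: no check for the closing "Insurance".
short = re.compile(r'Insurance.*\[(\d+)\]')

between_number = between.search(html).group(1)
short_number = short.search(html).group(1)
print(between_number, short_number)
```

Because the greedy `.*` stays on one line, both patterns pick up the bracket in the `id` attribute here; if a line contained several bracketed numbers after "Insurance", the greedy match would settle on the last one on that line.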