I want to extract only the digits from an alphanumeric string that appears in lines of HTML code.
Here is a sample:
<td>Simon</td>
<td>Lloyd</td>
<td>Masters</td>
<td>Jan</td>
<td>Dereham</td>
<td data-rating_seq="96">C+</td>
<td>Lorem ipsum dolor sit amet, consectetuer</td>
<td>GI73QEYV486124180989205</td>
Using regexr (an awesome tool, by the way), I came up with this expression:
<td>(.*)</td>\n.<td>[A-z]+(\d+)(?:(\d+)|[A-z]?)+(?=</td)
This is inconvenient because the digits end up split across separate capture groups, and I want all of them together in one group.
I've also tried using a lookahead, (?=...), like this:
<td>(.*)</td>\n.<td>[A-z]+(?=\d)+?(?:(\d+)+|[A-z]?)+(?=</td)
But this misses the 73 at the front. I tried adjusting it to check that the cell is alphanumeric before capturing, using (?=[\d|A-z]+<), but that didn't work.
My expression needs to:
- capture the digits in the string in a single capture group
- capture all of the digits
- ensure that the capture group has at least 1 digit
- only capture strings between <td> and </td>
Thus my expected match is the alphanumeric string between <td> and </td>:
GI73QEYV486124180989205
Note: disregard how I built my statement because I'm also trying to capture the string, but I don't have difficulty with that.
I'm trying to think it out, but I keep getting stumped because I am thinking about it like a program loop. I want to do it like this:
Pseudo code:
search for <td> tag
disregard any alpha characters after <td>, but before </td>
require at least one numeric char to be present
begin capture loop
    capture all numeric
    exclude letters
    loop check until </td> tag
The problem is that I would need to make a reg expression group like:
(?:(\d+)+?|[A-z])+
but I need to somehow require and capture the numeric characters.
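For reference, here is a minimal sketch of the loop above written as two steps instead of one expression. It is only an illustration under some assumptions: the question names no host language, so Python is used here, and the names CELL and digits_in_cells are made up for the example. In most regex flavors (including JavaScript, which regexr supports, and Python), a repeated capture group keeps only its last iteration, so non-adjacent digit runs cannot be collected into one group by the pattern alone. The sketch therefore matches the whole cell first, using a lookahead to require at least one digit, and strips the letters afterwards. It also uses [A-Za-z] rather than [A-z], because [A-z] additionally matches the characters [ \ ] ^ _ and the backtick.

```python
import re

# Hypothetical pattern: a plain <td> cell whose content is purely alphanumeric.
# The lookahead (?=[A-Za-z]*\d) requires at least one digit somewhere in the
# cell (letters may come first), without consuming anything.
CELL = re.compile(r"<td>(?=[A-Za-z]*\d)([A-Za-z\d]+)</td>")


def digits_in_cells(html):
    """Return (cell_text, digits_only) for each alphanumeric cell with a digit."""
    results = []
    for match in CELL.finditer(html):
        cell = match.group(1)
        # Second step: drop every non-digit character from the matched cell.
        results.append((cell, re.sub(r"\D", "", cell)))
    return results


sample = """<td>Simon</td>
<td data-rating_seq="96">C+</td>
<td>Lorem ipsum dolor sit amet, consectetuer</td>
<td>GI73QEYV486124180989205</td>"""

print(digits_in_cells(sample))
# [('GI73QEYV486124180989205', '73486124180989205')]
```

The cells without digits, the cell containing spaces, and the cell whose opening tag carries an attribute are all skipped, so only the last sample line produces a result. If the engine in use supports replacement callbacks or a second pass, the same split (match the cell, then remove non-digits) should carry over.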