I want to extract only the numbers in an alphanumeric string in lines of HTML code.

Here is a sample:

<td>Simon</td>
<td>Lloyd</td>
<td>Masters</td>
<td>Jan</td>
<td>Dereham</td>
<td data-rating_seq="96">C+</td>
<td>Lorem ipsum dolor sit amet, consectetuer</td>
<td>GI73QEYV486124180989205</td>

Using regexr (an awesome tool by the way) I've found a solution to be:

<td>(.*)</td>\n.<td>[A-z]+(\d+)(?:(\d+)|[A-z]?)+(?=</td)

This is inconvenient because I want all of the digits grouped together.

I've also tried using the lookahead (?=) like this:

<td>(.*)</td>\n.<td>[A-z]+(?=\d)+?(?:(\d+)+|[A-z]?)+(?=</td)

But this misses the 73 at the front. I tried adjusting it to make a sort of (check if it's an alphanumeric) before capturing with (?=[\d|A-z]+<) but that didn't work.

My expression needs to:

  • capture the digits in the string in a single capture group
  • capture all of the digits
  • ensure that the capture group has at least 1 digit
  • only capture strings between <td> and </td>

Thus my expected match is the alphanumeric string between <td> and </td>:

GI73QEYV486124180989205

Note: disregard how I built my statement because I'm also trying to capture the string, but I don't have difficulty with that.

I'm trying to think it out, but I keep getting stumped because I am thinking about it like a program loop. I want to do it like this:

Pseudo code:

search for <td> tag
disregard any alpha characters after <td>, but before </td>
require at least one numeric char to be present
begin capture loop
capture all numeric
exclude letters
loop check until </td> tag

The problem is that I would need to make a reg expression group like:

(?:(\d+)+?|[A-z])+

but I need to somehow require and capture the numeric characters.

Andy Lester
Klik
  • [You can't parse (X)HTML with regex.](http://stackoverflow.com/a/1732454/1529630) – Oriol Jan 24 '15 at 00:29
  • @Oriol But OP isn't parsing HTML, just matching patterns in a document that happens to be HTML. – Reinstate Monica -- notmaynard Jan 24 '15 at 00:37
  • You haven't specified your language but I'm afraid whatever the regex flavor, you won't be able to skip characters to cherry-pick only a few in a single capture group with pure regex. As a side note beware of `(\d+)+` and [catastrophic backtracking](http://www.foo.be/docs/tpj/issues/vol1_2/tpj0102-0006.html). – Robin Jan 24 '15 at 00:44
  • Please show the expected matches from your sample. (How is this still a thing we have to ask for on regex questions?) – Evan Davis Jan 24 '15 at 00:46
  • **Warning: Do not use `[A-z]` in regexes.** It matches all the ASCII letters, but it can also match several punctuation characters whose code points happen to lie between `Z` and `a`. To match just the letters, either use `[A-Za-z]` or set the "ignore case" flag and use `[a-z]`. – Alan Moore Jan 24 '15 at 00:50
  • @Robin is right, there's no way to combine different parts of the string into a single capture group. The usual solution is to use a regexp to get everything between `<td>` and `</td>`, then use a regexp replace to **remove** all the characters you don't want from the string. – Barmar Jan 24 '15 at 00:52
  • What language are you using for that? – Casimir et Hippolyte Jan 24 '15 at 00:59
  • @CasimiretHippolyte Ok. This is embarrassing. I have to ask a really dumb question, but at least this is my first dumb question all day. What language are you talking about? I'm using English. – Klik Jan 24 '15 at 01:19
  • @Klik: I mean PHP, Javascript, Python, Ruby, another language? Because all languages have different regex implementations. – Casimir et Hippolyte Jan 24 '15 at 01:23
  • Oh (now strike that question from your mind), I'm doing the regexp in Sublime editor's (on Ubuntu) regexp replace feature. It's to make a small adjustment to a very large data table. – Klik Jan 24 '15 at 01:28
  • In this case you can try something like: `(?:\G(?!^)|<td[^>]*>(?=[^0-9<]*[0-9]))\K[^0-9<]*([0-9]+)` as pattern, with `\1` or `$1` as replacement. – Casimir et Hippolyte Jan 24 '15 at 01:36
  • That is matching only 4 results for some reason. I like your thinking though--I had the same idea to try to identify if it was alphanumeric using (?=) and then proceed. You gave me an idea as well. – Klik Jan 24 '15 at 01:44
  • If you have this amount of data, a better way is to load your file (for example) in PHP with DOMDocument, then target the td nodes and make a basic search/replace with `preg_replace`. However, if Sublime Text supports it you can try: `(?=[<0-9])(?:\G(?!^)|<td[^>]*>\K[^0-9<]*(*SKIP)(?=[0-9]))[^0-9<]*([0-9]+)[^<0-9]*` – Casimir et Hippolyte Jan 24 '15 at 01:47
  • I've done that already actually, I'm only still looking into this because I haven't found an answer that has satisfied me yet. – Klik Jan 24 '15 at 01:57
  • The first regex worked for me in Sublime Text 3 (build 3065, x64 Windows) but didn't match those ending with a letter. The second one didn't work in ST3. – AMDcze Jan 24 '15 at 09:40
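
Alan Moore's warning about `[A-z]` in the comments above can be checked directly. The following is a small Python sketch (Python's `re` module is an assumption here, since the question itself is about Sublime Text's search and replace) that lists the ASCII characters lying between `Z` and `a` and shows which character class accepts them:

```python
import re

# ASCII characters whose code points lie strictly between 'Z' and 'a'.
between = "".join(chr(c) for c in range(ord("Z") + 1, ord("a")))

# [A-z] spans the whole range from 'A' to 'z', so it accepts all of them.
loose = re.findall(r"[A-z]", between)
# [A-Za-z] names the two letter ranges explicitly and accepts none of them.
strict = re.findall(r"[A-Za-z]", between)

print(between)      # [\]^_`
print(len(loose))   # 6
print(strict)       # []
```

So a cell such as `AB_12` would pass a `[A-z]+` check even though `_` is not a letter.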

1 Answer

Short answer is no.


Here's Why

Because of the way regular expressions work, a single capture group cannot cherry-pick characters out of a string and combine them.

A regular expression can only return substrings that are actually present in the text. It cannot perform any operation on those substrings, because it is only a way of querying text. Whatever tools the regex flavor offers, each match or capture is a single contiguous piece of the search text.

Of course it is possible to return all of the digits separately in the alphanumeric string, but that is not what the question was asking.
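
To make the limitation concrete, here is a minimal Python sketch (Python's `re` module is assumed purely for illustration; the question was asked about Sublime Text). It shows that the digit runs come back as separate matches, and that a capture group placed inside a repeated group keeps only its last iteration rather than accumulating all of them:

```python
import re

cell = "<td>GI73QEYV486124180989205</td>"

# Each run of digits is its own contiguous match.
runs = re.findall(r"\d+", cell)
print(runs)  # ['73', '486124180989205']

# The group (\d+) is matched twice inside the repeated (?:...)+,
# but only the text of its final iteration is retained.
m = re.search(r"<td>(?:(\d+)|[A-Za-z])+</td>", cell)
last_only = m.group(1)
print(last_only)  # 486124180989205
```

The leading `73` is matched but never ends up in the group, which is the same behaviour described in the question.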


Some Alternative Solutions

As suggested by Barmar in the comments, one way to approach the problem is to first match the whole alphanumeric string between <td> and </td>, and then run a regular expression replace on that string to remove every character that is not a digit (this requires two separate regular expressions).
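
A minimal Python sketch of this two-expression approach follows. The choice of Python, the lookahead used to require at least one digit, and the variable names are all assumptions for illustration, not something taken from the comments:

```python
import re

html = """<td>Dereham</td>
<td data-rating_seq="96">C+</td>
<td>Lorem ipsum dolor sit amet, consectetuer</td>
<td>GI73QEYV486124180989205</td>"""

# First expression: cells made only of letters and digits, with the
# lookahead (?=[A-Za-z]*\d) requiring at least one digit.
cell_pattern = re.compile(r"<td>(?=[A-Za-z]*\d)([A-Za-z\d]+)</td>")

# Second expression: delete every non-digit from each captured cell.
digits = [re.sub(r"\D", "", text) for text in cell_pattern.findall(html)]

print(digits)  # ['73486124180989205']
```

Cells without digits (`Dereham`) and cells with other characters (the `Lorem ipsum` text) are skipped by the first expression, so only the alphanumeric cell is reduced to its digits.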

Another way is to collect all of the digits and combine them after they have been collected. This can be done with a regular expression replace.
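
As a sketch of the collect-then-combine idea, the Python snippet below uses a replacement function, which is a feature of Python's `re.sub` and not of Sublime Text's replace dialog; the helper name `keep_digits` is hypothetical:

```python
import re

def keep_digits(match: re.Match) -> str:
    """Collect every run of digits in the cell and join them together."""
    joined = "".join(re.findall(r"\d+", match.group(1)))
    return f"<td>{joined}</td>"

# Only cells made of letters and digits, containing at least one digit.
cell_pattern = re.compile(r"<td>(?=[A-Za-z]*\d)([A-Za-z\d]+)</td>")

line = "<td>GI73QEYV486124180989205</td>"
result = cell_pattern.sub(keep_digits, line)
untouched = cell_pattern.sub(keep_digits, "<td>Simon</td>")

print(result)     # <td>73486124180989205</td>
print(untouched)  # <td>Simon</td>
```

The combining happens in ordinary code after the regex has done its matching, which is why it works for any number of digit runs.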


My Solution

What I ended up doing was a rather ugly regular expression replace statement. I found that at most there would be 2 separate continuous numeric strings mixed in with the alphanumeric strings and so I used the following (with case insensitive flags):

regExp: <td>(.*)</td>\n.<td>[A-Z]+(\d+)(?:(\d+)|[A-Z])+</td>
replace: <td data-status_seq="$2$3">$1</td> 
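
For anyone who wants to try the same replace outside the editor, here is a rough Python translation. It rests on two assumptions: the second row is indented by exactly one character (a tab here), which is what the lone `.` after `\n` consumes, and Python's `\g<n>` syntax stands in for the `$n` references above:

```python
import re

# Assumed input: one character of indentation before the second <td>.
rows = (
    "<td>Lorem ipsum dolor sit amet, consectetuer</td>\n"
    "\t<td>GI73QEYV486124180989205</td>"
)

pattern = re.compile(
    r"<td>(.*)</td>\n.<td>[A-Z]+(\d+)(?:(\d+)|[A-Z])+</td>",
    re.IGNORECASE,  # the case-insensitive flag mentioned above
)

# Group 2 is the first digit run, group 3 the last one seen in the loop.
result = pattern.sub(r'<td data-status_seq="\g<2>\g<3>">\g<1></td>', rows)

print(result)
# <td data-status_seq="73486124180989205">Lorem ipsum dolor sit amet, consectetuer</td>
```

Because group 3 keeps only its last iteration, this only works while a cell holds at most two digit runs, which was true for this data set.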
Klik