I'm trying to write a regex match in Perl, but I'm not sure how to get it right. Basically, I'd like to extract the numbers from strings like the following two (which may or may not contain newlines):

                        <strong>
                    word
                        </strong>
                    </td><td align="right">
                            <strong>
                        65&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        5,000&nbsp;&nbsp;
                            </strong>
                        </td><td align="right">
                            <strong>
                        -&nbsp;&nbsp;
                            </strong>

<tr><td colspan="2">word</td><td align="right">65&nbsp;&nbsp;</td><td align="right">5,000&nbsp;&nbsp;</td><td align="right">-&nbsp;&nbsp;</td></tr>

So for the above two strings, I'd like to match: 65; 5,000; and - (which means 0).

user1373317
  • Your title is misleading. "matchin a regex" is not the same as "matching html with regex". That said, use an HTML parser. – HamZa Jul 22 '14 at 22:24
  • Regular expressions are the wrong tool for this. You should really use an HTML parser. – friedo Jul 22 '14 at 22:27
  • Use an HTML Parser like [`Mojo::DOM`](https://metacpan.org/pod/Mojo::DOM). Regex is not the tool for parsing html like this. – Miller Jul 22 '14 at 22:27
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – Jim Garrison Jul 22 '14 at 23:14
  • But what if every single HTML page looks the same, and I'm looking for just this regex? – user1373317 Jul 23 '14 at 19:29

2 Answers

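The regular expression you are looking for is the following:

/(\d+(?:,\d+)*|-)/g

The "g" modifier returns every match rather than only the first one. The pattern contains no ".", so newlines inside the string do not get in the way, and the "s" modifier (which only lets "." match a newline) is not needed.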

However, I agree with HamZa that you should really use an HTML parser. The "-" sign in particular is very likely to appear somewhere else in the HTML, and so are digits: the colspan="2" in your second string already matches. You could extend the regular expression so that a value only counts when it is followed by &nbsp;:

/(\d+(?:,\d+)*|-)(?:&nbsp;)+/g

... but then it already starts to get ugly.
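
As a rough illustration of why the &nbsp; anchor matters, here is a minimal sketch run against the single-line row from the question. The variable names ($html, @loose, @strict) and the non-capturing form of the pattern are assumptions made for this example only.

```perl
use strict;
use warnings;

# Sample input: the single-line row from the question.
my $html = '<tr><td colspan="2">word</td><td align="right">65&nbsp;&nbsp;</td>'
         . '<td align="right">5,000&nbsp;&nbsp;</td><td align="right">-&nbsp;&nbsp;</td></tr>';

# Unanchored pattern: also picks up the 2 from colspan="2".
my @loose = $html =~ /(\d+(?:,\d+)*|-)/g;

# Anchored pattern: a value counts only when &nbsp; follows it.
my @strict = $html =~ /(\d+(?:,\d+)*|-)(?:&nbsp;)+/g;

print "loose:  @loose\n";    # loose:  2 65 5,000 -
print "strict: @strict\n";   # strict: 65 5,000 -
```

The anchor works only as long as every value cell ends in &nbsp; and nothing else on the page does, which is exactly the kind of assumption an HTML parser would not need.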

brainbowler

Store the string you mentioned in a variable, say $str, and then:

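    use Data::Dumper;
    # \d+(?:,\d+)* also matches single-digit values; the &nbsp; anchor skips digits inside attributes
    my @numbers = $str =~ /(\d+(?:,\d+)*|-)(?:&nbsp;)+/g;
    print Dumper \@numbers;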
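
The question says that "-" stands for 0, so the matched tokens still need a small clean-up step before they can be used as numbers. A minimal sketch, assuming the tokens have already been collected into a list; to_number is a hypothetical helper name, not something from a module:

```perl
use strict;
use warnings;

# Hypothetical helper: turn one matched token into a plain number.
sub to_number {
    my ($token) = @_;
    return 0 if $token eq '-';          # the question treats "-" as zero
    (my $plain = $token) =~ tr/,//d;    # drop thousands separators
    return $plain + 0;                  # force numeric context
}

# Tokens as they would come out of the match on the question's input.
my @tokens  = ('65', '5,000', '-');
my @numbers = map { to_number($_) } @tokens;

print "@numbers\n";   # 65 5000 0
```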
anurag