I have some text similar to this:

<span id="myspan">2,500</span>
<span id="myspan">500</span>

I need a regex pattern to match only the numbers. So, my output for the above matches would be:

  • 2500
  • 500

I have tried this:

(?:\<\bspan\b.*?\bmyspan\b.*?\>)(?<numbers>[,0-9].*?)(?:\</\bspan\b\>)

And this:

(?:\<\bspan\b.*?\bmyspan\b.*?\>)(?<numbers>[0-9].*?)(?:\</\bspan\b\>)
Wiktor Stribiżew
GregoryBrad
  • Just an added detail: I need to do this with regex. I do understand that some string manipulation with HtmlAgilityPack can work, but my current solution doesn't allow for this. The double parsing suggested by @nikos-m is my best bet for now. Can you do a double parse in one expression? – GregoryBrad May 21 '15 at 05:39

4 Answers

It looks like you're heading the wrong way. Basically, regular expressions are not the best tool for parsing HTML.

XML parsers can sometimes be applied, but not always: HTML content is very often not well-formed XML, so an XML parser cannot parse it.

However, it is easy to achieve your goal using Html Agility Pack.

using System;
using System.Linq;

var s = "<span id=\"myspan\">2,500</span><span id=\"myspan\">500</span>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(s);
// Visit every top-level <span> element in the parsed fragment
foreach (var node in doc.DocumentNode.ChildNodes.Where(n => n.Name == "span"))
{
    string value = node.InnerHtml;
    // here you can transform string value to integer and so on
    Console.WriteLine(value);
}

Note: Html Agility Pack can be installed as a NuGet package from Visual Studio.

Andrey Korneyev

This cannot be done with a single regex match, because a match is always a contiguous piece of the input and the comma sits inside the number. It is possible in two passes, applying a different regular expression in each pass.

In the first pass, match the numbers together with any separators (commas, dots, spaces). In the second pass, use a regex to remove the separators and leave only the digits.

example regular expressions:

1st pass: (?:\<\bspan\b.*?\bmyspan\b.*?\>)(?<numbers>[ ,.0-9]+)(?:\</\bspan\b\>)

2nd pass: replace [ .,] with an empty string '' in each matched number
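
The two passes described above could be wired together roughly as follows. This is only a sketch: the patterns are the ones from this answer, while the class name `TwoPassDemo` and the method `Extract` are made up for the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TwoPassDemo
{
    // 1st pass: capture the number, separators included, inside each "myspan" span.
    static readonly Regex SpanNumber = new Regex(
        @"(?:\<\bspan\b.*?\bmyspan\b.*?\>)(?<numbers>[ ,.0-9]+)(?:\</\bspan\b\>)");

    // 2nd pass: the separators to strip from one captured value.
    static readonly Regex Separators = new Regex(@"[ .,]");

    // Hypothetical helper: returns the cleaned-up numbers as strings.
    public static string[] Extract(string html) =>
        SpanNumber.Matches(html)
                  .Cast<Match>()
                  .Select(m => Separators.Replace(m.Groups["numbers"].Value, string.Empty))
                  .ToArray();

    static void Main()
    {
        var html = "<span id=\"myspan\">2,500</span>\n<span id=\"myspan\">500</span>";
        foreach (var number in Extract(html))
            Console.WriteLine(number); // prints 2500, then 500
    }
}
```

The second regex only ever sees the text captured by the first one, which is why it can stay this simple.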

Nikos M.
  • Would it be possible to combine the two passes into one? – GregoryBrad May 21 '15 at 05:40
  • @GregoryBrad: No. If they could be combined, the whole task would be possible with one regular expression, but regular expressions cannot account for context, which is where the two-pass hierarchical approach helps. In a one-pass approach the second regex would have to be context-sensitive in a non-trivial way, while in the two-pass approach it can be a simple regex. – Nikos M. May 21 '15 at 10:47

EDIT (inspired by @AndyKorneyev's answer):

With HtmlAgilityPack, you can obtain the <span> tags you need by selecting those whose id attribute value is myspan.

var txt = "<span id=\"myspan\">2,500</span><span id=\"myspan\">500</span>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(txt);
foreach (var node in doc.DocumentNode.ChildNodes.Where(p => p.Name == "span" && p.HasAttributes && p.GetAttributeValue("id", string.Empty) == "myspan"))
{
   var val = node.InnerHtml;
   Console.WriteLine(val.Replace(",", string.Empty));
}

Outputs:

2500
500

ORIGINAL:

Here is an approach without a regex, using an XElement and Replace:

using System.Linq;
using System.Xml.Linq;

var txxt = "<span id=\"myspan\">2,500</span>\r\n<span id=\"myspan\">500</span>";
var Xelt = XElement.Parse("<root>" + txxt + "</root>"); // wrap in one root so the fragment parses as XML
var vals = Xelt.DescendantsAndSelf("span").Select(p => p.Value.Replace(",", string.Empty)).ToList();

Output (the contents of vals):

2500
500

Or a very weird regex approach that removes all tags and commas:

var result = Regex.Replace(txxt, @"(?><(?:\b|/)[^<]*>|,)", string.Empty);

The result is 2500 and 500, still separated by the \r\n line break from the input.

And if for some reason you insist on your approach, just use look-arounds (the .NET regex engine allows a variable-length look-behind):

var rgx = new Regex(@"(?s)(?<=<\bspan\b[^<]*?\bmyspan\b[^<]*?\>)(?<numbers>[,0-9]*?)(?=</span>)");
var matched = rgx.Matches(txxt).Cast<Match>().Select(p => p.Value.Replace(",", string.Empty)).ToList();
Wiktor Stribiżew
  • I hope my answer helps you. It is also my first answer where I tested HtmlAgilityPack. SO made me believe regex is not the panacea. – Wiktor Stribiżew May 20 '15 at 12:09
  • Do you really think it is correct behaviour to just copy-paste the Html Agility Pack solution from my answer into yours? Since your original answer was posted before mine, it now looks, without checking the revision history, as if I copied your solution. – Andrey Korneyev May 20 '15 at 12:27
  • @AndyKorneyev: Sorry, I agree I had a look at it, but it is not completely the same since you are not checking for the `id` attribute value in your suggestion. I will add that I came up with this thanks to you. – Wiktor Stribiżew May 20 '15 at 12:30
  • @GregoryBrad: Please let me know if anything of that helped you, or if you need further assistance. – Wiktor Stribiżew May 21 '15 at 22:14

stribizhev's approach is good; you shouldn't use regexes to parse HTML/XML when there are better tools available. As for taking only the digits, as an alternative to the proposed p.Value.Replace(",", string.Empty), here is a version that uses LINQ and removes anything that is not a digit:

new string(p.Value.Where(ch => char.IsDigit(ch)).ToArray())

This works because the string class implements IEnumerable<char>.
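
For completeness, here is the same expression as a small standalone program. Treat it as a sketch: the class name `DigitsOnlyDemo` and the method `DigitsOnly` are invented for this example. Note that char.IsDigit also accepts non-ASCII decimal digits, so if only 0 to 9 should pass, the predicate would need to be `ch >= '0' && ch <= '9'` instead.

```csharp
using System;
using System.Linq;

static class DigitsOnlyDemo
{
    // Hypothetical helper: keeps only the characters char.IsDigit accepts.
    public static string DigitsOnly(string text) =>
        new string(text.Where(ch => char.IsDigit(ch)).ToArray());

    static void Main()
    {
        Console.WriteLine(DigitsOnly("2,500")); // prints 2500
        Console.WriteLine(DigitsOnly("500"));   // prints 500
    }
}
```

Unlike Replace(",", string.Empty), this also drops spaces, dots and any other stray characters, which may or may not be what you want for values such as 2.5.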

Konamiman