C# Regular Expressions - Get Second Number, not First

Question

I have the following HTML code:

<td class="actual">106.2% </td>

I extract the number in two steps:

Regex.Matches(html, "<td class=\"actual\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
Regex.Match(m.Groups[1].Value, @"-?\d+.\d+").Value

These two lines give me what I want: 106.2.

The problem is that sometimes the HTML can be a little different, like this:

<td class="actual"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>

In this last case I can only get 107.2, but I want 106.4. Is there a regular expression trick to say that I want the second number in the string rather than the first?

*The problem is that sometimes the HTML can be a little different* is the key phrase. Did you consider using an HTML parser? — Wiktor Stribiżew, Jul 13 '15 at 14:26
Quick way: get all matches and take the last one. Better way: parse the HTML. — Evan Mulawski, Jul 13 '15 at 14:26
Parsing HTML with regular expressions can have [unfortunate consequences](http://stackoverflow.com/a/1732454/67392). Use an HTML parser instead; once you have the right text node, then use a regex on just that text. — Richard, Jul 13 '15 at 14:29
Thanks for your quick feedback! I'm just giving regex one last shot. If it doesn't work as I wish, I'll try an HTML parser. — f4d0, Jul 13 '15 at 15:06
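
The "quick way" from the comments (get all matches and take the last one) could look roughly like the sketch below. It reuses the question's `<td class="actual">` pattern; the names `LastNumberDemo` and `LastNumber` are made up for illustration, and it assumes the wanted value is always the last number inside the cell.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastNumberDemo
{
    // Returns the last number found in the cell's inner HTML, or null if there is none.
    public static string LastNumber(string cellHtml)
    {
        MatchCollection numbers = Regex.Matches(cellHtml, @"-?\d+(?:\.\d+)?");
        return numbers.Count > 0 ? numbers[numbers.Count - 1].Value : null;
    }

    public static void Main()
    {
        string html =
            "<td class=\"actual\">106.2% </td>\n" +
            "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";

        // Step 1: isolate each cell. Step 2: take the last number inside it.
        foreach (Match cell in Regex.Matches(html, "<td class=\"actual\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline))
        {
            Console.WriteLine(LastNumber(cell.Groups[1].Value)); // prints 106.2, then 106.4
        }
    }
}
```

This breaks as soon as another number follows the value inside the same cell, which is one reason the commenters recommend a parser.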

Wiktor Stribiżew · Answer 1 · 2015-07-16T07:06:12.943

Whenever your HTML comes from different providers, or your current provider runs several CMSs that use different HTML formatting styles, it is not safe to rely on regex.

I suggest an HtmlAgilityPack based solution:

public string getCleanHtml(string html)
{
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HtmlAgilityPack.HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}

And then:

var txt = "<td class=\"actual\">106.2% </td>";
var clean = getCleanHtml(txt);
txt = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
clean = getCleanHtml(txt);

Result: "106.2% " (with a trailing space) and "106.4%".

You do not have to worry about formatting tags inside and any XML/HTML entity references.

If your text is a substring of the clean HTML string, then you can use Regex or any other string manipulation methods.

UPDATE:

You seem to need the node values from <td> tags. Here is a handy method for you:

private List<string> GetTextFromHtmlTag(string html, string tag)
{
   var result = new List<string>();
   HtmlAgilityPack.HtmlDocument hap;
   Uri uriResult;
   if (Uri.TryCreate(html, UriKind.Absolute, out uriResult) && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
   { // html is a URL 
       var doc = new HtmlAgilityPack.HtmlWeb();
       hap = doc.Load(uriResult.AbsoluteUri);
   }
   else
   { // html is a string
       hap = new HtmlAgilityPack.HtmlDocument();
       hap.LoadHtml(html);
   }
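   // Descendants() walks the whole tree, so cells nested inside a <table> are found too
   // (ChildNodes only looks at top-level nodes). Where() needs "using System.Linq;".
   var nodes = hap.DocumentNode.Descendants(tag.ToLowerInvariant())
       .Where(p => p.GetAttributeValue("class", string.Empty) == "previous"); // alternative: SelectNodes("//" + tag)
   foreach (var node in nodes)
       result.Add(HtmlAgilityPack.HtmlEntity.DeEntitize(node.InnerText));
   return result;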
}

You can call it like this:

var html = "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>\n<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";
var res = GetTextFromHtmlTag(html, "td");

Result: res holds two strings, "0.9" and "106.4%".

If you need to get only specific tags,

If you have texts with a number inside, and you need just the number, you can use a regex for that:

var rx = new Regex(@"[+-]?\d*\.?\d+"); // Matches "-1.23", "+5", ".677"
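
As a rough illustration of what that pattern picks up, the sketch below runs it over a few already-cleaned strings and converts the match to a decimal. The helper name `ParseNumber` is invented for this example, and using the invariant culture is an assumption so that `.` is always read as the decimal separator.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class NumberRegexDemo
{
    private static readonly Regex Rx = new Regex(@"[+-]?\d*\.?\d+");

    // Returns the first number in the text, or null when no number is present.
    public static decimal? ParseNumber(string text)
    {
        Match m = Rx.Match(text);
        if (!m.Success) return null;
        return decimal.Parse(m.Value, NumberStyles.Float, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        foreach (string text in new[] { "106.4%", "106.2% ", "-1.23", "+5", ".677", "n/a" })
        {
            decimal? value = ParseNumber(text);
            Console.WriteLine("\"{0}\" -> {1}", text,
                value.HasValue ? value.Value.ToString(CultureInfo.InvariantCulture) : "no number");
        }
    }
}
```

Because the match ignores whatever surrounds the digits, a trailing `%` or space does not matter once the tags are gone.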


I have worked with HtmlAgilityPack in the past, and it took a lot more code to do the same thing. For now I am trying to get the information with regex alone; if I can't, I'll take your advice and use this HTML parser. — f4d0, Jul 13 '15 at 15:02
It's not only about numbers inside text, because there are other numbers; it's about collecting the right number :) Anyway, thanks for the information. — f4d0, Jul 16 '15 at 08:22
Sure, when it comes to getting the right number from a simple string, regex is obligatory, but when it is HTML, you are much safer with the parser that will serve you the right tagged texts first without much pain. I hope one day you will come back to this post. — Wiktor Stribiżew, Jul 16 '15 at 08:42

Sky Fang · Answer 2 · 2015-07-13T23:10:46.100

1

string html = @"<td class=""actual""><span class=""revised worse"" title=""Revised From 107.2%"">106.4%</span></td>
<td class=""actual"">106.2% </td>";
string pattern = @"<td\s+class=""actual"">.*(?<=>)(.+?)(?=</).*?</td>";
foreach (Match match in Regex.Matches(html, pattern))
{
    Console.WriteLine(match.Groups[1].Value);
}

I have changed the regex as you wished. The output is:

106.4%
106.2%

edited Jul 13 '15 at 23:10

answered Jul 13 '15 at 14:32

Sky Fang


Your regular expression works only if the numbers have % at the end. But what if they have nothing after them, or something else entirely? Can you please adjust the regular expression for that? Thanks in advance. – f4d0 Jul 13 '15 at 14:59
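
One way to stay independent of what follows the number, sketched here under assumptions rather than taken from the answer: strip the tags (and with them the `title` attribute that holds the old value) from each cell first, then look for a number in what remains. `StripTags` and `NumbersInCells` are hypothetical helper names, and the tag pattern assumes no attribute value contains a literal `>`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StripTagsDemo
{
    // Removes anything that looks like a tag, attributes included.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    // Collects the first number of each <td class="actual"> cell after tag removal.
    public static List<string> NumbersInCells(string html)
    {
        var result = new List<string>();
        foreach (Match cell in Regex.Matches(html, "<td\\s+class=\"actual\">(.*?)</td>", RegexOptions.Singleline))
        {
            Match number = Regex.Match(StripTags(cell.Groups[1].Value), @"-?\d+(?:\.\d+)?");
            if (number.Success) result.Add(number.Value);
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>\n" +
            "<td class=\"actual\">106.2% </td>\n" +
            "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>";
        Console.WriteLine(string.Join(", ", NumbersInCells(html))); // prints 106.4, 106.2, 0.9
    }
}
```

This is still regex-on-HTML, so it shares the fragility the commenters warn about; it only removes the dependency on a trailing `%`.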

score 1 · Answer 3 · answered Jul 13 '15 at 14:42

Try the XML method:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;


namespace ConsoleApplication34
{
    class Program
    {

        static void Main(string[] args)
        {
            string input = "<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>";

            XElement element = XElement.Parse(input);

            string value = element.Descendants("span").Select(x => (string)x).FirstOrDefault();

        }

    }

}

jdweng


This might work in this case, but will fail if the input is not XML-valid. – Wiktor Stribiżew Jul 13 '15 at 15:06
Agree, but when a limited amount of sample data is posted you never know what the user is really trying to do. – jdweng Jul 14 '15 at 09:24
I still believe the key is that the *HTML can be a little different*. And that means we cannot be sure of the input quality let alone XML validity. Still, I like `XElement`. – Wiktor Stribiżew Jul 14 '15 at 09:27
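
A hedged variant of the `XElement` idea that covers both shapes of the cell: read the element's whole text value instead of looking for a `span`, and treat a parse failure as "not valid XML". `TryGetCellText` is a name invented for this sketch.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XElementDemo
{
    // Returns the concatenated text of the fragment, or null if it is not well-formed XML.
    public static string TryGetCellText(string fragment)
    {
        try
        {
            // Value joins all descendant text nodes; attribute values such as title are not included.
            return XElement.Parse(fragment).Value.Trim();
        }
        catch (XmlException)
        {
            return null; // e.g. unclosed tags or HTML-only entities such as &nbsp;
        }
    }

    public static void Main()
    {
        Console.WriteLine(TryGetCellText("<td class=\"actual\">106.2% </td>")); // 106.2%
        Console.WriteLine(TryGetCellText("<td class=\"actual\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>")); // 106.4%
        Console.WriteLine(TryGetCellText("<td class=\"actual\">106.2%<br></td>") ?? "not valid XML"); // not valid XML
    }
}
```

The last call shows the limitation raised in the comments: an unclosed `<br>` is fine in HTML but makes the XML parser throw.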

score 1 · Accepted Answer · answered Jul 13 '15 at 16:14

I want to share the solution I have found for my problem.

So, I can have HTML tags like the following:

<td class="previous"><span class="revised worse" title="Revised From 1.3">0.9</span></td>
<td class="previous"><span class="revised worse" title="Revised From 107.2%">106.4%</span></td>

Or simpler:

<td class="previous">51.4</td>

First, I take the entire cell, through the following code:

MatchCollection mPrevious = Regex.Matches(html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);

And second, I use the following code to extract the numbers only:

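foreach (Match m in mPrevious)
{
    if (m.Groups[1].Value.Contains("span"))
    {
        // Match the old and new numbers together, e.g. 107.2%">106.4 (the unescaped '.' also covers the % sign).
        string stringtemp = Regex.Match(m.Groups[1].Value, "-?\\d+.\\d+.\">-?\\d+.\\d+|-?\\d+.\\d+\">-?\\d+.\\d+|-?\\d+.\">-?\\d+|-?\\d+\">-?\\d+").Value;
        int indextemp = stringtemp.IndexOf(">");
        if (indextemp <= 0) continue; // nothing matched in this cell: skip it instead of aborting the whole loop
        lPrevious.Add(stringtemp.Remove(0, indextemp + 1)); // keep only what follows the '>'
    }
    else lPrevious.Add(Regex.Match(m.Groups[1].Value, @"-?\d+.\d+|-?\d+").Value);
}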

First I check whether there is a SPAN tag. If there is, I match the two numbers together (the regular expression covers the different possibilities), find the '>' character that separates them, and remove everything up to and including it, which leaves only the number I want.

It works perfectly.

Thank you all for the support and quick answers.
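
For comparison, the `span` branch and the four alternatives above could probably be collapsed into one pattern. The sketch below assumes the wanted number is always the first thing in the cell or the first thing after a tag's closing `>`; `ExtractPrevious` is a hypothetical name and not part of the original solution.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PreviousValuesDemo
{
    public static List<string> ExtractPrevious(string html)
    {
        var lPrevious = new List<string>();
        MatchCollection mPrevious = Regex.Matches(html, "<td class=\"previous\">\\s*(.*?)\\s*</td>", RegexOptions.Singleline);
        foreach (Match m in mPrevious)
        {
            // A number at the start of the cell, or directly after a '>' (optional whitespace allowed).
            // The old value inside title="Revised From ..." is preceded by a space, so it is skipped.
            Match number = Regex.Match(m.Groups[1].Value, @"(?:^|>)\s*(-?\d+(?:\.\d+)?)");
            if (number.Success) lPrevious.Add(number.Groups[1].Value);
        }
        return lPrevious;
    }

    public static void Main()
    {
        string html =
            "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 1.3\">0.9</span></td>\n" +
            "<td class=\"previous\"><span class=\"revised worse\" title=\"Revised From 107.2%\">106.4%</span></td>\n" +
            "<td class=\"previous\">51.4</td>";
        Console.WriteLine(string.Join(", ", ExtractPrevious(html))); // prints 0.9, 106.4, 51.4
    }
}
```

Using a capture group avoids the separate `IndexOf` and `Remove` step, and escaping the dot keeps the pattern from matching unrelated characters between digits.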

Please consider upvoting those answers that proved helpful and accept the one that has worked for you. Mind that regex is not recommended when getting specific values from HTML marked-up document. I have updated my answer with a specific method that extracts clean text from `td` and other tags in HTML document. — Wiktor Stribiżew, Jul 16 '15 at 06:56
@stribizhev I have upvoted all of them, because all of them were very helpful! I did not mark any as the right answer because none of them alone solved what I needed. Maybe the solution from Sky Fang would, but I did not test his new solution; I used mine. — f4d0, Jul 16 '15 at 08:20
Sure, you can accept your answer as well, it is great that you could solve the issue yourself. Keep it up! — Wiktor Stribiżew, Jul 16 '15 at 08:40
@stribizhev I did not solve it alone! I solved it thanks to you and the others who replied. Without all of you it would have taken me much more time. — f4d0, Jul 17 '15 at 15:43
