Why doesn't this regex match any of the following strings?

string regx = "<td\\s+class=\"inline-rating-sm\"\\s+data-ci=\"\\d + \">\\s+(\\d+)</td>";

Test strings:

<td class="inline-rating-sm" data-ci="943"> (150)</td>
<td class="inline-rating-sm" data-ci="922"> (66)</td>
Michał Turczyn
medo ampir
I'd consider a DOM parsing library. – Daniel A. White Oct 07 '18 at 16:24
Don't write "a regex to match strings". The way to write a regex is basically to take the string you want to match, escape everything in it that is a special regex symbol, and then replace any variable content (like the numbers here) with expressions. I also strongly advise using an editor that supports regex highlighting. There are plenty of online regex testers that can do that. – Nyerguds Oct 07 '18 at 16:32
More info here: https://stackoverflow.com/a/1732454/397817 – Stephen Kennedy Oct 07 '18 at 19:28

3 Answers

Because

\"\\d + \">

matches a literal ", then exactly one digit, then one or more spaces, then one more space, then a literal ">. None of the test strings has spaces inside the data-ci value, so the match fails at that point. I think you want

\"\\d+\">

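Additionally, you're not escaping the ( ) parentheses, which denote a capture group in regex, so the literal parentheses in the test strings are never matched. Escaping the / in </td> is optional in .NET; writing \/ is harmless.

You might also want to use a verbatim string literal (the @ prefix).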

var regx = @"<td\s+class=""inline-rating-sm""\s+data-ci=""\d+"">\s+\(\d+\)<\/td>";

It's more readable without constant \\ escaping.
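To make the difference concrete, here is a minimal sketch that runs the pattern from the question and a corrected pattern against the two test strings. The class name RatingRegexDemo and the helper Extract are hypothetical, and the inner capture group (\d+) is an assumption that the number between the literal parentheses is the value you want to extract.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RatingRegexDemo
{
    // Pattern from the question: "\d + " needs one digit followed by at least two spaces.
    public const string Original =
        "<td\\s+class=\"inline-rating-sm\"\\s+data-ci=\"\\d + \">\\s+(\\d+)</td>";

    // Corrected pattern: \d+ for the attribute value, \( and \) for the literal
    // parentheses, and an inner group (\d+) to capture the number (assumption).
    public const string Corrected =
        @"<td\s+class=""inline-rating-sm""\s+data-ci=""\d+"">\s+\((\d+)\)</td>";

    // Hypothetical helper: returns the text of group 1 for every match.
    public static List<string> Extract(string pattern, string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        string input =
            "<td class=\"inline-rating-sm\" data-ci=\"943\"> (150)</td>\n" +
            "<td class=\"inline-rating-sm\" data-ci=\"922\"> (66)</td>";

        Console.WriteLine(Extract(Original, input).Count);               // 0
        Console.WriteLine(string.Join(",", Extract(Corrected, input))); // 150,66
    }
}
```

If you only need to test for a match and not extract the number, the inner group can be dropped and Regex.IsMatch used instead.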

V0ldek
Because the following syntax is special in regex: (...) denotes a capturing group.

If you want to match parentheses literally, you need to escape them: \\( and \\) (I used a double backslash: the first one escapes the second in the C# string, so a single backslash reaches the regex engine and escapes the parenthesis :) ).

Escaping the / in </td> is optional in .NET, since / is not a special character there; the corrected pattern below escapes it anyway, which is harmless.

Modify your pattern to: <td\s+class="inline-rating-sm"\s+data-ci="\d*">\s+\(\d+\)<\/td> (remember to double the backslashes in a regular C# string literal :) ).
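As a small illustration of the escaping point, the sketch below writes the same pattern fragment once as a regular C# string literal and once as a verbatim literal, then checks that both produce the identical regex. The class name EscapingDemo and the shortened fragment are assumptions made for the example; Regex.Escape is the standard .NET helper for escaping literal text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Regular string literal: each regex backslash is doubled, each quote is \".
    public const string Regular = "data-ci=\"\\d*\">\\s+\\(\\d+\\)";

    // Verbatim string literal: backslashes are written once, each quote is doubled.
    public const string Verbatim = @"data-ci=""\d*"">\s+\(\d+\)";

    public static void Main()
    {
        // Both literals yield the same pattern text.
        Console.WriteLine(Regular == Verbatim);                               // True

        // The fragment matches the relevant part of a test string.
        Console.WriteLine(Regex.IsMatch("data-ci=\"943\"> (150)", Verbatim)); // True

        // Regex.Escape shows what literal parentheses look like once escaped.
        Console.WriteLine(Regex.Escape("(150)"));                             // \(150\)
    }
}
```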

Michał Turczyn
As an answer to the underlying problem, in many circumstances an XPath expression to match them is a better option, and can be simpler and more robust.

For example, I added the HtmlAgilityPack to a new project with "Tools" -> "NuGet Package Manager" -> "Manage NuGet Packages for Solution..." and used this:

static void Main(string[] args)
{
    string h = @"<html><head><title></title></head><body>
<table class=""table"">
<tr><th scope=""row"">Not this</th><td>123</td></tr>
<tr><th scope=""row"">Or this</th><td>456</td></tr>
<tr><td class=""inline-rating-sm"" data-ci=""943""> (150)</td><td class=""inline-rating-sm"" data-ci=""922""> (66)</td></tr>
</table>
</body></html>";

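    // Parse the HTML string into a DOM (requires "using System;" for Console).
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(h);

    // Note: SelectSingleNode and SelectNodes return null when nothing matches,
    // so add null checks before using this on real-world input.
    var table = doc.DocumentNode.SelectSingleNode(@"//table[@class='table']");
    var cells = table.SelectNodes(@".//td[@class='inline-rating-sm' and @data-ci]");

    // do something with the cells...
    foreach (var cell in cells)
    {
        Console.WriteLine(cell.GetAttributeValue("data-ci", "") + " " + cell.InnerText.Trim());
    }

    Console.ReadLine();
}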

to output:

943 (150)
922 (66)

Andrew Morton