
Below is some C# code. I have tried the regular expression shown in it, but for some reason I am not getting the desired output. The HTML in the code is just an example. The code compiles with any C# compiler.

Here is the code.

var x = @"
    <html>
        <table>
            <tr>
                <td class=""l w60"">Adjustments:<input id=""textbox1"" type=""textbox"" name=""textbox1"" data-label-text=""Misc. Comment12""/> </td>
                <td class=""l w60"">Adjustments:<input id=""textbox1"" type=""textbox"" name=""textbox1"" data-label-text=""Misc. Comment13""/> </td>
                <td class=""l w60"">Adjustments:<input id=""textbox1"" type=""textbox"" name=""textbox1"" No match=""Misc. Comment13""/> </td>
            </tr>           
        </table>            
    </html>";

Regex regex = new Regex(@"[\n\r].*data-label-text=""\s*([^\n\r]*)");
MatchCollection matchList = regex.Matches(x);
var list = matchList.Cast<Match>().Select(match => match.Value).ToList();

When I see the contents of the list I find these two values.

1. <td class="l w60">Adjustments:<input id="textbox1" type="textbox" name="textbox1" data-label-text="Misc. Comment12"/> </td>

2. <td class="l w60">Adjustments:<input id="textbox1" type="textbox" name="textbox1" data-label-text="Misc. Comment13"/> </td>

But this is not the desired output. The desired output is given below.

1. Misc. Comment12

2. Misc. Comment13

The regex needs to be modified to produce the desired output, but I am not good with regular expressions. How should I change it so that only the attribute values are returned?

Arulkumar
Arya220
  • As a sample `(?<=data-label-text=").*?(?=")`. But you'd better use HTML parsing lib. – Ulugbek Umirov Mar 30 '15 at 11:08
  • As @UlugbekUmirov said, you should use an HTML parsing lib like the [HtmlAgilityPack](https://htmlagilitypack.codeplex.com/). [You should not use regex to parse html](http://stackoverflow.com/q/1732348/1248177). – aloisdg Mar 30 '15 at 12:22

1 Answer

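You can use a look-behind together with a more restrictive character class, one that also excludes the double quote, so the match stops at the attribute's closing ". Because a look-behind is not part of the match, `match.Value` then holds only the attribute value: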

  Regex regex = new Regex(@"(?<=[\n\r].*data-label-text="")\s*([^\n\r""]*)");

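Or a variant that anchors on the attribute name alone, so it does not depend on a preceding line break. Note that the `\s*` in the look-behind can match zero characters and `[^""]*` is greedy, so any spaces just inside the quotes stay in the match; call `.Trim()` on the value if you need them removed: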

  Regex regex = new Regex(@"(?<=\sdata-label-text=""\s*)[^""]*(?=\s*"")");

Output (both patterns give the same result for the sample HTML):

1. Misc. Comment12

2. Misc. Comment13
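
If look-arounds feel hard to read, a capture group is a common alternative. The following is only a minimal sketch, not part of the original answer: the class `LabelTextExtractor`, its `Extract` method, and the shortened sample HTML are made up for illustration. It reads the value from `Groups[1]` instead of `match.Value` and trims it in code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LabelTextExtractor
{
    // Hypothetical helper (not from the original post): returns the trimmed
    // value of every data-label-text="..." attribute found in the input.
    public static List<string> Extract(string html)
    {
        // Group 1 captures everything between the attribute's double quotes.
        var regex = new Regex(@"\sdata-label-text=""([^""]*)""");
        return regex.Matches(html)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value.Trim())
                    .ToList();
    }

    public static void Main()
    {
        // Shortened sample; the second value has spaces inside the quotes on purpose.
        var html = @"
<td><input id=""textbox1"" data-label-text=""Misc. Comment12""/> </td>
<td><input id=""textbox1"" data-label-text="" Misc. Comment13 ""/> </td>
<td><input id=""textbox1"" No match=""Misc. Comment13""/> </td>";

        foreach (var value in Extract(html))
            Console.WriteLine(value);
        // Prints:
        // Misc. Comment12
        // Misc. Comment13
    }
}
```

Trimming in code rather than in the pattern keeps the regex short. For anything beyond a quick one-off, the HTML parser suggested in the comments is still the more robust choice.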

Wiktor Stribiżew