I get the following HTML from a web page.

My source (HTML):

<dl class="field-dl output-field-dl" >
    <dt class="field-dt output-field-dt">
        <label><span>Product Code:</span></label>
    </dt>
    <dd class="field-dd output-field-dd ">
            0234567
    </dd>
</dl>

<dl class="field-dl output-field-dl" >
    <dt class="field-dt output-field-dt">
        <label><span>Per no:</span></label>
    </dt>
    <dd class="field-dd output-field-dd ">
            123456
    </dd>
</dl>

How do I extract my product code?

My current code is here:

var rx = new Regex("<span>Product Code:</span></label></dt><dd class=\"field-dd output-field-dd \">(.*?)</dd>\\s");            
var m = rx.Matches(kaynak);
foreach (Match match in m)
{
    string key = match.Groups[1].Value;
}

Thanks!

Ferhat
  • You'll find that using regular expressions to process XML/HTML becomes very difficult and unwieldy in most non-trivial cases. Check these questions for some discussions, solutions, and alternatives: http://stackoverflow.com/questions/787932/using-c-sharp-regular-expressions-to-remove-html-tags?lq=1, http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Chris Dec 26 '13 at 13:02
  • It is XML (and also HTML), so do not use Regex. Just parse it as XML. – Konrad Kokosa Dec 26 '13 at 13:02
  • Will the input always look like the code you gave? If it does, you could just use the regex `[0-9]+`. – The Guy with The Hat Dec 26 '13 at 13:13

2 Answers

The example HTML contains much more whitespace than your regex allows for. You could add \s* before every < and after every >, but only one \s* where a > is directly followed by a <. Something like:

new Regex("<span>\\s*Product Code:\\s*</span>\\s*</label>\\s*</dt>\\s*<dd class=\"field-dd output-field-dd \">(.*?)</dd>\\s");

The capture group (.*?) may be too generous. I would suggest ([^<>]*), which also matches across line breaks. If you prefer to keep the ., you need to make . match newlines as well, because the value sits on its own line between the tags. So consider using:

new Regex(... , RegexOptions.Singleline);

However, as others say, it is probably better to use HTML or XML parsing routines. This answer is intended to relate to just the regex part of your question.
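
To pull the suggestions above together, here is a minimal sketch of a whitespace-tolerant pattern. The class and method names (ProductCodeRegex, Extract) are made up for illustration, and matching the opening tag with <dd[^>]*> instead of the exact class attribute is my own assumption, not something the question requires. It only works while the markup keeps the shape shown in the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProductCodeRegex
{
    // Hypothetical helper: returns the trimmed product code, or null when nothing matches.
    public static string Extract(string html)
    {
        // \s* tolerates the indentation and line breaks between tags.
        // [^<>]*? captures text without tags; a negated class also matches newlines,
        // so RegexOptions.Singleline is not needed here.
        var rx = new Regex(
            @"<span>\s*Product Code:\s*</span>\s*</label>\s*</dt>\s*<dd[^>]*>\s*([^<>]*?)\s*</dd>",
            RegexOptions.IgnoreCase);

        Match m = rx.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With the variable from the question, the call would be ProductCodeRegex.Extract(kaynak). The \s* on both sides of the lazy group keeps the surrounding whitespace out of the captured value.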

AdrianHHH

You can use LINQ to XML:

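// Requires: using System; using System.Linq; using System.Xml.Linq;
// XElement.Parse needs a single root element, so wrap the fragment.
XElement doc = XElement.Parse("<root>" + html + "</root>");
var query = doc.Descendants("dd").Select(elem => elem.Value.Trim()).ToList();

foreach (var v in query)
    Console.WriteLine(v);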
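
The query above returns every dd value. If only the product code is wanted, one possible sketch is to pick the dl whose dt text is "Product Code:" and read its dd. The names ProductCodeXml and Extract are hypothetical, and the code assumes the fragment is well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ProductCodeXml
{
    // Hypothetical helper: returns the product code, or null when no matching <dl> exists.
    public static string Extract(string html)
    {
        // Assumption: the fragment is well-formed XML. It is wrapped because
        // XElement.Parse requires exactly one root element.
        XElement root = XElement.Parse("<root>" + html + "</root>");

        return root.Descendants("dl")
            // Keep only the <dl> whose <dt> text is the label we are looking for.
            .Where(dl => dl.Elements("dt").Any(dt => dt.Value.Trim() == "Product Code:"))
            .Select(dl => dl.Element("dd"))
            .Where(dd => dd != null)
            // Trim removes the indentation and line breaks around the value.
            .Select(dd => dd.Value.Trim())
            .FirstOrDefault();
    }
}
```

Matching on the label text instead of the position of the dd keeps the lookup working if the order of the fields changes.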
w.b