c# regex parsing

Question

I am trying to parse data out of a very long HTML document. I am pasting only the part I am interested in:

Technical Details

<div class="content">

    <ul style="list-style: disc; padding-left: 25px;">

      <li>1920x1080 Full HD 60p/24p Recording w/7MP still image</li>
      <li>32GB Flash Memory for up to 13 hours (LP mode) of HD recording</li>
      <li>Project your videos on the go anywhere, anytime.</li>
      <li>Wide Angle G lens to capture everything you want.</li>
      <li>Back-illuminated "Exmor R" CMOS sensor for superb low-light video</li>

    </ul>

  <div id="technicalProductFeatures"></div>

I need to start parsing from:

<div class="content">

until

<ul

and then until

</ul>

I have tried the following regex, but it did not work:

Regex specsRegex = new Regex ("<div class=\"content\">[\\s]*<ul.[\\s]*</ul>");

This gives me no matches.

One other issue: sometimes there is a line break between the opening div and ul tags and sometimes there is not, like:

<div class="content">
<ul style="list-style: disc; padding-left: 25px;">

or

<div class="content">

<ul style="list-style: disc; padding-left: 25px;">

Thanks for any help.

please see this http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — Matt, Oct 03 '11 at 14:23
Won't someone think of the children?! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — rrhartjr, Oct 03 '11 at 14:24
I really encourage you to read answers to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Łukasz Wiatrak, Oct 03 '11 at 14:27
Lucasus has clearly spent three minutes searching for that link :) — sehe, Oct 03 '11 at 14:30
`\s` means whitespace in regex. Your `[\\s]*` are only matching strings of whitespace (spaces, tabs, etc.). — Justin Morgan - On strike, Oct 03 '11 at 14:31
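
To make the last comment concrete, here is a minimal sketch (not from the original thread) of why the pattern in the question finds nothing: after `<ul`, the `.` consumes exactly one character and `[\s]*` can only consume whitespace, so the rest of the tag's attributes and the list items are never matched. The class name `WhyNoMatch`, the shortened `Html` sample, and the `OpeningOnly` pattern are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhyNoMatch
{
    // Shortened stand-in for the markup in the question (hypothetical sample).
    public const string Html =
        "<div class=\"content\">\n\n" +
        "    <ul style=\"list-style: disc; padding-left: 25px;\">\n" +
        "      <li>Wide Angle G lens</li>\n" +
        "    </ul>\n";

    // The pattern from the question, rewritten as a verbatim string.
    public const string Original = @"<div class=""content"">[\s]*<ul.[\s]*</ul>";

    // Only the opening part: the div, optional whitespace, then the whole <ul ...> tag.
    public const string OpeningOnly = @"<div class=""content"">\s*<ul[^>]*>";

    public static void Main()
    {
        // '.' eats the space after "<ul", then "[\s]*" hits "style" and gives up.
        Console.WriteLine(Regex.IsMatch(Html, Original));    // False
        // "[^>]*" skips over the attributes, whatever they contain.
        Console.WriteLine(Regex.IsMatch(Html, OpeningOnly)); // True
    }
}
```

Because `\s*` already matches zero or more spaces, tabs, and line breaks, the same opening pattern should cover both layouts mentioned in the question (with or without a blank line between the two tags).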

Steve Wortham · Accepted Answer · 2011-10-03T15:14:13.063

3

I wouldn't suggest using regular expressions for this. It's like trying to fix a tire with a hammer. The hammer is a good tool, but it's not for everything.

I'd use Html Agility Pack. It's not clear to me exactly what you're looking to extract. But I'll assume it's the list items. So you'd do something like this...

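// Parse the page into a DOM. YourHtmlGoesHere is the string holding your HTML.
var hdoc = new HtmlAgilityPack.HtmlDocument();
hdoc.LoadHtml(YourHtmlGoesHere);

// Select every <li> under /html/body/div/ul; adjust the XPath to your page's structure.
// SelectNodes returns null (not an empty collection) when nothing matches.
var MatchingNodes = hdoc.DocumentNode.SelectNodes("/html/body/div/ul/li");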

As you can see, the syntax for the Html Agility Pack is based on XPath and is much simpler for this task. It's also much more robust, and something as silly as nested tags or a comment is not going to throw it off. Those types of things can throw off even the most carefully written regular expression in this scenario.

UPDATE

If you were determined to create a quick & dirty regular expression for this, it'd be something like this...

<div class="content">.*?</ul>

Ordinarily, the . matches any character except a line feed, so the .*? part matches zero or more such characters, as few as possible. Be sure to use RegexOptions.Singleline so that the . will match line feeds as well. This should work for the example you've given, but a commented-out bit of code containing </ul>, or a nested <ul></ul>, could throw it off.
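
As a rough illustration of the point about RegexOptions.Singleline and the comment caveat, the following sketch runs the quick & dirty pattern with and without the option. The class name `SinglelineDemo`, the `Grab` helper, and both sample strings are assumptions made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // The quick & dirty pattern from the answer, as a verbatim string.
    public const string Pattern = @"<div class=""content"">.*?</ul>";

    // Shortened stand-in for the markup in the question (hypothetical sample).
    public const string Html =
        "<div class=\"content\">\n" +
        "<ul style=\"list-style: disc; padding-left: 25px;\">\n" +
        "<li>Wide Angle G lens</li>\n" +
        "</ul>\n";

    // Same idea, but with a comment that contains a stray </ul>.
    public const string HtmlWithComment =
        "<div class=\"content\">\n" +
        "<ul>\n" +
        "<!-- old list </ul> -->\n" +
        "<li>Wide Angle G lens</li>\n" +
        "</ul>\n";

    // Returns the matched text, or null when the pattern does not match.
    public static string Grab(string html, RegexOptions options)
    {
        Match m = Regex.Match(html, Pattern, options);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Without Singleline, '.' cannot cross '\n', so nothing matches.
        Console.WriteLine(Grab(Html, RegexOptions.None) == null);                           // True
        // With Singleline, the match runs from the div to the closing </ul>.
        Console.WriteLine(Grab(Html, RegexOptions.Singleline).EndsWith("</ul>"));           // True
        // The lazy match stops at the first </ul>, even the one inside the comment.
        Console.WriteLine(Grab(HtmlWithComment, RegexOptions.Singleline).Contains("<li>")); // False
    }
}
```

The last line shows the failure mode described above: the match ends inside the comment, before any list item has been captured.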

UPDATE #2

This will grab everything between the <ul></ul>...

(?<=<div class="content">\s*<ul[^>]*>).*?(?=</ul>)

Again, be sure to use RegexOptions.Singleline.
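
For what it's worth, here is one way the lookbehind pattern might be wired up in C#, followed by a second pass that pulls out the individual list items. This is a sketch under assumptions: the class name `ListBodyDemo`, the `ExtractItems` helper, the shortened `Html` sample, and the simple `<li>(.*?)</li>` item pattern are all made up for illustration, and the item pattern assumes plain `<li>` tags without attributes or nesting.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ListBodyDemo
{
    // Shortened stand-in for the markup in the question (hypothetical sample).
    public const string Html =
        "<div class=\"content\">\n\n" +
        "    <ul style=\"list-style: disc; padding-left: 25px;\">\n" +
        "      <li>Wide Angle G lens to capture everything you want.</li>\n" +
        "      <li>Project your videos on the go anywhere, anytime.</li>\n" +
        "    </ul>\n" +
        "  <div id=\"technicalProductFeatures\"></div>";

    // The pattern from UPDATE #2, as a verbatim string ("" is an escaped quote).
    public const string BodyPattern =
        @"(?<=<div class=""content"">\s*<ul[^>]*>).*?(?=</ul>)";

    public static List<string> ExtractItems(string html)
    {
        var items = new List<string>();

        // Step 1: grab everything between the opening <ul ...> and </ul>.
        Match body = Regex.Match(html, BodyPattern, RegexOptions.Singleline);
        if (!body.Success)
            return items;

        // Step 2: pull the text of each <li> out of that fragment.
        foreach (Match li in Regex.Matches(body.Value, @"<li>(.*?)</li>", RegexOptions.Singleline))
            items.Add(li.Groups[1].Value.Trim());

        return items;
    }

    public static void Main()
    {
        foreach (string item in ExtractItems(Html))
            Console.WriteLine(item);
    }
}
```

The variable-length lookbehind (`\s*` and `[^>]*` inside `(?<=...)`) works in .NET, but many other regex engines reject it, so the pattern may not port as-is.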

edited Oct 03 '11 at 15:14

answered Oct 03 '11 at 14:31

Steve Wortham


I don't have HtmlAgilityPack loaded; also, I wanna learn how I can strip that data with Regex. Regex is not the best tool but still I wanna learn how I can do that. – Val Nolav Oct 03 '11 at 14:40
Based on his description, I think you want `
.*?(?=
`. It sounds like he wants to pull two result sets out of there.
– Justin Morgan - On strike Oct 03 '11 at 14:56
Thanks Justin. Your code is good but still does not work. I changed it as [\\s]*(?=
– Val Nolav Oct 03 '11 at 15:02
Steven I am sorry to take your precious time. Your code works but the problem is I need to start parsing after
not after
and sometimes there is a line break after
and sometimes there is not.
– Val Nolav Oct 03 '11 at 15:08
Thanks Steve, I am trying your update. Once I learn the Regex style, I will use HtmlAgilityPack in my actual code. I installed HtmlAgilityPack. However, I really need to learn the Regex part too for my future projects. – Val Nolav Oct 03 '11 at 15:21
Steve your updated code worked like a charm I just made some minor changes. However, in HtmlAgilityPack I got a problem. I am using c# and there is no MatchingNodes in c#. Do I need to add using references? – Val Nolav Oct 03 '11 at 15:32
@Val - No, MatchingNodes is just a variable name. I declared it as a var, but it's actually of type `HtmlNodeCollection`. – Steve Wortham Oct 03 '11 at 15:46

score 2 · Answer 2 · answered Oct 03 '11 at 14:24

2

Regex isn't the best tool to parse html (to put it mildly). Use HtmlAgilityPack.

answered Oct 03 '11 at 14:24

Hans Kesting


probably not but I wanna use Regex – Val Nolav Oct 03 '11 at 14:29
@ValNolav: nobody is stopping you. Just don't expect many people to help you - this site is about helping and answers are _judged_ by their _quality_. That means that many people are not about to spend much time writing tedious answers of lower quality... – sehe Oct 03 '11 at 14:32
@Val - Why do you want to use regex? If you think it will save you time or effort, it won't. Regex is not the right tool for this. – Justin Morgan - On strike Oct 03 '11 at 14:37
I just wanna learn how I can do that. There are many occasions I meet a line break and I wanna parse it with Regex not only with HTML but also in other text files. – Val Nolav Oct 03 '11 at 14:41
@Matt - What he's trying to do is not impossible; it's been shown several times that regex can be part of an HTML parser, especially for a limited set of HTML. Full-featured regex engines can incorporate recursion, balanced groups, etc. that make parsing HTML at least possible. However, it is an *enormous* headache, and far more complicated than it seems. It's really not something anyone should do in most circumstances. – Justin Morgan - On strike Oct 03 '11 at 14:42
@Val - If you want to know how to work with line breaks, or parse non-HTML text, I suggest you submit a different question focusing on what you want to parse. This question is much bigger than those issues, and they are likely to get lost in the furor. We will gladly help you understand regex, but the first step is to have a task that regex is useful for. – Justin Morgan - On strike Oct 03 '11 at 14:45
Justin, I know a lot of things in Regex but I am poor at line breaks. I used Regex in many codes and did a lot of successful parsing both in html and text documents. However, for a long time I have really wanted to learn these line breaks and that's why I asked this question, and I learned something new with HtmlAgilityPack. I will try to get used to that for my html related projects. – Val Nolav Oct 03 '11 at 15:49
