0

I am looking to get the html that is included between the following text:

<ul type="square">  
</ul>

What's the most efficient way?

jason
Ian Vink
  • 2
    Have you considered using an HTML parser instead? Someone's going to ask you this question anyway; figured I'd be the first. Regex is not an ideal tool for parsing HTML. – Robert Harvey Sep 26 '11 at 15:10
  • Use an HTML parser. Don't use regex with HTML because HTML is not a regular language. – Bala R Sep 26 '11 at 15:10
  • 1
    Take a look at this other SO question: http://stackoverflow.com/questions/100358/looking-for-c-html-parser – dee-see Sep 26 '11 at 15:10
  • Don't use regular expressions to parse HTML, use an HTML parser instead. Regular expressions will never be 100% reliable for parsing HTML – Clive Sep 26 '11 at 15:11
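
For anyone wondering what the "use a parser" advice looks like in practice, here is a rough sketch in Python using only the standard library's `html.parser`. The thread doesn't settle on a language, so Python is my assumption, and the class name `SquareListExtractor` is made up for illustration. It tracks `<ul>` nesting depth so that it stops at the matching `</ul>`, and it tolerates sloppy markup such as unclosed `<li>` tags. Comments inside the list are dropped and end tags are re-emitted in normalized form, so treat it as a starting point rather than a faithful serializer.

```python
from html.parser import HTMLParser


class SquareListExtractor(HTMLParser):
    """Collects the markup inside the first <ul type="square"> element."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0      # nesting depth of <ul> elements inside the target
        self.done = False   # set once the matching </ul> has been seen
        self.parts = []

    def _capturing(self):
        return self.depth > 0 and not self.done

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            self.parts.append(self.get_starttag_text())
            if tag == "ul":
                self.depth += 1
        elif tag == "ul" and dict(attrs).get("type") == "square":
            self.depth = 1  # entered the target list; do not record its own tag

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._capturing():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self._capturing():
            return
        if tag == "ul":
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # this was the matching </ul>
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._capturing():
            self.parts.append(data)

    def handle_entityref(self, name):
        if self._capturing():
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self._capturing():
            self.parts.append(f"&#{name};")


def extract_square_list(markup):
    parser = SquareListExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


sample = '<ul type="square"><li>a<ul><li>b</li></ul></li></ul><p>after</p>'
print(extract_square_list(sample))  # <li>a<ul><li>b</li></ul></li>
```

The depth counter is the part a plain regex cannot express: the inner `</ul>` only decrements it, and extraction ends when it returns to zero.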

4 Answers

1

I always use XPath to do things like that.
Use an XPath that will extract the node and then you can fetch the InnerHTML from that node. Very clean, and the right tool for the job.

Additional details: The HAP Explorer is a nice tool for getting the XPath you need. Copy/paste the HTML into HAP Explorer, navigate to the node of interest, copy/paste the XPath for that node. Put that XPath string in a string resource, fetch it at runtime, apply it to the HTML document to extract the node, fetch the desired information from the node.
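
To make the XPath idea concrete, here is a minimal sketch. It is not HAP: it swaps in Python's standard `xml.etree.ElementTree`, which supports only a small XPath subset and, unlike HAP, requires the input to be well-formed XML. The sample markup and the helper names `inner_markup` and `extract_square_list` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; ElementTree needs well-formed markup.
SAMPLE = (
    "<html><body>"
    '<ul type="square"><li>one</li><li>two<ul><li>nested</li></ul></li></ul>'
    '<ul type="disc"><li>other</li></ul>'
    "</body></html>"
)


def inner_markup(node):
    """Return the markup between a node's start and end tags."""
    parts = [node.text or ""]
    for child in node:
        # tostring() also emits the text that trails each child element
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def extract_square_list(markup):
    root = ET.fromstring(markup)
    # XPath: the first <ul> anywhere below the root whose type is "square"
    node = root.find(".//ul[@type='square']")
    return None if node is None else inner_markup(node)


print(extract_square_list(SAMPLE))
# <li>one</li><li>two<ul><li>nested</li></ul></li>
```

The nested `<ul>` comes back intact because the node boundaries come from the parsed tree, not from text matching.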

Task
-1

I agree that an HTML parser is the correct way to solve this problem. But, to humor you and answer your original question purely for academic interest, I propose this:

/<[Uu][Ll] +type=("square"|square) *>((.*?(<ul[^>]*>.*<\/ul>)?)*)<\/[Uu][Ll]>/s

I'm sure there are cases where this will fail, but I can't think of any, so please suggest more.

Let me restate that I don't recommend you use this in your project. I am merely doing this out of academic interest, and as a demonstration of WHY a regex that parses html is bad and complicated.

Robert Martin
-1

If you really want one:
@<ul type="square">(.*?)</ul>@im

3on
  • Instead of using the `m` flag, you probably meant the `s` flag, which will cause `.` to match line break chars as well. – Bart Kiers Sep 26 '11 at 17:46
  • No. Except in Ruby, the `m` flag causes `^` to match at the start of each line and `$` at the end of each line. The `.` will _not_ match line break chars. That is what the `s` flag is for (with it, `.` matches line breaks as well). This goes for Perl, Python, Java, JavaScript and PHP, to name a few. – Bart Kiers Sep 27 '11 at 06:20
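
A quick way to check the flag behavior described in these comments, sketched in Python, where `re.MULTILINE` corresponds to `m` and `re.DOTALL` to `s`. The sample string is hypothetical.

```python
import re

# Hypothetical input with line breaks inside the list
HTML = '<ul type="square">\n<li>a</li>\n</ul>'
PATTERN = r'<ul type="square">(.*?)</ul>'

# "m" only changes how ^ and $ behave, so "." still stops at line breaks
with_m = re.search(PATTERN, HTML, re.IGNORECASE | re.MULTILINE)

# "s" lets "." match line breaks, so the pattern can span several lines
with_s = re.search(PATTERN, HTML, re.IGNORECASE | re.DOTALL)

print(with_m)                  # None
print(repr(with_s.group(1)))   # '\n<li>a</li>\n'
```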
-2

Regular expressions should not be used to parse HTML!

This will definitely not work:

<ul type="square">(.*)</ul>
Community
soniiic
  • 2
    -1 Don't show what doesn't work, that isn't an answer. There is an infinite list of things that don't work. – Oskar Kjellin Sep 26 '11 at 15:13
  • the problem is that yes, it works literally, but 99.9% sure the poster wants to get what's in the `<ul>` block, so if it's a ul inside another ul, it shouldn't end with the inner `</ul>` – Rodolfo Sep 26 '11 at 15:29
  • you're saying `\n` doesn't match the regular expression "." I don't get it ? – Rodolfo Sep 26 '11 at 16:59
  • @Rodolfo, that is correct. In most regex implementations, `.` does not match line break characters. – Bart Kiers Sep 26 '11 at 17:45
  • 1
    @Rodolfo You can, though, set `RegexOptions.Singleline` and it should match it – Oskar Kjellin Sep 26 '11 at 19:58
  • If the starting `<ul>` is inside another `<ul>`, the greedy `(.*)` will match more than what you bargain for as it should return the very last `</ul>` that it finds. You can use balancing group definitions to ensure the regex follows tag hierarchies and ends at the correct corresponding `</ul>`. However... that has its own problems, namely the html must be valid xml which all too often it isn't. You can read more in my answer about balancing group definitions. I think it's a very overlooked feature that only .NET offers. – Sam Sep 26 '11 at 23:13
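
A small sketch of the nesting problem raised in these comments, in Python with a hypothetical input. Python's `re` has no balancing groups, so this only demonstrates the two failure modes, not the .NET fix: the lazy `(.*?)` stops at the inner `</ul>`, and the greedy `(.*)` runs on to the last `</ul>` in the document.

```python
import re

# Hypothetical input: a nested list followed by an unrelated second list
NESTED = (
    '<ul type="square"><li>a<ul><li>b</li></ul></li></ul>'
    '<ul><li>c</li></ul>'
)

lazy = re.search(r'<ul type="square">(.*?)</ul>', NESTED, re.DOTALL).group(1)
greedy = re.search(r'<ul type="square">(.*)</ul>', NESTED, re.DOTALL).group(1)

print(lazy)    # <li>a<ul><li>b</li>   (cut short at the inner </ul>)
print(greedy)  # <li>a<ul><li>b</li></ul></li></ul><ul><li>c</li>   (overshoots)
```

Neither result is the actual content of the outer list, which is `<li>a<ul><li>b</li></ul></li>`.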