0

I've tried to accomplish this with regex, but it doesn't seem to work at all. The same regex pattern worked like a charm in PHP and JavaScript, so I have no idea why it's not working in C#.

Here is my code sample:

        Regex mysReg = new Regex(@"<form[^>]*action=""do\.php""[^>]*>(.*)<\/form>", RegexOptions.IgnoreCase | RegexOptions.Multiline);

        MatchCollection form = mysReg.Matches(html);

If I remove the <\/form> part, the regex matches, but it doesn't capture the content inside the parentheses.

Now some of you will tell me to use "HtmlAgilityPack". I've tried to use it but, since I'm still unfamiliar with C#, I found it hard to work with, because no documentation came with it.

So is there any way to work around this problem?

Brian Tompsett - 汤莱恩
Desolator
  • Your regex is pretty close, but you need to change the `(.*)` to `([\S\s]*?)` (or `(.*?)` and add `RegexOptions.Singleline`). This is because the dot does not match a newline by default (see Justin's answer). Also, if there are multiple FORMs on the page, you need the `?` lazy quantifier, as I've shown here. – ridgerunner Mar 23 '11 at 16:31

3 Answers

3

Your (.*) isn't matching newlines. ([\S\s]*?) will work, or you can turn newline matching on with RegexOptions.Singleline.

However, as others have pointed out, you should be using something like the HTML Agility Pack instead of trying to use regex to parse HTML.
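For the regex route, the fix described above can be sketched roughly as follows. This is a minimal illustration, not a robust HTML parser: the class and method names (`FormRegexDemo`, `ExtractFormBodies`) are made up for this example, and the pattern is the one from the question with a lazy group and `RegexOptions.Singleline` swapped in.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class FormRegexDemo
{
    // Pattern from the question, with two changes:
    //  - (.*?) is lazy, so each match stops at the first </form>
    //  - Singleline makes '.' match '\n' as well
    private static readonly Regex FormPattern = new Regex(
        @"<form[^>]*action=""do\.php""[^>]*>(.*?)</form>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the captured inner content of every matching form.
    public static string[] ExtractFormBodies(string html)
    {
        MatchCollection matches = FormPattern.Matches(html);
        var bodies = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            bodies[i] = matches[i].Groups[1].Value;
        }
        return bodies;
    }
}
```

Note that `RegexOptions.Multiline`, used in the question, only changes how `^` and `$` behave; it has no effect on what the dot matches, which is why the original pattern fails on forms that span several lines.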

Justin Morgan - On strike
  • +1 However, if there is more than one matching form and there is stuff between them, then the `.*` (or `[\S\s]*`) needs to have the lazy quantifier (i.e. `.*?` or `[\S\s]*?`). – ridgerunner Mar 23 '11 at 15:55
  • @ridgerunner - Agreed, and edited. This of course has potential problems as well, which is why regex is the wrong tool for this. – Justin Morgan - On strike Mar 23 '11 at 16:30
  • Yes, the lazy-dot-star method will certainly NOT work for elements which can be nested, but FORM elements are not nestable. The only thing I can think of that will trip this one up (for a valid document) is a `<form>` or `</form>` appearing inside `CDATA` spans such as comments, SCRIPTs, STYLEs, `<?...?>` and tag attributes. – ridgerunner Mar 23 '11 at 16:41
  • I just set `RegexOptions.Singleline` and it worked like a charm! Thank you so much for the solution... I thought it was another security thing implemented by the mighty Microsoft. – Desolator Mar 24 '11 at 06:08
2

Instead of regex, use the HTML Agility Pack to parse the document. You might not be comfortable with it, but that's the way to go.

The download comes with examples - projects that do all sorts of things, so you can read through the code to see how they were accomplished.

You will then be able to query it in XPath syntax, and it exposes an interface similar to XmlDocument.
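As a rough sketch of what such an XPath query might look like for the form in the question: this assumes the HtmlAgilityPack package is referenced, and the class and method names (`FormAgilityDemo`, `ExtractFormBodies`) are invented for the example. The `ElementsFlags` line is a precaution, since some releases of the library have been reported to treat `<form>` as an empty element, which would leave its content outside the node.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration only; requires the HtmlAgilityPack package.
public static class FormAgilityDemo
{
    // Returns the inner HTML of every <form> whose action attribute is "do.php".
    public static List<string> ExtractFormBodies(string html)
    {
        // Precaution: make sure <form> is parsed as a normal container element.
        // Removing a key that is not present is harmless.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var bodies = new List<string>();
        HtmlNodeCollection forms =
            doc.DocumentNode.SelectNodes("//form[@action='do.php']");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (forms == null)
        {
            return bodies;
        }

        foreach (HtmlNode form in forms)
        {
            bodies.Add(form.InnerHtml);
        }
        return bodies;
    }
}
```

The XPath comparison on the attribute value is case-sensitive, so a page that writes `action="Do.php"` would need a different expression.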

See here for a compelling reason to not use RegEx for parsing HTML.

Oded
  • @ermac2014 - There are examples provided in the download. And with intellisense you can explore the objects and see how to work with them. It is not difficult to learn. – Oded Mar 23 '11 at 15:16
  • @ermac2014: maybe that should be your next question, then. If the right tool for the job is one you don't know how to use, you should learn to use that tool - not bodge the job. – Dan Puzey Mar 23 '11 at 15:17
  • @Oded - do you have an example of how to accomplish this or something similar? I'm really lost. – Desolator Mar 23 '11 at 15:22
  • @Oded - dude, it's only one example and it doesn't really show how this thing works. I tried to follow it but I really have no idea what is going on. – Desolator Mar 23 '11 at 15:30
  • @ermac2014 - Have you downloaded the source pack? There are many example projects included. – Oded Mar 23 '11 at 15:35
  • Yes, I downloaded that and there are no examples inside. Anyway, thanks for the help. I have already got it working with the mighty regex :P – Desolator Mar 24 '11 at 06:10
1

I was playing with this in RegexBuddy and got @"<form[^>]*action=""do\.php""[^>]*>([\s\S]*)<\/form>" to work with my (hastily put together) sample data.
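One caveat with this pattern, raised in the comments on this page: `[\s\S]*` is greedy, so on a page with more than one form the capture runs through to the last `</form>`. A small sketch comparing it with the lazy variant (the class name, constants, and sample input here are made up for illustration):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class GreedyVsLazyDemo
{
    // Greedy: the group extends to the LAST </form> in the input.
    public const string Greedy =
        @"<form[^>]*action=""do\.php""[^>]*>([\s\S]*)</form>";

    // Lazy: the group stops at the FIRST </form> after the opening tag.
    public const string Lazy =
        @"<form[^>]*action=""do\.php""[^>]*>([\s\S]*?)</form>";

    // Returns the first captured body, or an empty string if nothing matches.
    public static string FirstBody(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

With input containing a `do.php` form followed by a second, unrelated form, the greedy pattern captures everything up to the second form's closing tag, while the lazy one captures only the first form's content.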

QuinnG