This is my HTML:

<div class="bla">
    <div>
        bla bla
    </div>
    <div>
        bla bla 2
    </div>
    <p></p>
</div>

I want to get the content of the class="bla" div with a C# regex. I've tried:

MatchCollection postCollection = Regex.Matches(html, "<div class=\"bla\".*?>(.*?)<\\/div>");

But it only gives me this portion of content:

<div class="bla">
    <div>
        bla bla
    </div>

The match ends as soon as the first inner div closes.

user3857731
  • You should really use an HTML parser like Html Agility Pack for dealing with HTML, rather than regexes. [Here is the classic answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) explaining why. – RB. Jan 15 '16 at 12:17
  • Why not use a DOM parser instead? – David Jan 15 '16 at 12:17
  • Well, I am parsing Facebook, and these tags are inside a hidden code tag. Html Agility Pack can't see it. – user3857731 Jan 15 '16 at 12:18
  • After the HTML loads, the page dynamically creates more HTML from JavaScript. – user3857731 Jan 15 '16 at 12:19
  • @user3857731: `"html agility pack cant see it"` - If your code can't even *see* the markup, then how do you expect a regular expression to work either? It really sounds like you're inventing a problem that was solved long ago. If you have HTML to parse, use an HTML parser. – David Jan 15 '16 at 14:04

2 Answers

Use a DOM parser; regex is not suitable for this: https://www.nuget.org/packages/HtmlAgilityPack

But since you mention that the page is generated at runtime with JavaScript, a plain DOM parser is not enough on its own. You will need a browser-like component, for example Selenium.

Here you can find some examples: http://scraping.pro/example-of-scraping-with-selenium-webdriver-in-csharp/
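
For markup that is already available as a string (like the `html` variable in the question), the DOM-parser route could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is installed; the class name `DomExtractor` and the method `GetBlaInnerHtml` are hypothetical names chosen for illustration. It does not address JavaScript-generated content, which still needs a browser-like component first.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class DomExtractor
{
    // Hypothetical helper: returns the inner HTML of the first
    // <div class="bla"> in the document, or null if there is none.
    public static string GetBlaInnerHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath: any div whose class attribute is exactly "bla".
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='bla']");

        // SelectSingleNode returns null when nothing matches.
        return node?.InnerHtml;
    }
}
```

Calling `DomExtractor.GetBlaInnerHtml(html)` on the question's sample should return both inner divs and the empty paragraph, because the parser tracks nesting instead of stopping at the first closing tag.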

RvdK
  • I am using a cron job for this purpose. Is it suitable to use navigator? – user3857731 Jan 15 '16 at 12:35
  • Also, I am sending cookies and logging in to Facebook automatically; just browsing the page won't help. – user3857731 Jan 15 '16 at 12:38
  • Cookies are not a problem; the API allows you to add cookies. I'm worried about your questions. Cron is a Unix/Linux tool, and Navigator is a browser that was discontinued in 2008. – RvdK Jan 15 '16 at 12:43
  • Good luck with Regex, see you again when Facebook changes a small thing in the HTML and your regex doesn't work again. – RvdK Jan 15 '16 at 12:46

As others have mentioned, you shouldn't use Regex for such cases. However, it is possible.

Here is my attempt at doing so:
(<div class="bla".*>([\w\s<>\/]*)<\/div>)

This surely needs more work and may be buggy, but it could point you in the right direction.

Asunez
  • This still has problems if you add more divs after it, though. And since you can't tell regex to stop after a given number of `</div>` tags has occurred, this might be quite tricky. – Asunez Jan 15 '16 at 12:53
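
On the nesting problem raised in the comment above: the .NET regex engine specifically supports balancing groups, which can count opening and closing tags. The following is only a sketch under narrow assumptions (plain `<div ...>` and `</div>` tags, no divs inside comments, scripts, or attribute values), not a robust HTML parser. The names `BalancedDivRegex`, `ExtractBlaContent`, and the group names `content` and `depth` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedDivRegex
{
    private const string Pattern =
        @"<div\s+class=""bla""[^>]*>" +   // outer opening tag
        @"(?<content>(?>" +
            @"<div\b[^>]*>(?<depth>)" +   // nested opening tag: push onto 'depth'
            @"|</div>(?<-depth>)" +       // nested closing tag: pop (fails if stack is empty)
            @"|(?!<div\b|</div>)." +      // any other single character
        @")*)" +
        @"(?(depth)(?!))" +               // fail if a nested div is still open
        @"</div>";                        // closing tag that matches the outer div

    // Hypothetical helper: returns the text between the outer tags,
    // or null when no balanced match is found.
    public static string ExtractBlaContent(string html)
    {
        // Singleline lets '.' match newlines inside the div.
        Match m = Regex.Match(html, Pattern, RegexOptions.Singleline);
        return m.Success ? m.Groups["content"].Value : null;
    }
}
```

Unlike the lazy `(.*?)` in the question, this does not stop at the first `</div>`, and unlike a greedy `.*` it does not run on into sibling divs that follow. It remains fragile against real-world markup, so a parser is still the safer choice.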