c# regex get html tag content

Question

This is my HTML:

<div class="bla">
    <div>
        bla bla
    </div>
    <div>
        bla bla 2
    </div>
    <p></p>
</div>

I want to get the content of the class="bla" div with a C# regex. I've tried:

MatchCollection postCollection = Regex.Matches(html, "<div class=\"bla\".*?>(.*?)<\\/div>");

But it only gives me this portion of content:

<div class="bla">
    <div>
        bla bla
    </div>

The match ends as soon as the first inner div closes.

You should really use an HTML parser like Html Agility Pack for dealing with HTML, rather than regexes. [Here is the classic answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) explaining why. – RB., Jan 15 '16 at 12:17
Well, I am parsing Facebook, and these tags are inside a hidden code tag. Html Agility Pack can't see it. – user3857731, Jan 15 '16 at 12:18
@user3857731: `"Html Agility Pack can't see it"` - If your code can't even *see* the markup, then how do you expect a regular expression to work either? It really sounds like you're inventing a problem that was solved long ago. If you have HTML to parse, use an HTML parser. – David, Jan 15 '16 at 14:04

score 2 · Answer 1 · answered Jan 15 '16 at 12:26

Use a DOM parser; regex is not suitable for this: https://www.nuget.org/packages/HtmlAgilityPack

But as you mention that the page is generated at runtime with JavaScript, a plain parser is not a suitable option either. You will need a browser-like component, for example Selenium.

Here you can find some examples: http://scraping.pro/example-of-scraping-with-selenium-webdriver-in-csharp/
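For markup that is already present in the downloaded string, the DOM-parser route might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`BlaExtractor`, `GetBlaInnerHtml`) are made up for illustration, and the XPath only matches a `class` attribute that is exactly `bla`.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class BlaExtractor
{
    // Hypothetical helper: returns the inner HTML of the first <div class="bla">,
    // or null when no such element exists.
    public static string GetBlaInnerHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode node = doc.DocumentNode.SelectSingleNode("//div[@class='bla']");
        return node?.InnerHtml;
    }

    public static void Main()
    {
        string html =
            "<div class=\"bla\"><div>bla bla</div><div>bla bla 2</div><p></p></div>";

        // Nested divs are part of the parsed subtree, so they need no special handling.
        Console.WriteLine(GetBlaInnerHtml(html));
    }
}
```

If the markup is injected by JavaScript after the page loads, the same helper could be fed the rendered source obtained from a browser-driving tool instead of the raw download.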

answered Jan 15 '16 at 12:26

RvdK


I am using a cron job for this purpose. Is it suitable to use a navigator? – user3857731 Jan 15 '16 at 12:35
Also, I am sending cookies to log in to Facebook automatically; just browsing the page won't help. – user3857731 Jan 15 '16 at 12:38
Cookies are not a problem; the API allows you to add cookies. I'm worried about your questions. Cron is a Unix/Linux tool, and Navigator is a browser that was discontinued in 2008. – RvdK Jan 15 '16 at 12:43
Good luck with Regex, see you again when Facebook changes a small thing in the HTML and your regex doesn't work again. – RvdK Jan 15 '16 at 12:46

score 0 · Answer 2 · answered Jan 15 '16 at 12:52

As others have mentioned, you shouldn't use Regex for such cases. However, it is possible.

Here is my attempt at doing so:
(<div class="bla".*>([\w\s<>\/]*)<\/div>)

This surely needs more work and may be buggy, but it could point you in the right direction.

Regex demo: here

answered Jan 15 '16 at 12:52

Asunez


This still has problems if you add more divs after it, though. And since you can't tell regex to stop after a given number of `</div>` has occurred, this might be quite tricky. – Asunez Jan 15 '16 at 12:53
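The nesting problem described in this comment is the case .NET's balancing groups are meant for: the regex engine can push a capture for each nested opening tag and pop one for each closing tag. The following is only a sketch under simplifying assumptions (no `>` inside attribute values, no divs inside comments or scripts, closing tags written exactly as `</div>`); the names `NestedDivRegex`, `ExtractBlaContent`, `depth`, and `content` are illustrative and appear nowhere in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedDivRegex
{
    // IgnorePatternWhitespace allows the layout and the # comments inside the pattern.
    private static readonly Regex BlaDiv = new Regex(@"
        <div\s+class=""bla""[^>]*>       # outer opening tag
        (?<content>
          (?>
              <div\b[^>]*> (?<depth>)    # nested opening tag: push onto the depth stack
            | </div> (?<-depth>)         # nested closing tag: pop, fails if the stack is empty
            | (?!</?div\b) .             # any other character
          )*
        )
        (?(depth)(?!))                   # fail unless every nested div was closed
        </div>                           # outer closing tag",
        RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns the content of the first <div class="bla">, or null.
    public static string ExtractBlaContent(string html)
    {
        Match m = BlaDiv.Match(html);
        return m.Success ? m.Groups["content"].Value : null;
    }

    public static void Main()
    {
        string html =
            "<div class=\"bla\">\n  <div>bla bla</div>\n  <div>bla bla 2</div>\n  <p></p>\n</div>\n" +
            "<div>after</div>";

        // Prints both inner divs and the <p>, but nothing from the trailing div.
        Console.WriteLine(ExtractBlaContent(html));
    }
}
```

This keeps the match balanced, but it is still subject to the fragility the other commenters warn about: any markup that breaks the assumptions above will break the pattern.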
