I was wondering if somebody could help me use string split to get all occurrences of text in between <p></p> tags in an HTML document?
0
-
If you mean to keep your HTML valid, have a look at this recent SO question: http://stackoverflow.com/questions/1714764/c-truncate-html-safely-for-article-summary – Abel Nov 18 '09 at 15:50
-
Actually: do you mean to keep it (X)HTML-sane? – Abel Nov 18 '09 at 15:52
-
oooh, hello again Abel :) thanks for that link. i found some code on there that got me thinking a lot more, and i think by the time i'm done adding/editing some of that code it might look completely different lol. i'll add the solution to my problem to the end of my question if it works and hope that somebody will find a use for it :P – jay_t55 Nov 18 '09 at 15:57
-
um, unfortunately, i have no control over how-well-formatted the html/xml/xhtml documents are since i am not (mainly) the one who creates them. the ones that i make (probably about 1% of them) are well-formatted/valid docs – jay_t55 Nov 18 '09 at 15:59
5 Answers
6
Sounds like you want to look at the HTML Agility Pack. It works very well on dodgy HTML documents!

RichardOD
-
lol..are you one of the people who made that? btw thank you very much for the link i'm downloading it now it sounds awesome! – jay_t55 Nov 18 '09 at 15:47
-
2
That's rather a large problem for String.Split(). I'd recommend using an XML parser instead.
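
To make the parser suggestion concrete, here is a minimal sketch. It is written in Python with the standard-library `html.parser` module, purely as an illustration; it is not .NET code and not the answerer's own. The names `ParagraphCollector` and `extract_paragraphs` are made up for this example. It assumes that a new `<p>` implicitly closes any paragraph that is still open, which is how browsers treat it.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text found inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraph texts
        self._parts = None     # text pieces of the open <p>, or None if none is open

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attributes are ignored here.
        if tag == "p":
            self._finish()     # a new <p> closes any paragraph still open
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._finish()

    def handle_data(self, data):
        # Keep text only while a <p> is open.
        if self._parts is not None:
            self._parts.append(data)

    def _finish(self):
        if self._parts is not None:
            self.paragraphs.append("".join(self._parts).strip())
            self._parts = None


def extract_paragraphs(html):
    """Return the text of every <p> in the given HTML string."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()      # flush any text the parser is still buffering
    collector._finish()    # keep a trailing <p> that was never closed
    return collector.paragraphs
```

Because the parser handles the tag syntax, attributes on the opening tag and a missing closing tag do not need special-casing, which is the main thing String.Split() cannot give you.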

harriyott
-
it doesn't even have to get the whole lot. maybe just four or five occurrences? would that be cool? ...xml doesn't like me :( – jay_t55 Nov 18 '09 at 15:43
-
Ah, ok. Is there any chance the opening <p> tag will have attributes? – harriyott Nov 18 '09 at 15:45
-
there is a chance. but if it's too hard when it does, then i will settle for just <p> for now :) – jay_t55 Nov 18 '09 at 16:00
2
Take a look at regular expressions. String split is not a good solution.

rerun
-
Just say NO to using RegEx for parsing HTML. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Greg Nov 18 '09 at 16:07
-
1
For the benefit of the folks who suggest RegEx, can I just point to this answer:
RegEx match open tags except XHTML self-contained tags (Stack Overflow)
Just say no.
0
i've been doing this manually: just traverse the string in a loop and count the <p> tags. if you find one <p, then another <p, and another, and then you suddenly hit a </p>, you must wait until you find the 3rd </p>, and there you have it
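
The counting idea above could be sketched roughly like this. It is an illustrative Python version, not the code actually used here, and `outer_paragraphs` and `_is_p_open` are made-up names. It assumes closing tags are written exactly as `</p>` and that no attribute value contains a `>` character.

```python
def _is_p_open(lower, i):
    # "<p" must be followed by ">" or whitespace, so "<pre" or "<param" do not count.
    nxt = lower[i + 2:i + 3]
    return nxt == ">" or nxt.isspace()


def outer_paragraphs(html):
    """Return the contents of each outermost <p>...</p> pair."""
    lower = html.lower()       # search case-insensitively, slice the original
    results = []
    depth = 0                  # how many <p> tags are currently open
    content_start = 0          # index just past the outermost opening tag
    pos = 0
    while True:
        open_at = lower.find("<p", pos)
        close_at = lower.find("</p>", pos)
        if close_at == -1:
            break              # no closing tag left, nothing more to complete
        if open_at != -1 and open_at < close_at:
            if _is_p_open(lower, open_at):
                tag_end = lower.find(">", open_at)
                if depth == 0:
                    content_start = tag_end + 1
                depth += 1
                pos = tag_end + 1
            else:
                pos = open_at + 2   # some other tag starting with "<p"; skip it
        else:
            if depth > 0:
                depth -= 1
                if depth == 0:
                    # The matching close of the outermost <p> was reached.
                    results.append(html[content_start:close_at])
            pos = close_at + 4
    return results
```

For example, `outer_paragraphs('<p>a<p>b</p>c</p>')` returns `['a<p>b</p>c']`: the inner pair is kept as part of the outer paragraph's content, and the result is only emitted once the count drops back to zero.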

Omu