Scraping HTML outside of found element

Question

I use Simple HTML DOM Parser to match elements and extract the content required. However, what I would like to do is get all of the HTML outside of the matched element.

Say the HTML is

<body>
<div id="otherContent"></div>
<div id="content"></div>
<div id="otherContent2"></div>
</body>

I want to be able to get everything outside of the #content div.

Can Simple HTML DOM Parser do this? I guess regex would be possible, but a more elegant solution like an HTML parser would be great.

Please share what you have tried. A DOM parser is what you're looking for; you should never use regex to parse HTML. – Jay Blanchard, Feb 02 '15 at 13:21
@JayBlanchard saying `never use regex for html` is just blindly following some "rules", just like `always use Dependency Injection`. There are situations where regex is faster and better (especially since DOM parsers can trash HTML code badly if it doesn't have perfect syntax). This is not one of them, though, but don't just say `never` – Forien, Feb 02 '15 at 13:26

Igor Adamenko · Answer 1 · 2015-02-02T13:55:44.303

0

Yes, Simple HTML DOM Parser can do this. For example:

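include 'simple_html_dom.php'; // assumes the library file sits next to this script
$html = str_get_html('<your_html_here>'); // parse the string into a DOM object
$content = $html->find('#content', 0); // index 0 returns the first match instead of an array
$innertext = $content->innertext; // markup inside #content, without the #content tag itself
$plaintext = $content->plaintext; // text only
$outertext = $content->outertext; // markup of #content including its own tag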

You can also remove an element from the markup:

$html = str_get_html('<your_html_here>');
$html->find('#content', 0)->outertext = ''; // blank out the #content node
$rest = $html->save(); // all of the markup except #content

Read more in the manual.
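
For comparison, here is a minimal sketch of getting the markup before and after #content as two separate strings. It uses PHP's built-in DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM. The helper name htmlAroundElement is hypothetical, and the sketch assumes that "outside" means the siblings of the matched element under the same parent; markup belonging to ancestors is not included.

```php
<?php
// Sketch only: htmlAroundElement() is a hypothetical helper, not part of any library.
// It uses PHP's built-in DOM extension rather than Simple HTML DOM.
function htmlAroundElement(string $source, string $id): ?array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them for imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumes $id contains no double quotes.
    $xpath = new DOMXPath($doc);
    $target = $xpath->query(sprintf('//*[@id="%s"]', $id))->item(0);
    if ($target === null) {
        return null; // no element with that id
    }

    $before = '';
    $after = '';
    $seenTarget = false;
    // Walk the siblings of the target and serialize each one,
    // skipping the target itself.
    foreach ($target->parentNode->childNodes as $node) {
        if ($node->isSameNode($target)) {
            $seenTarget = true;
        } elseif ($seenTarget) {
            $after .= $doc->saveHTML($node);
        } else {
            $before .= $doc->saveHTML($node);
        }
    }

    return ['before' => $before, 'after' => $after];
}

// Usage with the markup from the question.
$source = '<body>'
    . '<div id="otherContent"></div>'
    . '<div id="content"></div>'
    . '<div id="otherContent2"></div>'
    . '</body>';

$parts = htmlAroundElement($source, 'content');
echo $parts['before'], PHP_EOL;
echo $parts['after'], PHP_EOL;
```

If only the combined remainder is needed, blanking the node's outertext as shown above is likely the shorter route; the split version may help when the two halves have to be handled separately.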

edited Feb 02 '15 at 13:55

answered Feb 02 '15 at 13:35

Igor Adamenko

$outertext = $content->outertext; is not correct. It just includes the actual matched tag in the markup, whereas innertext doesn't return that. What I'm looking for is a way to get all the HTML before #content and after it – user2760338 Feb 08 '15 at 05:03
@user2760338 The second part of the code does what you want, doesn't it? If you set outertext to "", you remove the #content node from $html. All HTML before and after #content will be in $html. – Igor Adamenko Feb 09 '15 at 18:26

score 0 · Answer 2 · answered Feb 02 '15 at 14:15

You can use phpQuery (the library is big, but very useful). Here are examples: https://code.google.com/p/phpquery/

bordeux

Well, there are [many options for parsing html](http://stackoverflow.com/a/3577662/3110638) :) – Jonny 5 Feb 02 '15 at 14:18
