2

My goal is to crawl blogs and fetch their latest posts. I already have a script that does this and stores the content as HTML in a database.
Everything works properly except for interference with my template: if the fetched HTML contains, for example, an extra </div> or an unclosed tag, it breaks my entire page.

Question: Is there any way to confine the external content to a single div, so that if the external code has problems, they affect only that div box and not the entire template?

Link to correct template
Link to damaged template

Thanks in advance

Hossein Shahsahebi

2 Answers

1

You can simplify this by using a library that repairs the malformed markup that was scraped.

You can do it like this:

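<?php
// Scraped fragment with a stray closing </div> at the end.
$content = '<div><p>I am a bad guy, and i am gonna put an additional div at the end.</p></div></div>';

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
// Parse as a fragment: do not add implied <html>/<body> or a default doctype.
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();
// Serialize the repaired tree back to an HTML string.
$content = trim($dom->saveHTML());

echo $content;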

It will return:

<div><p>I am a bad guy, and i am gonna put an additional div at the end.</p></div>
Iago
  • Thanks Iago, it works correctly, but it can't handle UTF-8 encoding and the results look like this: `…Ø®Ø§Ù„ÙØ§Ù† حمله روسیه به داعش ` – Hossein Shahsahebi Oct 18 '15 at 08:23
  • Yes. To avoid duplicating it here, I recommend you look at this question: http://stackoverflow.com/questions/11309194/php-domdocument-failing-to-handle-utf-8-characters – Iago Oct 18 '15 at 08:25
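
For the encoding problem raised in the comments, here is a minimal sketch of one possible workaround. It is an assumption on my part, not necessarily what the linked question recommends: convert every non-ASCII character to a numeric entity before parsing, so DOMDocument never has to guess the charset. The helper name `repairHtmlFragment` is hypothetical, and the sketch assumes the mbstring extension is available and the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not part of the answer above): repairs a scraped
// HTML fragment while keeping non-ASCII text readable.
function repairHtmlFragment(string $html): string
{
    // Turn every character above U+007F into a numeric entity (&#NNNN;).
    // The parser then only sees ASCII, so its charset guess no longer matters.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    // Collect parse warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($ascii, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return trim($dom->saveHTML());
}

// Fragment with Persian text and a stray closing </div>.
$fixed = repairHtmlFragment('<div><p>حمله روسیه</p></div></div>');
echo $fixed, PHP_EOL;
```

The serialized result may contain numeric entities rather than raw UTF-8 characters; browsers render both the same way.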
1

As far as I'm aware, the only safe way to ensure the content doesn't affect anything else on your page is to put it in an iframe. Anything else means injecting it into your page, so you risk the things you've mentioned, such as unclosed tags, as well as style tags that override your CSS, potentially malicious JS, etc., unless you do some serious parsing and error correction. Some of this is done by things like jQuery's AJAX function, but if you can't risk anything at all, I'd go with an iframe that displays a page that renders your scraped content.
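
A minimal sketch of the iframe idea, under some assumptions: instead of a separate page, it uses the iframe's `srcdoc` attribute (a variant of what is described above), adds an empty `sandbox` attribute so scripts in the scraped content do not run, and uses a hypothetical helper name `renderIsolatedFrame` and a hypothetical `post-box` wrapper class.

```php
<?php
// Hypothetical helper: wraps untrusted HTML in a sandboxed iframe so that
// broken tags, styles, and scripts stay inside the frame's own document.
function renderIsolatedFrame(string $untrustedHtml, string $title = 'External post'): string
{
    $flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;
    // Escape the markup so it is safe inside a double-quoted attribute value.
    $srcdoc = htmlspecialchars($untrustedHtml, $flags, 'UTF-8');
    $safeTitle = htmlspecialchars($title, $flags, 'UTF-8');

    // An empty sandbox attribute disables scripts, forms, and navigation
    // of the parent page from inside the frame.
    return '<iframe sandbox title="' . $safeTitle . '" srcdoc="' . $srcdoc . '"></iframe>';
}

// Scraped content with a stray </div> and an inline script.
$scraped = '<div><p>Broken post</p></div></div><script>alert(1)</script>';
echo '<div class="post-box">', renderIsolatedFrame($scraped), '</div>', PHP_EOL;
```

Because the scraped markup only appears in escaped form inside the attribute, a stray closing tag cannot close any element of the surrounding template. The frame still needs a width and height set in your own CSS.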

Chris Disley