
I take HTML in as a string and then parse it to rewrite every href link. This works; however, when the HTML page contains JS script tags (i.e. <script>), they get removed! For example, this line:

<script type="text/javascript" src="/js/jquery.js"></script>

gets changed to:

[removed][removed] 

However, I would like to keep everything in. This is my function:

    function parse_html_code($code, $code_id) {
        libxml_use_internal_errors(true);

        $xml = new DOMDocument();
        $xml->loadHTML($code);

        foreach ($xml->getElementsByTagName('a') as $link) {
            $link->setAttribute('href', CLK_BASE."clk.php?i=$code_id&j=" . $link->getAttribute('href'));
        }

        return $xml->saveHTML();
    }

I appreciate any help on this.

Yahel
Abs
  • DOM will not remove any tags on its own or insert `[removed]` markers anywhere. Please provide a reproducible example that illustrates the problem. – Gordon Mar 20 '11 at 13:45
  • X-Ref: [PHP Headless Browser?](http://stackoverflow.com/questions/6578132/php-headless-browser) – hakre Jun 25 '13 at 15:39

1 Answer


CodeIgniter's bogus anti-XSS ‘feature’ is mauling your script's input before DOMDocument gets a look at it. Script tags and various other strings will be removed, replaced with “[removed]” or otherwise messed about with for no good reason. See the system/libraries/Security.php module for the full embarrassing details.

To turn off this misguided feature, set $config['global_xss_filtering'] = FALSE in your application's config.php. You'll then have to make sure your script actually handles string escaping properly, of course (e.g. always HTML-escaping user input when including it in a page). But you have to do that anyway; anti-XSS filtering doesn't fix your text-processing problems, it just obscures them.
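As a rough illustration of the output escaping mentioned above, here is a minimal sketch. The helper name `escape_html` and the `$comment` variable are made up for this example and are not part of CodeIgniter or the original code; the UTF-8 charset is also an assumption.

```php
<?php
// Hypothetical helper (not from the original post): escape a string so it is
// safe in HTML text content and in quoted attribute values.
function escape_html($text) {
    // ENT_QUOTES escapes both double and single quotes; UTF-8 is assumed.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Untrusted input is escaped at the point where it is written into the page,
// rather than being filtered on the way in.
$comment = '<script>alert("hi")</script>';
echo '<p>' . escape_html($comment) . '</p>';
```

The point of escaping at output time is that the stored or processed string stays intact, which is what the DOMDocument code in the question needs.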

$link->setAttribute('href', CLK_BASE."clk.php?i=$code_id&j=" . $link->getAttribute('href'));

You'll need to urlencode that getAttribute('href') (and potentially $code_id if it's not just numeric or something).
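A sketch of how the question's function might look with that encoding applied. The `CLK_BASE` value defined here is a placeholder assumption (the original defines it elsewhere), and `libxml_clear_errors()` is an optional addition rather than part of the fix.

```php
<?php
// Placeholder value: the original code relies on a CLK_BASE constant
// defined elsewhere, so this URL is only an assumption for the example.
if (!defined('CLK_BASE')) {
    define('CLK_BASE', 'http://example.com/');
}

function parse_html_code($code, $code_id) {
    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);

    $xml = new DOMDocument();
    $xml->loadHTML($code);

    foreach ($xml->getElementsByTagName('a') as $link) {
        // urlencode() both values so characters such as & and ? in the
        // original href cannot break out of the j= query parameter.
        $target = CLK_BASE . 'clk.php?i=' . urlencode($code_id)
                . '&j=' . urlencode($link->getAttribute('href'));
        $link->setAttribute('href', $target);
    }

    // Optional: discard the warnings collected while parsing.
    libxml_clear_errors();

    return $xml->saveHTML();
}
```

On the receiving side, PHP has already URL-decoded `$_GET['j']` by the time clk.php reads it, so no extra decoding step should be needed there.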

bobince
  • Thanks for another reason not to use CI. – Gordon Mar 20 '11 at 14:28
  • Thank you for the explanation and the tip about URL encoding! I feel like I have to defend CI since I am using it, but I am not going to, as I don't know much about it... yet. I am sure it has its pros, though. – Abs Mar 20 '11 at 14:32