
Using PHP, I want to remove all HTML attributes except:

the "src" attribute on "img" tags

and

the "href" attribute on "a" tags.

My input is an .html file that was converted from .doc and .docx.

The output should again be an HTML file, with the other attributes removed.

Please help.

Edit:

After trying Alexander's script as shown below, I see no changes when I open strip.html in a code editor:

<?php
$path = '/var/www/strip.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
if (false === ($elements = $xpath->query("//img"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('src' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

if (false === ($elements = $xpath->query("//a"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('href' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

$dom->saveHTMLFile($path);

?>
PHP Geany
  • http://stackoverflow.com/questions/2994448/regex-strip-html-attributes-except-src – Stefan Apr 16 '14 at 13:14
  • @stefan How can I make it work so that I input the HTML, click a button, and get asked to save the processed HTML file? – PHP Geany Apr 16 '14 at 13:21
  • That link should help you get started. I'm not going to architect your app for you, but once you have your HTML, however you get it, pass it through the regex(es). – Stefan Apr 16 '14 at 13:27

1 Answer

2

Use the DOMDocument class to parse the HTML. This version processes only "a" and "img" tags:

$path = '/path/to/file.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
//$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// Select every "img" element; query() returns false on an invalid expression.
if (false === ($elements = $xpath->query("//img"))) die('Error');

foreach ($elements as $element) {
    // Walk the attributes backwards so removals do not shift unvisited indexes.
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('src' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

// Same pass for "a" elements, keeping only "href".
if (false === ($elements = $xpath->query("//a"))) die('Error');

foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('href' !== $name) {
            $element->removeAttribute($name);
        }
    }
}

// Overwrite the original file with the cleaned markup.
$dom->saveHTMLFile($path);

Also, read why you can't parse [X]HTML with regex and take a look at useful xpath links.
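As an illustration of what XPath can do here, the unwanted attribute nodes can be selected directly rather than looping over each element's attributes. This is only a minimal sketch: the sample markup in `$html` is made up for the example, and `libxml_use_internal_errors(true)` is added on the assumption that Word-converted HTML may trigger parser warnings.

```php
<?php
// Made-up sample input standing in for the converted document.
$html = '<p class="x"><a href="/home" style="color:red">Home</a>'
      . '<img src="a.png" width="10" alt="A"></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every attribute node that is neither img/@src nor a/@href.
$unwanted = $xpath->query(
    "//@*[not((name() = 'src' and parent::img) or (name() = 'href' and parent::a))]"
);

// Copy the node list to an array before modifying the document.
foreach (iterator_to_array($unwanted) as $attr) {
    $attr->ownerElement->removeAttributeNode($attr);
}

$result = $dom->saveHTML();
echo $result;
```

The single expression replaces the per-tag checks, which may be easier to extend if more tag/attribute pairs need to be kept later.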

Update (processes all tags, keeping only "href" on "a" and "src" on "img"):

$path = '/path/to/file.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
//$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// "//*" selects every element in the document.
if (false === ($elements = $xpath->query("//*"))) die('Error');

foreach ($elements as $element) {
    // Walk the attributes backwards so removals do not shift unvisited indexes.
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;

        // Keep only img/@src and a/@href.
        if (('img' === $element->nodeName && 'src' === $name)
            || ('a' === $element->nodeName && 'href' === $name)
        ) {
            continue;
        }

        $element->removeAttribute($name);
    }
}

// Overwrite the original file with the cleaned markup.
$dom->saveHTMLFile($path);
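If the same logic is needed on a string rather than a file, it could be wrapped in a function along these lines. This is a sketch under assumptions, not part of the original answer: the function name `stripAttributes` and the `$allowed` map (tag name to list of attribute names to keep) are hypothetical, and the sample markup is invented.

```php
<?php
/**
 * Hypothetical helper: remove every attribute except those allowlisted per tag.
 *
 * @param string $html    Non-empty HTML markup.
 * @param array  $allowed Map of tag name => attribute names to keep.
 */
function stripAttributes(string $html, array $allowed): string
{
    // Collect parser warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $element) {
        $keep = $allowed[$element->nodeName] ?? [];

        // Iterate backwards because the attribute map shrinks on removal.
        for ($i = $element->attributes->length; --$i >= 0;) {
            $attr = $element->attributes->item($i);
            if (!in_array($attr->nodeName, $keep, true)) {
                $element->removeAttributeNode($attr);
            }
        }
    }

    return $dom->saveHTML();
}

// Usage with made-up markup.
$clean = stripAttributes(
    '<div id="x"><a href="/p" target="_blank">p</a><img src="i.png" alt="i"></div>',
    ['a' => ['href'], 'img' => ['src']]
);
echo $clean;
```

Passing the allowlist as an argument keeps the tag/attribute rules out of the loop, so other pairs can be kept without touching the function body.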
Alexander Yancharuk
  • This outputs the same file as the input. What I did: saved the code you gave as a PHP file, changed the $path value to the input file path, added a new variable $pathh for the output path, and changed $path to $pathh in the last line. I loaded the PHP file in the browser and got output in the $pathh directory identical to the input; the attributes were not removed. – PHP Geany Apr 17 '14 at 04:33
  • @PHPGeany Definitely, you did something wrong, because this code is ok. Proof: [codepad link](http://codepad.org/TIcGuAHw) – Alexander Yancharuk Apr 17 '14 at 05:59
  • @PHPGeany "Loaded the PHP file in the browser"? Do you have an HTTP server installed on your local machine? Did you try running the code from the console, like "php codesource.php"? Browsers have no built-in PHP interpreter, which is why opening PHP code directly in a browser does nothing. – Alexander Yancharuk Apr 17 '14 at 06:22
  • See the edit I made to my original question. I use a LAMP stack and ran it in a local browser as localhost/path.php – PHP Geany Apr 17 '14 at 06:23
  • Do you have any errors/warnings in error_log? Try changing `$dom->saveHTMLFile($path);` to `var_dump($dom->saveHTML());` to see whether there is a problem with removing the attributes. – Alexander Yancharuk Apr 17 '14 at 06:34
  • @PHPGeany I don't see any **"a"** or **"img"** tags in [tool.setinfotec.com/sof.php source](http://tool.setinfotec.com/sof.php). That's why you can't see any changes :) If you want to remove attributes from other tags like **"p"** or **"div"**, you need to modify the code. My code is just an example for **"a"** and **"img"** tags... – Alexander Yancharuk Apr 17 '14 at 06:41
  • There is an image tag on line 977 of the page source, and on many other lines – PHP Geany Apr 17 '14 at 06:43
  • After using var_dump it displays string(106966) " at the top and " at the bottom – PHP Geany Apr 17 '14 at 06:45
  • @PHPGeany My fault, sorry. There really are some "img" tags, but they have only **"src"** attributes. It seems the script is ok. – Alexander Yancharuk Apr 17 '14 at 06:46
  • Actually, what I asked for was to remove all attributes in all tags except the src attribute in img tags and the href attribute in a tags. Could you kindly provide the code for that, please? – PHP Geany Apr 17 '14 at 06:50
  • Thanks a ton, it works. If you don't mind, I would like to upload the HTML source file and have the browser ask me to save the processed file. Sorry for troubling you; it would mean a lot to me. – PHP Geany Apr 17 '14 at 07:08
  • @PHPGeany You're welcome :) Well, you can ask another question about how to implement a form with source and destination file paths – Alexander Yancharuk Apr 17 '14 at 07:17
  • http://stackoverflow.com/questions/23127036/how-to-implement-form-with-source-and-destination-files-paths – PHP Geany Apr 17 '14 at 07:26
  • That answer doesn't meet your requirements. I deleted it. – Alexander Yancharuk Apr 17 '14 at 10:09
  • OK, I stated my requirements clearly at the link below. Kindly look at it and answer if possible, please: http://stackoverflow.com/questions/23129320/php-dom-implementation – PHP Geany Apr 17 '14 at 10:32
  • Kindly look at http://stackoverflow.com/questions/23210247/dom-tag-and-attribute-rules-and-filtering – PHP Geany Apr 22 '14 at 11:37
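The debugging steps suggested in the comments above (look for parser warnings, then inspect the result instead of saving it straight to disk) might be sketched as follows. The sample markup is invented to resemble Word-converted HTML, and counting attributes before and after is just one assumed way to confirm that the script changed anything.

```php
<?php
// Made-up sample resembling Word-converted HTML.
$html = '<p class="MsoNormal"><img src="a.png" width="5"><o:p></o:p></p>';

// Collect parser messages so they can be inspected rather than lost.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

$xpath = new DOMXPath($dom);

// Total number of attributes in the document before stripping.
$before = $xpath->query('//@*')->length;

foreach ($xpath->query('//*') as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if (('img' === $element->nodeName && 'src' === $name)
            || ('a' === $element->nodeName && 'href' === $name)
        ) {
            continue;
        }
        $element->removeAttribute($name);
    }
}

// Total number of attributes left after stripping.
$after = $xpath->query('//@*')->length;

printf("attributes before: %d, after: %d\n", $before, $after);
```

If the two counts are equal, the input either had no removable attributes or the script did not run against the file that was inspected.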