Function for correcting broken links in a web page scraped by SimpleHtmlDom

Question

I am scraping HTML using SimpleHtmlDom, which returns the HTML as written. This leaves a lot of broken links to images and scripts, because they do not include the full URL of their resource location. Consequently the pages show with errors.

I have already corrected resource links that start with a slash, such as src="/, by replacing those letters with src="http://example.com/. It gets tricky when there is no leading slash in the link, because it is difficult to tell whether it is a local link or a full link.

For example:

<img src="images/pic.jpg">

I need to locate it and correct it to read:

<img src="http://example.com/images/pic.jpg">

Is there a regex or function that I can use to find src=" when there is no leading slash? It also needs to cater for all types of links, such as a href, script, etc.

In a nutshell, I need to find instances of src=" and determine whether the value stands alone or already includes http – WilliamK, Sep 10 '20 at 07:18
It would probably be a whole lot easier if you just inserted a `<base>` element with the proper URL set into the document … https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base – CBroe, Sep 10 '20 at 07:26
Sounds good, but to catch all resources, including the JS and CSS just inside the HEAD tag, it needs to be applied straight after the opening head tag, which unfortunately can be written in different ways, so another regex is needed to find that, I guess – WilliamK, Sep 10 '20 at 07:37
Getting an error when the script is hard-coded with http, because it adds src=http://example.com even though "src" is not there, which breaks the JavaScript – WilliamK, Sep 12 '20 at 05:07
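
A minimal sketch of the `<base>` approach suggested in the comments, assuming the page has an opening head tag that may carry attributes. `insertBaseTag` is a hypothetical helper name, not part of SimpleHtmlDom, and the regex is only a rough way to locate the tag:

```php
<?php
/**
 * Insert a <base> element right after the opening <head> tag so the
 * browser resolves relative URLs against $baseUrl.
 * Hypothetical helper; not part of SimpleHtmlDom.
 */
function insertBaseTag(string $html, string $baseUrl): string
{
    // Leave the document alone if it already declares a <base>.
    if (preg_match('/<base\b/i', $html)) {
        return $html;
    }

    $base = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '">';

    // Match <head> with or without attributes, but not <header>.
    // A callback avoids backreference surprises if the URL contains "$".
    $result = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function (array $m) use ($base) {
            return $m[0] . $base;
        },
        $html,
        1,      // only the first match
        $count
    );

    // No <head> found (or regex failure): return the input unchanged.
    return $count > 0 ? $result : $html;
}
```

Something like `echo insertBaseTag($html, 'http://example.com/');` would then let the browser do the resolving. Keep in mind that a `<base>` element also changes how fragment-only links such as `#top` resolve, so it may not suit every page.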

DarkBee · Accepted Answer · 2020-09-10T09:37:50.833

If you are using Simple HTML DOM, you can use the following snippet to adjust URLs:

<?php
    require 'simple_html_dom.php';

    class Parser {
        protected $url;
        protected $url_parts;

        protected $html_dom = null;
        protected $path = null;

        public function __construct($url) {
            $this->setUrl($url);
        }

        protected function setUrl($url) {
            $this->url = $url;
            $this->url_parts = parse_url($url);
            return $this;
        }

        protected function getUrl() {
            return $this->url;
        }

        protected function getUrlParts() {
            return $this->url_parts;
        }

        protected function getHtmlDom() {
            if ($this->html_dom === null) $this->html_dom = file_get_html($this->getUrl());
            return $this->html_dom;
        }

        /** ------------
            - path ends with /, e.g. foo/bar/foo/, so the full path for the relative image is foo/bar/foo
            - path doesn't end with /, e.g. foo/bar/foo, so the full path for the relative image is foo/bar
        ------------ **/
        public function getPath() {
            if ($this->path === null) $this->path = isset($this->getUrlParts()['path']) ? implode('/', explode('/', $this->getUrlParts()['path'], -1)) : '';
            return $this->path;
        }

        public function getHost() {
            return (isset($this->getUrlParts()['scheme']) ? $this->getUrlParts()['scheme'] : 'http').'://'.$this->getUrlParts()['host'];
        }

        public function adjust($tag, $attribute) {
            foreach($this->getHtmlDom()->find($tag) as $element) {
                if (parse_url($element->$attribute, PHP_URL_SCHEME) === null) {
                    // If the value starts with /, it is root-relative, so prepend only the scheme and host
                    if (strpos($element->$attribute, '/') === 0) {
                        $element->$attribute = $this->getHost().$element->$attribute;
                    }else{
                        $element->$attribute = $this->getHost().$this->getPath().'/'.$element->$attribute;
                    }
                }
            }

            return $this;
        }

        public function getHtml() {
            return (string)$this->getHtmlDom();
        }
    }

    $parser = new Parser('https://www.darkbee.be/stack/images/index.html');

    $parser->adjust('img', 'src')
           ->adjust('a', 'href')
           ->adjust('link', 'href')
           ->adjust('script', 'src');

    echo $parser->getHtml();

See last update. Switched it up to a class to avoid code repetition — DarkBee, Sep 10 '20 at 09:38
Unfortunately this fails with compressed html that uses gzip. The html can be decompressed but not from what file_get_html returns. However file_get_contents does work. For an example of the gibberish that can be shown, try using this script to scrape https://www.yahoo.com - file_get_contents does not work in this script — WilliamK, Sep 11 '20 at 22:41
Unfortunately // links also get clobbered and look like href="https://examplecom//maxcdn.bootstrapcdn.com/bootstrap/3.4.0/css/bootstrap.min.css" — WilliamK, Sep 12 '20 at 08:10
As you did not specify both requirements in your initial question, they are not covered by the code example. I'm quite sure you'll be able to adapt the code yourself. – DarkBee, Sep 14 '20 at 05:58
No. I tried but found that I cannot develop it further to cater for all variations of src links. I don't have the php skills for that and the gods keep derailing my questions so I have no more patience atm. — WilliamK, Sep 14 '20 at 06:14
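
For what it's worth, here is one way the checks in `adjust()` might be extended to cover the `//` links mentioned above. This is a sketch under assumptions, not the answer's code: `resolveUrl` is a hypothetical helper, and it is a simplified take on relative-reference resolution that does not collapse `../` or `./` segments.

```php
<?php
/**
 * Resolve $ref against the page URL $base.
 * Hypothetical helper; simplified, does not normalise "../" or "./".
 */
function resolveUrl(string $base, string $ref): string
{
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $host .= ':' . $parts['port'];
    }
    $path = $parts['path'] ?? '/';

    // Empty values and fragment-only links are left untouched.
    if ($ref === '' || $ref[0] === '#') {
        return $ref;
    }
    // Protocol-relative (//cdn.example.com/x.css): reuse the page's scheme.
    if (strpos($ref, '//') === 0) {
        return $scheme . ':' . $ref;
    }
    // Already has a scheme (http:, https:, mailto:, data:, javascript:, ...).
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $ref)) {
        return $ref;
    }
    // Root-relative (/images/pic.jpg): prepend scheme and host only.
    if ($ref[0] === '/') {
        return $host . $ref;
    }
    // Query-only (?page=2): keep the page's own path.
    if ($ref[0] === '?') {
        return $host . $path . $ref;
    }
    // Document-relative: replace the last path segment of the page URL.
    $pos = strrpos($path, '/');
    $dir = $pos === false ? '/' : substr($path, 0, $pos + 1);
    return $host . $dir . $ref;
}
```

Inside `adjust()` this could probably be called as `$element->$attribute = resolveUrl($this->getUrl(), $element->$attribute);`, but only after checking that the attribute is actually present as a non-empty string. Otherwise an inline script with no `src` would gain one.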

Timberman · Answer 2 · score 0 · answered Sep 10 '20 at 06:51

You can match <img src="(.+?)"> with a regex and check whether $1 starts with "http".

Good idea. But unfortunately not all usage of src will be like – WilliamK Sep 10 '20 at 07:06
What about ? @WilliamK – Timberman Sep 11 '20 at 08:11
Can you expand on this and provide some example code? – WilliamK Sep 13 '20 at 07:10
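
A rough sketch of how the regex idea could be expanded, assuming quoted `src` and `href` attributes and a single site prefix such as http://example.com/. `prefixRelativeLinks` is a hypothetical name, and the pattern is illustrative rather than a robust HTML parser:

```php
<?php
/**
 * Prefix relative src/href values with $prefix using a regex.
 * Hypothetical helper; only handles single- or double-quoted attributes.
 */
function prefixRelativeLinks(string $html, string $prefix): string
{
    // (?<![\w-]) stops "data-src" from matching; the value is captured
    // lazily up to the same quote character that opened it.
    $pattern = '/(?<![\w-])(src|href)\s*=\s*(["\'])(.*?)\2/i';

    return preg_replace_callback($pattern, function (array $m) use ($prefix) {
        $value = $m[3];
        // Leave empty values, fragments, scheme: URLs and //host URLs alone.
        if ($value === '' || $value[0] === '#'
            || preg_match('~^([a-z][a-z0-9+.-]*:|//)~i', $value)) {
            return $m[0];
        }
        // Join prefix and value with exactly one slash.
        $url = rtrim($prefix, '/') . '/' . ltrim($value, '/');
        return $m[1] . '=' . $m[2] . $url . $m[2];
    }, $html);
}
```

Two caveats: this treats every relative value as relative to the site root, and a regex cannot tell markup from text, so a `src="..."` string inside inline JavaScript would be rewritten as well.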
