
I am scraping HTML using SimpleHtmlDom which gets the HTML as written, resulting in a lot of broken links to images and scripts because they do not include the full url to their resource location. Consequently the pages show with errors.

I have already corrected root-relative links such as src="/ by replacing that prefix with src="http://example.com/, but it gets tricky when the link has no leading slash, because it is hard to tell whether it is a relative link or a full URL.

For example:

<img src="images/pic.jpg">

I need to locate and correct to read:

<img src="http://example.com/images/pic.jpg">

Is there a regex or function that I can use to find src=" when there is no leading slash? It also needs to cater for all types of links, such as a href, script src, etc.

WilliamK
  • In a nutshell, I need to find instances of src=" and determine whether the value stands alone or includes http – WilliamK Sep 10 '20 at 07:18
  • It would probably be a whole lot easier, if you just inserted a `<base>` element with the proper URL set into the document … https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base – CBroe Sep 10 '20 at 07:26
  • sounds good but to catch all resources including the JS and CSS just inside the HEAD tag needs to be applied straight after the head tag which unfortunately can be and another regex needed to find that I guess – WilliamK Sep 10 '20 at 07:37
  • Getting an error when the script URL is hard-coded with http, because it adds src=http://example.com even though "src" is not there, which breaks the JavaScript – WilliamK Sep 12 '20 at 05:07
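
The `<base>` suggestion from the comments could be sketched as follows. This is only an illustration: `insertBaseTag` is a hypothetical helper name, and it assumes the document has a single opening head tag, which may be written in any letter case and may carry attributes.

```php
<?php
// Hypothetical helper (not part of SimpleHtmlDom): insert a <base> element
// directly after the opening <head> tag so that relative URLs in the rest
// of the document are resolved against $baseUrl by the browser.
function insertBaseTag(string $html, string $baseUrl): string
{
    $base = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '">';

    // Case-insensitive match of <head>, <HEAD>, <head lang="en">, etc.
    // The \b stops <header> from matching. Only the first match is replaced.
    $result = preg_replace_callback(
        '/<head\b[^>]*>/i',
        function (array $m) use ($base) {
            return $m[0] . $base;
        },
        $html,
        1,
        $count
    );

    // If no <head> tag was found, return the document unchanged.
    return $count > 0 ? $result : $html;
}

echo insertBaseTag('<html><HEAD><title>t</title></HEAD></html>', 'http://example.com/');
```

Because the element is placed first inside the head, it should come before the CSS and JS references there. Note that a `<base>` element also changes how fragment-only links such as href="#top" resolve, so this is a trade-off rather than a drop-in fix.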

2 Answers


If you are using Simple HTML DOM, you can use the following snippet to adjust the URLs:

<?php
    require 'simple_html_dom.php';

    class Parser {
        protected $url;
        protected $url_parts;

        protected $html_dom = null;
        protected $path = null;

        public function __construct($url) {
            $this->setUrl($url);
        }

        protected function setUrl($url) {
            $this->url = $url;
            $this->url_parts = parse_url($url);
            return $this;
        }

        protected function getUrl() {
            return $this->url;
        }

        protected function getUrlParts() {
            return $this->url_parts;
        }

        protected function getHtmlDom() {
            if ($this->html_dom === null) $this->html_dom = file_get_html($this->getUrl());
            return $this->html_dom;
        }

        /** ------------
            - path ends with /, e.g. foo/bar/foo/, so the full path for the relative image is foo/bar/foo
            - path doesn't end with /, e.g. foo/bar/foo, so the full path for the relative image is foo/bar
        ------------ **/
        public function getPath() {
            if ($this->path === null) $this->path = isset($this->getUrlParts()['path']) ? implode('/', explode('/', $this->getUrlParts()['path'], -1)) : '';
            return $this->path;
        }

        public function getHost() {
            return (isset($this->getUrlParts()['scheme']) ? $this->getUrlParts()['scheme'] : 'http').'://'.$this->getUrlParts()['host'];
        }

        public function adjust($tag, $attribute) {
            foreach($this->getHtmlDom()->find($tag) as $element) {
                if (parse_url($element->$attribute, PHP_URL_SCHEME) === null) {
                    // If the attribute starts with /, prepend only the host part of the URL, because the resource is addressed from the site root
                    if (strpos($element->$attribute, '/') === 0) {
                        $element->$attribute = $this->getHost().$element->$attribute;
                    }else{
                        $element->$attribute = $this->getHost().$this->getPath().'/'.$element->$attribute;
                    }
                }
            }

            return $this;
        }

        public function getHtml() {
            return (string)$this->getHtmlDom();
        }
    }

    $parser = new Parser('https://www.darkbee.be/stack/images/index.html');

    $parser->adjust('img', 'src')
           ->adjust('a', 'href')
           ->adjust('link', 'href')
           ->adjust('script', 'src');

    echo $parser->getHtml();
DarkBee
  • See last update. Switched it up to a class to avoid code repetition – DarkBee Sep 10 '20 at 09:38
  • Unfortunately this fails with compressed html that uses gzip. The html can be decompressed but not from what file_get_html returns. However file_get_contents does work. For an example of the gibberish that can be shown, try using this script to scrape https://www.yahoo.com - file_get_contents does not work in this script – WilliamK Sep 11 '20 at 22:41
  • Unfortunately, links starting with // also get clobbered and look like href="https://example.com//maxcdn.bootstrapcdn.com/bootstrap/3.4.0/css/bootstrap.min.css" – WilliamK Sep 12 '20 at 08:10
  • As u did not specify both requirements in your initial question, this is not included in the code example. I'm quite sure u'll be able to manage to adapt the code by yourself. – DarkBee Sep 14 '20 at 05:58
  • No. I tried but found that I cannot develop it further to cater for all variations of src links. I don't have the php skills for that and the gods keep derailing my questions so I have no more patience atm. – WilliamK Sep 14 '20 at 06:14
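
The `//` case raised in the comments could be handled by a small resolver along these lines. This is a minimal sketch, not part of the answer's class: `resolveUrl` is a hypothetical function name, and it deliberately does not normalise `../` segments or treat query-only references such as `?page=2` specially.

```php
<?php
// Hypothetical helper: turn a link found in a page into an absolute URL.
// $ref is the attribute value, $pageUrl is the URL the page was fetched from.
function resolveUrl(string $ref, string $pageUrl): string
{
    $ref    = trim($ref);
    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $host .= ':' . $parts['port'];
    }

    // Leave empty values, fragments, and anything that already has a scheme
    // (http:, https:, data:, mailto:, javascript:, ...) untouched.
    if ($ref === '' || $ref[0] === '#' || preg_match('/^[a-z][a-z0-9+.-]*:/i', $ref)) {
        return $ref;
    }

    // Protocol-relative link (//cdn.example.net/x.css): only add the scheme.
    if (strpos($ref, '//') === 0) {
        return $scheme . ':' . $ref;
    }

    // Root-relative link (/js/app.js): prepend scheme and host.
    if ($ref[0] === '/') {
        return $host . $ref;
    }

    // Document-relative link (images/pic.jpg): prepend the page's directory.
    $path  = $parts['path'] ?? '/';
    $slash = strrpos($path, '/');
    $dir   = $slash === false ? '/' : substr($path, 0, $slash + 1);

    return $host . $dir . $ref;
}

echo resolveUrl('//maxcdn.bootstrapcdn.com/bootstrap/3.4.0/css/bootstrap.min.css', 'https://example.com/index.html');
```

Inside the loop of the `adjust()` method above, the scheme check and the if/else could then be replaced by a single assignment such as `$element->$attribute = resolveUrl($element->$attribute, $this->getUrl());`.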

You can match with a regex such as <img src=\"(.+?)\"> and check whether the captured group $1 starts with "http".
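
A regex-only variant of this idea might look like the sketch below. It is an assumption-laden illustration: `prefixRelativeLinks` is a hypothetical name, it only handles double-quoted `src` and `href` values, it treats every relative link as relative to the site root (as in the question), and it cannot tell a real attribute from similar-looking text inside an inline script.

```php
<?php
// Hypothetical helper: prefix relative src/href values with $base.
function prefixRelativeLinks(string $html, string $base): string
{
    // The lookbehind avoids matching data-src or similar attribute names.
    $pattern = '/(?<![\w-])(src|href)\s*=\s*"([^"]*)"/i';

    return preg_replace_callback($pattern, function (array $m) use ($base) {
        $url = $m[2];

        // Skip empty values, fragments, URLs with a scheme, and // links.
        if ($url === '' || $url[0] === '#' || preg_match('/^([a-z][a-z0-9+.-]*:|\/\/)/i', $url)) {
            return $m[0];
        }

        // Join base and link with exactly one slash between them.
        return $m[1] . '="' . rtrim($base, '/') . '/' . ltrim($url, '/') . '"';
    }, $html);
}

echo prefixRelativeLinks('<img src="images/pic.jpg">', 'http://example.com');
```

Checking for a scheme at the start of the value, rather than for "http" anywhere in it, avoids misclassifying relative paths that merely contain the letters "http".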

Timberman