I have already browsed several questions on this topic here, but none of them solved my problem. I am loading data into a DOMDocument object. When I print the results of an XPath query, a � appears instead of an ñ.

My controller includes:

public function index() {
    $data = array();
    $dom = new DomDocument();    
    @$dom->loadHTML(mb_convert_encoding(file_get_contents('http://www.example.com'), 'HTML-ENTITIES', 'UTF-8'));
    $xpath = new DomXpath($dom);
    foreach($xpath->query('//span') as $element) {
        $data['titles'][] = $element->nodeValue;
    }

    $this->load->view('view_example', $data);
}

My view_example.php includes:

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

My view_example.php also includes:

<?php

foreach($titles as $element) {
    echo $element;
}

?>

My config file includes:

$config['charset'] = 'UTF-8';

Also, I re-checked the character encoding of every file using Komodo Edit and Notepad++, and chose UTF-8 without BOM for each one.

When I remove the @ sign, the following warning is printed. Is it relevant to this case?

DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: X
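For inspecting warnings like this one without the `@` operator, libxml's internal error buffer can be used. The following is only a minimal sketch: the sample markup is hypothetical (it is not the scraped page), and whether it produces the same `htmlParseEntityRef` message depends on the installed libxml2 version.

```php
<?php
// Hypothetical sample markup with unescaped "&" characters, the usual
// trigger for entity-related parser warnings.
$html = '<html><body><span>Tom & Jerry</span>'
      . '<a href="?a=1&b=2">link</a></body></html>';

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Each buffered entry is a LibXMLError object with message, line and column.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

print_r($messages);
```

Unlike `@`, this keeps the messages available for logging, so it is possible to see which lines of the input the parser complains about.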

  • If you view the source code of the web page in your browser what do you see? – MonkeyZeus Jan 23 '14 at 18:38
  • In the source code it displays the � as well. – user3228959 Jan 23 '14 at 18:41
  • I would try putting the data into a file like this and opening the file with NotePad++ in UTF-8 mode. `file_put_contents('contents.txt', $this->load->view('view_example', $data, true));` – MonkeyZeus Jan 23 '14 at 18:43
  • It shows the � in contents.txt as well. I opened the file with NotePad++. – user3228959 Jan 23 '14 at 18:46
  • What is the encoding of the website you are scraping? – MonkeyZeus Jan 23 '14 at 18:47
  • Maybe this is relevant: http://stackoverflow.com/a/11978382/2191572 – MonkeyZeus Jan 23 '14 at 18:49
  • In FireFox it says UTF-8 when checking the site's info. However, there is nothing set in the source code. Is there a proper way to find out which encoding is used? – user3228959 Jan 23 '14 at 18:50
  • Where does FireFox say UTF-8? The `charset` can be set in the headers that the server sends you so that the HTML does not have to tell your browser what to do. If you have FireBug then I would recommend visiting the website and going to the Network tab and find the Response Headers section. In there you should find `content-type: Something` – MonkeyZeus Jan 23 '14 at 18:52
  • FireBug's result: `Content-Type: text/html; charset=utf-8`. Using `htmlspecialchars_decode(utf8_decode(htmlentities(file_get_contents('http://www.example.com'), ENT_COMPAT, 'utf-8', false)));` it now shows a question mark (?) instead of the ñ. – user3228959 Jan 23 '14 at 18:56
  • Hmm, so what does the source code in-browser and source code from `file_put_contents()` show now? – MonkeyZeus Jan 23 '14 at 19:00
  • Both the source code and contents.txt show the question mark as well; I forgot to mention that. – user3228959 Jan 23 '14 at 19:02
  • Out of curiosity what happens if you remove `mb_convert_encoding()` like this `@$dom->loadHTML(file_get_contents('http://www.example.com'));`? – MonkeyZeus Jan 23 '14 at 19:36
  • Then it shows ¿½ instead of the ñ everywhere (website, source-code and contents.txt). – user3228959 Jan 23 '14 at 19:45
  • Why are you suppressing errors like this `@$dom`? Try removing the `@` symbol – MonkeyZeus Jan 23 '14 at 19:49
  • As stated in my first post, it prints 'DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: X' a lot of times. – user3228959 Jan 23 '14 at 20:03
  • I did not notice the edit. Anyways check this out http://stackoverflow.com/a/5086048/2191572 – MonkeyZeus Jan 23 '14 at 20:09
  • That is correct, however, now I don't get valid HTML when using `htmlentities(file_get_contents('http://www.example.com'));` because it also escapes the markup itself, for example turning `<span>Hello World</span>` into `&lt;span&gt;Hello World&lt;/span&gt;`. So I have a problem when using DomDocument::loadHTML(). How can I treat just the problematic characters in this case? – user3228959 Jan 23 '14 at 20:44
  • Is the other website markup valid? This seems to be the root of your issue. – MonkeyZeus Jan 23 '14 at 20:46
  • In the source-code I can find the ñ. – user3228959 Jan 23 '14 at 21:01
  • Is there a way to make the website's markup valid before loading it as HTML? – user3228959 Jan 23 '14 at 21:46
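One possible way to treat only the problematic characters before parsing is sketched below. It has not been verified against the site in question and rests on several assumptions: the input really is UTF-8, the parser warnings come from unescaped `&` characters, and the mbstring extension is available. The helper name `escape_bare_ampersands` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: escape only "&" characters that do not already
// start a named or numeric entity, leaving tags and valid entities alone.
function escape_bare_ampersands($html)
{
    $pattern = '/&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)/';
    return preg_replace($pattern, '&amp;', $html);
}

// Hypothetical sample standing in for the scraped page (UTF-8 bytes).
$html = '<html><body><span>Año & más</span>'
      . '<a href="?a=1&b=2">x</a></body></html>';

$clean = escape_bare_ampersands($html);

// Turn every non-ASCII character into a numeric entity so the parser
// only ever sees ASCII and does not have to guess the encoding.
$clean = mb_encode_numericentity($clean, array(0x80, 0x10FFFF, 0, 0x1FFFFF), 'UTF-8');

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($clean);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Same query as in the controller; nodeValue is returned as UTF-8.
$xpath = new DOMXPath($dom);
$titles = array();
foreach ($xpath->query('//span') as $element) {
    $titles[] = $element->nodeValue;
}

print_r($titles);
```

Unlike running `htmlentities()` over the whole page, this leaves `<` and `>` untouched, so the markup stays parseable while stray ampersands and non-ASCII characters are converted.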
