3

I just want to know if it's possible to extract UTF-8 encoded content from an HTML file that is served without an encoding header.

My specific case is this website:

http://www.metal-archives.com/band/discography/id/203/tab/all

I want to extract all the info but, as you can see, this word, for example, comes out garbled:

Motörhead

I tried file_get_html, htmlentities, utf8_decode, utf8_encode, and combinations of them with different options, but I can't find a solution.

Edit:

I just want to see the same website, correctly encoded, using this simple code:

$html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/223/tab/all");
//some transform/decode here
print_r($html_discos);

I want the content correctly encoded in a string or DOM object so I can extract parts of it later.

Edit 2:

file_get_html is a function from the "Simple HTML DOM" library:

http://simplehtmldom.sourceforge.net/

It has this code:

function file_get_html($url, $use_include_path = false, $context=null, $offset = -1, $maxLen=-1, $lowercase = true, $forceTagsClosed=true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN=true, $defaultBRText=DEFAULT_BR_TEXT, $defaultSpanText=DEFAULT_SPAN_TEXT)
{
    // We DO force the tags to be terminated.
    $dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
    // For sourceforge users: uncomment the next line and comment the retreive_url_contents line 2 lines down if it is not already done.
    $contents = file_get_contents($url, $use_include_path, $context, $offset);
    // Paperg - use our own mechanism for getting the contents as we want to control the timeout.
    //$contents = retrieve_url_contents($url);
    if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
    {
        return false;
    }
    // The second parameter can force the selectors to all be lowercase.
    $dom->load($contents, $lowercase, $stripRN);
    return $dom;
}
Gerard Brull
  • It appears that way to me in the browser as well, so the problem may be on the site's end. – Explosion Pills Nov 09 '12 at 12:42
  • It's not clear from your question where your problem is located. Show your code and give a snappy example where *you* see invalid data and in which encoding you display that string. – hakre Nov 09 '12 at 12:51
  • PHP does not have any `file_get_html` function. Unless you share details about that function, not much can be said from the code example you currently give. A simple `header('Content-Type: text/html; charset=utf-8');` before any output is sent might already do the job, but that is just guessing. – hakre Nov 09 '12 at 12:56
  • Okay, I already suspected you are using the Simple HTML DOM library, as it now shows, you are. Consider it to be broken. Instead use `DOMDocument` which ships with PHP and see this question on how to load an UTF-8 encoded website: [How to keep the Chinese or other foreign language as they are instead of converting them into codes?](http://stackoverflow.com/q/10237238/367456) – hakre Nov 09 '12 at 13:18

3 Answers

2

The Content-Type of the URL

http://www.metal-archives.com/band/discography/id/203/tab/all

is:

Content-Type: text/html

Without a charset parameter, this defaults to ISO-8859-1, but the page is actually encoded as UTF-8. Change the Content-Type so the encoding is correctly signaled:

Content-Type: text/html; charset=utf-8

See: Setting the HTTP charset parameter
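
To illustrate why the missing charset produces the garbled word from the question, here is a minimal sketch. It assumes the `mbstring` extension is available, and the variable names are only illustrative. It takes the UTF-8 bytes of the band name and deliberately misreads them as ISO-8859-1, which is what a client does when it falls back to that default:

```php
<?php
// "Motörhead" as raw UTF-8 bytes: "ö" is the two bytes 0xC3 0xB6.
$utf8 = "Mot\xC3\xB6rhead";

// Simulate the wrong assumption: treat every byte as one ISO-8859-1
// character, then re-encode the result as UTF-8 for display.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo $misread, "\n"; // prints "Motörhead"

// The original bytes are valid UTF-8, so no conversion is needed;
// only the declared charset has to match.
var_dump(mb_check_encoding($utf8, 'UTF-8')); // bool(true)
```

The data itself is intact in this scenario; only the label attached to it is wrong.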

hakre
  • It's my understanding that he does not control that website – Explosion Pills Nov 09 '12 at 12:44
  • @ExplosionPills: Then I wonder what the question actually is. – hakre Nov 09 '12 at 12:45
  • It's a external website, I can't control the website as Explosion Pills said. Edit: The question is how to see the correct format (if it's possible) – Gerard Brull Nov 09 '12 at 12:46
  • Sure, but you read it in. Just treat the input as UTF-8 (it is UTF-8) and you have no problem. Signal from *your* website that you serve UTF-8; there is nothing more you need to do. Otherwise, one call to `utf8_decode` per text string will convert a UTF-8 string into an ISO-8859-1 encoded one. But then you need to signal with the output that your data *is* ISO-8859-1 encoded, otherwise it would be broken (again). – hakre Nov 09 '12 at 12:47
1
header('Content-Type: text/html; charset=utf-8');
echo file_get_contents('http://www.metal-archives.com/band/discography/id/203/tab/all');

As long as you emit your own output as UTF-8, the raw data will display correctly.
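
If the goal is a DOM object rather than raw output, as the question asks, the `DOMDocument` route suggested in the comments can be sketched as follows. This is only one possible approach: the helper name `loadUtf8Html` and the sample markup are hypothetical, and the `<?xml encoding="UTF-8">` prefix is a commonly used workaround to hint the encoding to the parser, not something the site or the library prescribes.

```php
<?php
// Hypothetical helper: parse an HTML string that is known to be UTF-8
// even though it carries no charset declaration of its own.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix hints the input encoding to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Stand-in for the downloaded page (assumed markup, UTF-8 bytes for "ö").
$html = "<html><body><h1>Mot\xC3\xB6rhead</h1></body></html>";

$doc = loadUtf8Html($html);
$heading = $doc->getElementsByTagName('h1')->item(0);

// DOMDocument hands text back as UTF-8, so emit it with a matching
// Content-Type header when serving it to a browser.
echo $heading->textContent, "\n";
```

For the real page, `$html` would come from `file_get_contents()` on the URL, as in the snippet above.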

Explosion Pills
0

Try using html_entity_decode http://php.net/manual/en/function.html-entity-decode.php (the source of that page has encoded characters)
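
A minimal sketch of that call, with the target charset passed explicitly so the result does not depend on the PHP version's default. The input string is a made-up example, not taken from the site, and this only helps where the markup really contains entities:

```php
<?php
// Hypothetical input: the band name written with an HTML entity.
$encoded = 'Mot&ouml;rhead';

// Decode entities and ask for UTF-8 output explicitly.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, "\n";          // prints "Motörhead" when the output is shown as UTF-8
echo bin2hex($decoded), "\n"; // "ö" appears as the UTF-8 bytes c3b6
```

Text that is already raw UTF-8 passes through this function unchanged, so it will not repair output that is merely displayed with the wrong charset.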

Martin Lyne
  • Tried: $html_discos = file_get_html("http://www.metal-archives.com/band/discography/id/203/tab/all"); $html_discos = html_entity_decode($html_discos); print_r($html_discos); Saw the same... – Gerard Brull Nov 09 '12 at 12:49