PHP HTML parsing: How can I get the charset value of a webpage with Simple HTML DOM Parser?

Question

In PHP, how can I get the charset value of a webpage (utf-8, windows-255, etc.) with Simple HTML DOM Parser?

Note: it has to be done with Simple HTML DOM Parser: http://simplehtmldom.sourceforge.net

Example 1, webpage charset input:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

Result: utf-8

Example 2, webpage charset input:

<meta content="text/html; charset=windows-255" http-equiv="Content-Type">

Result: windows-255

Edit:

I tried this, but it does not work:

$html = file_get_html('http://www.google.com/');
$el=$html->find('meta[content]',0);
echo $el->charset;

What should I change? (I know that `$el->charset` does not work.)

Thanks

Run an xpath query for `//meta[@http-equiv="Content-Type"]/@content`. You'll have to parse the attribute value yourself. — Frank Farmer, Jul 28 '10 at 18:19
Suggested third party alternatives that actually use DOM instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html) and [FluentDom](http://www.fluentdom.org). — Gordon, Jul 28 '10 at 18:28

MvanGeest · Accepted Answer · 2010-07-28T18:51:45.193

3

You'll have to match the string using a regular expression (I hope you have PCRE...).

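// Find the first <meta> element whose http-equiv attribute is "Content-Type".
$el = $html->find('meta[http-equiv=Content-Type]', 0);
// Its content attribute looks like "text/html; charset=utf-8".
$fullvalue = $el->content;
// Capture everything after "charset=" into $matches[1].
preg_match('/charset=(.+)/', $fullvalue, $matches);
echo $matches[1];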

Not very robust, but should work.
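
A slightly more defensive variant, as a sketch only: the pattern below ignores case, tolerates spaces and an optional quote around the value, and stops at whitespace, a semicolon, or a quote so that trailing parameters are not captured. The helper name `extract_charset` is made up for illustration and is not part of Simple HTML DOM; it only needs the string you get from `$el->content`. Lowercasing the result is a choice made here, on the assumption that you want to compare charset names case-insensitively.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): pulls the charset
// token out of a Content-Type style string such as "text/html; charset=utf-8".
function extract_charset($content)
{
    // Case-insensitive; allows spaces around "=" and an optional opening quote.
    // The capture group stops at whitespace, ";" or a quote.
    if (preg_match('/charset\s*=\s*["\']?([^\s;"\']+)/i', (string) $content, $matches)) {
        return strtolower($matches[1]);
    }
    return null; // no charset parameter found
}

echo extract_charset('text/html; charset=utf-8'), "\n";
echo extract_charset('text/html; Charset = "ISO-8859-1"; foo=bar'), "\n";
```

With Simple HTML DOM this would be called as `extract_charset($el->content)`, after checking that `find()` actually returned an element.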

edited Jul 28 '10 at 18:51

answered Jul 28 '10 at 18:29

MvanGeest

Thanks! I changed it a bit and it works; see my answer. `$html = file_get_html('http://www.google.com/'); $el=$html->find('meta[content]',0); $fullvalue = $el->content; preg_match('/charset=(.+)/', $fullvalue, $matches); echo substr($matches[0], strlen("charset="));` – Ben Jul 28 '10 at 18:49
**Don't do that**, I made a mistake. It should be `$matches[1]`. That makes it a lot faster and more reliable. – MvanGeest Jul 28 '10 at 18:52

score 2 · Answer 2 · answered Jul 28 '10 at 18:30

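// $data is assumed to hold the page's HTML source as a string.
$dd = new DOMDocument;
$dd->loadHTML($data);
foreach ($dd->getElementsByTagName("meta") as $meta) {
    // http-equiv is compared case-insensitively ("Content-Type", "Content-type", ...).
    if (strtolower($meta->getAttribute("http-equiv")) == "content-type") {
        $v = $meta->getAttribute("content");
        // A separate $matches variable keeps the loop variable from being overwritten.
        if (preg_match("#.+?/.+?;\\s?charset\\s?=\\s?(.+)#i", $v, $matches))
            echo $matches[1];
    }
}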

Note that the DOM extension implicitly converts all the data to UTF-8.
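
The XPath query suggested in the comments on the question can be combined with the same idea. This is a rough sketch rather than a definitive implementation: the `$html` string is sample markup standing in for the fetched page, and the `translate()` call is there because XPath 1.0 has no lower-case function, so it is used to match `http-equiv` regardless of case.

```php
<?php
// Sample markup standing in for the fetched page (an assumption for this sketch).
$html = '<html><head>'
      . '<meta http-equiv="Content-type" content="text/html; charset=ISO-8859-1">'
      . '</head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the content attribute of every <meta> whose http-equiv,
// lowercased via translate(), equals "content-type".
$query = '//meta[translate(@http-equiv, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", '
       . '"abcdefghijklmnopqrstuvwxyz") = "content-type"]/@content';

$charset = null;
foreach ($xpath->query($query) as $attr) {
    // The attribute value still has to be parsed by hand.
    if (preg_match('/charset\s*=\s*["\']?([^\s;"\']+)/i', $attr->value, $matches)) {
        $charset = $matches[1];
        break;
    }
}
echo $charset === null ? 'not found' : $charset, "\n";
```

Pages that declare the encoding only in the HTTP response header, or not at all, will produce "not found" here, so a fallback is still needed in practice.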

Artefacto

Now that's a bit more robust than what I wrote... :) – MvanGeest Jul 28 '10 at 18:31
Thanks for this option, because it's very important to have UTF-8 data. – Ben Jul 28 '10 at 18:34
@Mva yeah, Content-Type is sometimes written "Content-type". At least in the http headers, case doesn't matter. – Artefacto Jul 28 '10 at 18:35
DOMDocument does not always convert the text to UTF-8 properly. I am still working on this problem. – Ben Jul 30 '10 at 13:48

score 1 · Answer 3 · answered Jul 28 '10 at 18:48

Thanks to MvanGeest's answer. I just changed it a bit and it works perfectly.

$html = file_get_html('http://www.google.com/');
$el=$html->find('meta[content]',0);
$fullvalue = $el->content;
preg_match('/charset=(.+)/', $fullvalue, $matches);
echo substr($matches[0], strlen("charset="));

Ben

Weird... it's working for me. You don't need the `substr` though... just `$matches[1]`. I tested it using Google. – MvanGeest Jul 28 '10 at 22:01
