1

How can I get a web page's charset encoding from its HTML as a string, not as a DOM?

I get the HTML string like this:

$html = file_get_contents($url);
preg_match_all (string pattern, string subject, array matches, int flags)

But I don't know regex, and I need to find out the web page's charset (UTF-8, windows-1255, etc.). Thanks,

Ben
  • You should check the HTTP header for a character encoding first and only if missing check the HTML after. – Gumbo Jul 31 '10 at 21:27

3 Answers

6

preg_match('~charset=([-a-z0-9_]+)~i',$html,$charset);
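A minimal sketch of how a pattern like this behaves on a sample document. The `extract_charset` helper and the `$sample` markup are made up for illustration. Note one deliberate change from the pattern above: an optional quote is allowed after the `=`, on the assumption that you also want to match the HTML5 form `<meta charset="utf-8">`.

```php
<?php
// Hypothetical helper: returns the first charset named in $html, or null.
function extract_charset(string $html): ?string
{
    // Optional quote after "=" so that <meta charset="utf-8"> also matches.
    if (preg_match('~charset=["\']?([-a-z0-9_]+)~i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Made-up sample input.
$sample = '<html><head><meta http-equiv="Content-Type" '
        . 'content="text/html; charset=windows-1255"></head></html>';

$found = extract_charset($sample);
echo $found ?? 'unknown', "\n";
```

Like the original pattern, this takes the first `charset=` it finds anywhere in the string, so it can be fooled by commented-out tags or by pages that discuss encodings in their text.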

CrayonViolent
  • this seems to suppose that `$html` contains the http header, which it does not. – mvds Jul 31 '10 at 21:38
  • Please no. What if I happen to be parsing a page that explains how to define the encoding of a page?... – Artefacto Jul 31 '10 at 21:40
  • That's assuming it happens before. `meta` can come after the `title` tag, an old `meta` tag may be commented out, etc etc. This is also not a good solution because the HTTP headers have priority. – Artefacto Jul 31 '10 at 21:44
  • I will concede to commented out tags, But overall, he asked for a regex given his current code which uses file_get_contents() to get the html. That is what I gave him. – CrayonViolent Jul 31 '10 at 21:46
  • Thanks, this is exactly what I need. I checked your regex and it works great! – Ben Jul 31 '10 at 22:08
1

First, check the Content-Type header.

//add error handling
$f = fopen($url, "r");
$md = stream_get_meta_data($f);
$wd = $md["wrapper_data"];
foreach($wd as $response) {
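    // "~" is the delimiter because the pattern itself contains "/"
    if (preg_match('~^content-type: .+?/.+?;\\s?charset=([^;"\\s]+|"[^;"]+")~i',
             $response, $matches)) {
         $charset = trim($matches[1], '"'); // drop optional surrounding quotes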
         break;
    }
}
$data = stream_get_contents($f);

You can then fall back on the meta element. That's been answered before here.
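The header-first, meta-second order could be sketched roughly as follows. All three function names and the sample inputs are hypothetical, and the meta pattern is a simple regex rather than a real HTML parser, so treat it as an illustration of the fallback order only.

```php
<?php
// Hypothetical helper: charset from a list of raw HTTP header lines, or null.
function charset_from_headers(array $headers): ?string
{
    foreach ($headers as $line) {
        if (preg_match('~^content-type:.*?;\s*charset="?([^;"\s]+)~i', $line, $m)) {
            return strtolower($m[1]);
        }
    }
    return null;
}

// Hypothetical helper: charset from a <meta> element, or null.
function charset_from_meta(string $html): ?string
{
    if (preg_match('~<meta[^>]+charset=["\']?([-a-z0-9_]+)~i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// The HTTP header wins; the meta element is only a fallback.
function detect_charset(array $headers, string $html): ?string
{
    return charset_from_headers($headers) ?? charset_from_meta($html);
}

// Made-up inputs: the header carries no charset, so the meta element is used.
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html'];
$html    = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
echo detect_charset($headers, $html) ?? 'unknown', "\n";
```

With the snippet above, `$wd` would be passed as the header list and `$data` as the HTML.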

More complex version of header parsing to please the audience:

if (preg_match('~^content-type: .+?/[^;]+?(.*)~i', $response, $matches)) {
    if (preg_match_all('~;\\s?(?P<key>[^()<>@,;:\"/[\\]?={}\\s]+)'.
            '=(?P<value>[^;"\\s]+|"[^;"]+")\\s*~i', $matches[1], $m)) {
        for ($i = 0; $i < count($m['key']); $i++) {
            if (strtolower($m['key'][$i]) == "charset") {
                $charset = trim($m['value'][$i], '"');
            }
        }
    }
}
Artefacto
0

You could use

mb_detect_encoding($html);

but it is generally a bad idea, because it only guesses the encoding from the bytes. It is better to use curl instead and look at the Content-Type header.
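A rough sketch of the curl route, assuming the curl extension is loaded. Both helper names are made up for illustration; `curl_getinfo()` with `CURLINFO_CONTENT_TYPE` is the standard way to read the Content-Type value of the response.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('~;\s*charset="?([^;"\s]+)~i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Hypothetical helper: fetch $url and return [body, charset or null].
// Add error handling: $body is false if the request fails.
function fetch_with_charset(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    // Content-Type value of the final response; empty if the server sent none.
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$body, charset_from_content_type($type ?: null)];
}
```

Usage would look like `[$html, $charset] = fetch_with_charset($url);`, with a null `$charset` meaning the header did not name an encoding.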

mvds