How can I get a webpage's charset encoding from the HTML as a string, not as a DOM?

Question

I get the HTML as a string like this:

$html = file_get_contents($url);
preg_match_all (string pattern, string subject, array matches, int flags)

but I don't know regex, and I need to find out the webpage's charset (UTF-8, windows-1255, etc.). Thanks.

You should check the HTTP header for a character encoding first and only if missing check the HTML after. — Gumbo, Jul 31 '10 at 21:27

score 6 · Accepted Answer · answered Jul 31 '10 at 21:31

preg_match('~charset=([-a-z0-9_]+)~i',$html,$charset);
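A minimal sketch of how the result might be read, since `preg_match` fills its third argument with an array and the charset name ends up in index 1. The function name `charset_from_html` is hypothetical, and the pattern is a slight variation on the one above: it also tolerates an opening quote, so that `<meta charset="utf-8">` matches too.

```php
<?php
// Hypothetical helper (not from the answer): returns the first
// charset=... value found anywhere in $html, lower-cased, or null.
function charset_from_html(string $html): ?string
{
    // Variation on the answer's pattern: ["\']? allows an optional quote
    // right after the equals sign. $m[1] is the first capture group.
    if (preg_match('~charset=["\']?([-a-z0-9_]+)~i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null; // no charset declaration found
}
```

Like the one-liner, this matches the first `charset=` anywhere in the string, so it shares the limitations raised in the comments.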

CrayonViolent

this seems to suppose that `$html` contains the http header, which it does not. – mvds Jul 31 '10 at 21:38
Please no. What if I happen to be parsing a page that explains how to define the encoding of a page?... – Artefacto Jul 31 '10 at 21:40
That's assuming it happens before. `meta` can come after the `title` tag, an old `meta` tag may be commented out, etc etc. This is also not a good solution because the HTTP headers have priority. – Artefacto Jul 31 '10 at 21:44
I will concede the point about commented-out tags. But overall, he asked for a regex given his current code, which uses file_get_contents() to get the HTML. That is what I gave him. – CrayonViolent Jul 31 '10 at 21:46
Thanks, this is exactly what I need. I checked your regex and it works great! – Ben Jul 31 '10 at 22:08

score 1 · Answer 2 · edited May 23 '17 at 12:27

First, you have to check the Content-Type header.

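// Add error handling: fopen() returns false on failure.
$f = fopen($url, "r");
$md = stream_get_meta_data($f);
$wd = $md["wrapper_data"]; // for http:// URLs, the response header lines
foreach ($wd as $response) {
    // "~" is the delimiter because the pattern itself contains "/".
    if (preg_match('~^content-type: .+?/.+?;\\s?charset=([^;"\\s]+|"[^;"]+")~i',
             $response, $matches)) {
         $charset = trim($matches[1], '"'); // drop quotes around a quoted value
         break;
    }
}
$data = stream_get_contents($f);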

You can then fall back on the meta element. That's been answered before here.
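As a rough sketch of that fallback, the function below strips HTML comments and then looks for a charset only inside `<meta>` tags, which sidesteps commented-out tags and pages that merely mention `charset=` in their text. The names `charset_from_meta` and `pick_charset` are hypothetical, and this is a regex approximation rather than a real HTML parser.

```php
<?php
// Hypothetical helper: charset declared in a <meta> element, or null.
function charset_from_meta(string $html): ?string
{
    // Drop HTML comments so a commented-out <meta> tag is ignored.
    $html = preg_replace('~<!--.*?-->~s', '', $html) ?? $html;

    // Collect every <meta ...> tag; give up if there are none.
    if (!preg_match_all('~<meta\b[^>]*>~i', $html, $tags)) {
        return null;
    }
    foreach ($tags[0] as $tag) {
        // Covers <meta charset="..."> and content="text/html; charset=...".
        if (preg_match('~charset\s*=\s*["\']?\s*([-a-z0-9_:.]+)~i', $tag, $m)) {
            return strtolower($m[1]);
        }
    }
    return null;
}

// Hypothetical glue: the HTTP header value wins, the meta element is the fallback.
function pick_charset(?string $fromHeader, string $html): ?string
{
    return $fromHeader ?? charset_from_meta($html);
}
```

Here `pick_charset($charset ?? null, $data)` would combine the header loop above with the meta lookup, assuming `$charset` is only set when the header carried one.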

More complex version of header parsing to please the audience:

if (preg_match('~^content-type: .+?/[^;]+?(.*)~i', $response, $matches)) {
    if (preg_match_all('~;\\s?(?P<key>[^()<>@,;:\"/[\\]?={}\\s]+)'.
            '=(?P<value>[^;"\\s]+|"[^;"]+")\\s*~i', $matches[1], $m)) {
        for ($i = 0; $i < count($m['key']); $i++) {
            if (strtolower($m['key'][$i]) == "charset") {
                $charset = trim($m['value'][$i], '"');
            }
        }
    }
}

Community

answered Jul 31 '10 at 21:29

Artefacto

what happened to pattern delimiters and case sensitivity? – mvds Jul 31 '10 at 21:33
regex has no delims and that greedy capture is gonna give a lot more than you want back – CrayonViolent Jul 31 '10 at 21:33
Why don't you use file_get_contents instead of fopen? I need the HTML for other tasks afterwards. – Ben Jul 31 '10 at 21:34
@Crayon I forgot the delimiters, but I had non-greedy quantifiers there all the time. – Artefacto Jul 31 '10 at 21:34
@Yosef Because I needed to get the headers for the request. `file_get_contents` returns a string immediately so you have to change to fetch them. – Artefacto Jul 31 '10 at 21:35
really? well what do you call (.*) then? – CrayonViolent Jul 31 '10 at 21:35
@Crayon: greedy but it will not eat a newline. – mvds Jul 31 '10 at 21:37
@Crayon That's greedy, but it's the last thing in the expression; it doesn't make any difference. – Artefacto Jul 31 '10 at 21:37
@Crayon It will be, unless the server is violating the HTTP protocol. – Artefacto Jul 31 '10 at 21:41
@Crayon I think you're mistaking HTTP headers for HTML data. – Artefacto Jul 31 '10 at 21:44
@mvds Damn you and your references :p All right, I'll fix it. – Artefacto Jul 31 '10 at 21:51
@Artefacto: since you're getting the points here.. is the charset always the first parameter? ;-) – mvds Jul 31 '10 at 22:12
@mvds Ah, I've already hit the cap a few hours ago. I'll fix it, though. – Artefacto Jul 31 '10 at 22:23

score 0 · Answer 3 · answered Jul 31 '10 at 21:24

You could use

mb_detect_encoding($html);

but it is generally a bad idea, because it only guesses the encoding from the bytes. It is better to use curl and look at the Content-Type header.
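One way the curl route might look, as a sketch: `curl_getinfo()` with `CURLINFO_CONTENT_TYPE` returns the Content-Type value of the response, and a small regex pulls out the `charset` parameter. The function names `charset_from_content_type` and `fetch_with_charset` are hypothetical, and error handling is left out.

```php
<?php
// Hypothetical helper: extract the charset parameter from a
// Content-Type value such as "text/html; charset=UTF-8".
function charset_from_content_type(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('~;\s*charset\s*=\s*"?([^;"\s]+)~i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null; // header missing or without a charset parameter
}

// Hypothetical wrapper: fetch $url and return [body, charset-or-null].
function fetch_with_charset(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects to the final page
    $body = curl_exec($ch);                         // false on failure
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return [$body, charset_from_content_type(is_string($type) ? $type : null)];
}
```

A call such as `[$html, $charset] = fetch_with_charset($url);` keeps the HTML available for later processing; when `$charset` is null, the HTML itself still has to be inspected.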

mvds

Then maybe *"use curl instead and look at the Content-Type header"* – mvds Jul 31 '10 at 21:36
