Regexp on HTML cannot extract specific info from meta

Question

Fellows, I have the following string:

<meta charset="UTF-8">

That can be either

It appears within an HTML string, and I want to extract the UTF-8 value. I tried the following code:

preg_match_all('/^(<\s*meta\s*) charset=[^"]\s*($>)*/ix', $contents, $matches);

But somehow it does not work, and I do not know why.

If you are scraping, as you tagged, you are better off learning to use a parser. See http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php — chris85, May 10 '16 at 19:58

Brad Kent · Answer 1 · 2016-05-10T19:55:29.007

0

preg_match_all('/<meta\s[^>]*charset=["\']([^"\'>]+)["\']/i', $contents, $matches);

You've got several issues with charset=[^"]\s*($>)*
[^"] = exactly one character that is not " (so it can never match the opening quote)
\s* = zero or more whitespace characters (this is OK, but unnecessary)
($>)* = not sure what your intent was here. $ anchors to the end of the string, so you are trying to match/capture zero or more occurrences of ">" after the end of the string (it will always be zero)
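To make the points above concrete, here is a minimal sketch of a pattern that allows the quote to be present or absent. The function name extractMetaCharset is hypothetical and not part of the original answer, and the pattern assumes the value itself contains no quotes, whitespace, "/" or ">".

```php
<?php
// Hypothetical helper (not from the original answer): returns the charset
// declared in a <meta ... charset=...> tag, or null if none is found.
function extractMetaCharset(string $html): ?string
{
    // "<meta", one whitespace character, any other attributes (never crossing ">"),
    // then charset= followed by an optional quote.
    // The capture stops at a quote, whitespace, "/" or ">".
    $pattern = '/<meta\s[^>]*?\bcharset\s*=\s*["\']?([^"\'\s\/>]+)/i';

    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

Calling extractMetaCharset($contents) would then return the value whether it is written with double quotes, single quotes, or no quotes at all. There is no ^ anchor, since the tag is normally somewhere in the middle of the document.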

edited May 10 '16 at 19:55

answered May 10 '16 at 19:50

Brad Kent


How can I tell the regexp that the value may or may not start with "? – Dimitrios Desyllas May 10 '16 at 21:08
`["\']*` : " or ' (optional) – Brad Kent May 10 '16 at 21:27

score 0 · Answer 2 · answered May 10 '16 at 20:28

Using the DOMDocument class is a more appropriate and accurate approach in such cases:

$html_string = '<meta charset="UTF-8">';
$doc = new \DOMDocument();
$doc->loadHTML($html_string);
$charset = $doc->getElementsByTagName("meta")->item(0)->getAttribute("charset");

print_r($charset);  // "UTF-8"
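For a full scraped page, which usually has several meta tags and often invalid markup, a slightly more defensive variant might look like the sketch below. The function name metaCharsetFromDom is hypothetical; the DOMDocument and libxml calls are standard PHP, and the sketch assumes the input string is not empty.

```php
<?php
// Hypothetical helper: returns the first charset attribute found on any
// <meta> element, or null if no <meta> element declares one.
function metaCharsetFromDom(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world HTML is often invalid; collect libxml warnings internally
    // instead of emitting them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Check every <meta> element, not only the first one.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if ($meta->hasAttribute('charset')) {
            return $meta->getAttribute('charset');
        }
    }
    return null;
}
```

Looping over all meta elements avoids depending on the position of the tag in the document, which matters when the charset declaration is not the first meta tag.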

score 0 · Accepted Answer · answered May 26 '16 at 09:02

Finally, I switched to Guzzle HTTP and got the encoding from the HTTP header.
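The answer does not show its code, so the following is only a sketch of the parsing step, assuming the header value has already been obtained as a string (with a PSR-7 response such as Guzzle's, presumably via $response->getHeaderLine('Content-Type')). The function name charsetFromContentType is hypothetical.

```php
<?php
// Hypothetical helper: pulls the charset parameter out of a Content-Type
// header value, or returns null if the header carries no charset.
function charsetFromContentType(string $contentType): ?string
{
    // Parameters follow the media type, separated by ";",
    // e.g. "text/html; charset=UTF-8". The value may be quoted.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

When the server sends no charset parameter, the function returns null, and falling back to the meta tag would still be necessary.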

answered May 26 '16 at 09:02

Dimitrios Desyllas

