-1

Fellows, I have the following string:

<meta charset="UTF-8">

That can be either

Over an HTML string, and I want to extract the UTF-8 value. I tried the following code:

preg_match_all('/^(<\s*meta\s*) charset=[^"]\s*($>)*/ix', $contents, $matches);

But somehow it does not work, and I do not know why.

Dimitrios Desyllas
  • If you are scraping, as you tagged, you are better off learning to use a parser. See http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php – chris85 May 10 '16 at 19:58

3 Answers

0
preg_match_all('/<meta\s[^>]*charset=["\']([^"\'>]+)["\']/i', $contents, $matches);

You've got several issues with charset=[^"]\s*($>)*
[^"] = exactly one character that is not ", so it can never match the opening quote after charset=
\s* = zero or more whitespace characters (this is OK, but unnecessary)
($>)* = not sure what your intent was here. $ anchors to the end of the string, so you are trying to match/capture zero or more ">" after the end of the string (it will always be zero).
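To make the points above concrete, here is a minimal sketch of a pattern that tolerates other attributes before charset, single or double quotes, and an unquoted value. The function name extractMetaCharset is hypothetical, and a regex like this is only a rough heuristic, not a real HTML parser.

```php
<?php
// Hypothetical helper: returns the charset declared in a <meta ... charset=...> tag, or null.
function extractMetaCharset(string $html): ?string
{
    // "<meta", then any attributes (lazily), then charset= with an optional
    // opening quote, then the value: anything up to a quote, whitespace, "/" or ">".
    $pattern = '/<meta\b[^>]*?\bcharset\s*=\s*["\']?([^"\'\s\/>]+)/i';

    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null; // no charset declaration found
}

var_dump(extractMetaCharset('<html><head><meta charset="UTF-8"></head></html>')); // string(5) "UTF-8"
```

Note that there is no ^ anchor here, so the tag may appear anywhere in the string.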

Brad Kent
0

Using the DOMDocument class is a more appropriate and accurate way to handle such cases:

$html_string = '<meta charset="UTF-8">';
$doc = new \DOMDocument();
$doc->loadHTML($html_string);
// Search the whole document rather than relying on the position of <html> in childNodes.
$charset = $doc->getElementsByTagName("meta")->item(0)->getAttribute("charset");

print_r($charset);  // UTF-8
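For full pages rather than a one-tag string, a slightly more defensive variant might look like the sketch below. It assumes the DOM extension is available, and the function name findMetaCharset is hypothetical. libxml_use_internal_errors() keeps markup warnings from being emitted while parsing.

```php
<?php
// Hypothetical helper: first non-empty charset attribute on any <meta> element, or null.
function findMetaCharset(string $html): ?string
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new \DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the previous setting

    if ($loaded === false) {
        return null;
    }

    // Walk every <meta> element; skip those without a charset attribute.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $charset = $meta->getAttribute('charset');
        if ($charset !== '') {
            return $charset;
        }
    }
    return null;
}
```

Iterating over all meta elements avoids a fatal error when the first one carries no charset attribute or when the page has no meta tag at all.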
RomanPerekhrest
0

Finally, I switched to Guzzle HTTP and got the encoding from the HTTP header.
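The answer gives no code, so the following is only a sketch of the idea: reading the charset parameter from a Content-Type header value. The function name charsetFromContentType is hypothetical. With Guzzle, the header value could presumably be taken from the PSR-7 response via $response->getHeaderLine('Content-Type'); that part is an assumption and is not run here.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type header value.
function charsetFromContentType(string $header): ?string
{
    // Matches e.g. 'text/html; charset=UTF-8' or 'text/html; charset="utf-8"'.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $header, $m) === 1) {
        return $m[1];
    }
    return null; // header carries no charset parameter
}

// With Guzzle (assumption): $header = $response->getHeaderLine('Content-Type');
echo charsetFromContentType('text/html; charset=UTF-8'), "\n"; // UTF-8
```

Servers do not always send a charset parameter, so the null case is worth handling, for example by falling back to the meta tag.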

Dimitrios Desyllas