I am trying to load the meta description of this website (which has a German character) via the following script in PHP:

$page_content = file_get_contents($uri);
$dom_obj = new \DOMDocument();
$dom_obj->loadHTML(mb_convert_encoding($page_content, 'HTML-ENTITIES', 'UTF-8'));

However, when I try to write it to the MySQL db, Laravel reports an error: Incorrect string value "\xC3" (which is the German character).

When I simply do the following, writing to the db works, but the character is not displayed correctly (Ã¼ instead of ü):

$dom_obj->loadHTML($page_content);

So far this problem only occurs with this website; other sites I tried that contain the same character work. Can you think of a possible reason and fix? Thank you!

Edit:

It works fine when I use PHP's "utf8_decode" to decode the meta description that I get via $dom_obj without mb_convert_encoding. However, when I do this, all the other sites that worked before lead to errors (like this: Incorrect string value: '\xE4t').

Hillcow
  • When writing to the DB, use UTF-8 encoding for the table and column you are inserting into or reading from – Punith R Kashi Jul 02 '18 at 17:40
  • https://stackoverflow.com/questions/279170/utf-8-all-the-way-through?s=1|822.2639 – AbraCadaver Jul 02 '18 at 17:49
  • Latin1 `C3` is `Ã`, which I don't think of as German. On the other hand, several German characters, when encoded in UTF-8 are 2 bytes, the first of which is hex `C3`. For example: `ß` is hex `C39F`. – Rick James Jul 02 '18 at 18:15
  • Are you converting the site source to 7-bit US-ASCII because your application and/or database is not using UTF-8? What encoding are you using then? – Álvaro González Jul 02 '18 at 18:25
  • I am using Laravel. Laravel uses utf8mb4. @RickJames it's not just Ã, it's ü. – Hillcow Jul 02 '18 at 18:27
  • Why the `mb_convert_encoding()` part then? Whatever, [`ü`](https://apps.timwhitlock.info/unicode/inspect?s=%C3%BC) is encoded as `C3 BC 00` in UTF-8. If MySQL complains, it means the input is not being handled as UTF-8 all the way. – Álvaro González Jul 02 '18 at 18:30
  • Hmmm... In fact it should be `0xC3 0xBC`, no idea why Unicode Inspector adds zeroes... – Álvaro González Jul 02 '18 at 18:31
  • I don't know. When I use other websites it works just fine, not sure what is different with this one. But I need it to work with every website (whether or not it is in UTF-8) – Hillcow Jul 02 '18 at 18:36
  • The whole point of UTF-8 is that it's a Unicode compatible encoding so it works with all sites. So you need to use UTF-8 properly, not replace it with anything else. – Álvaro González Jul 02 '18 at 18:50
  • I'm not sure what you mean. Where am I replacing it? I have no control over which sites' descriptions I will need to fetch. – Hillcow Jul 02 '18 at 18:52
  • Please check the edit in the initial post, thanks – Hillcow Jul 02 '18 at 19:01
  • Not sure how to explain. I just mean that UTF-8 is a 100% Unicode compatible encoding, thus it's the one and only tool you need to deal with any human script. However, you're transforming data to other encodings for no clear reason. `mb_convert_encoding($page_content, 'HTML-ENTITIES', 'UTF-8')` basically converts from UTF-8 to US-ASCII (this shouldn't break anything by pure chance but it's unnecessary). `utf8_decode()` converts from UTF-8 to ISO-8859-1. And so on. You just need to ensure you use UTF-8 properly all through the process; `substr()` on multi-byte characters is a good example of where that goes wrong. – Álvaro González Jul 03 '18 at 07:13
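
The conversions discussed in the comments can be checked at the byte level. The following is only a minimal sketch, assuming the mbstring extension is loaded; the sample strings are made up for illustration and `mb_convert_encoding()` to ISO-8859-1 stands in for `utf8_decode()`. It suggests why decoding to ISO-8859-1 leads to an error such as Incorrect string value: '\xE4t', and where the Ã¼ display comes from.

```php
<?php
// 'ä' followed by 't', written as explicit UTF-8 bytes (hypothetical sample).
$utf8 = "\xC3\xA4t";

// UTF-8 -> ISO-8859-1, the same conversion utf8_decode() performs.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n"; // e474

// The ISO-8859-1 bytes are no longer valid UTF-8.
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Reading the UTF-8 bytes of 'ü' as ISO-8859-1 and re-encoding them
// produces the double-encoded form that displays as two characters.
$mojibake = mb_convert_encoding("\xC3\xBC", 'UTF-8', 'ISO-8859-1');
echo bin2hex($mojibake), "\n"; // c383c2bc
```

A string that fails `mb_check_encoding($value, 'UTF-8')` is the kind of value a utf8mb4 column would be expected to reject.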

1 Answer

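I found the error. I was using substr to shorten the description. substr counts bytes, not characters, so it cut one of those multi-byte special characters in half. That left an incomplete UTF-8 sequence at the end of the string, which is why writing it to the db wasn't working.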

foreach ($dom_obj->getElementsByTagName('meta') as $meta) {
  if ($meta->getAttribute('name') == 'description') {
    // substr() works on bytes and can split a multi-byte UTF-8 character
    substr($meta->getAttribute('content'), 0, 156);
  }
}

Using mb_substr, which counts characters rather than bytes, fixes it:

mb_substr($foo,0,156,"UTF-8");
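
To see the difference, here is a small sketch, assuming the mbstring extension is available. The sample string and variable names are hypothetical and not taken from the actual site; the string is written as explicit bytes so the result does not depend on the file's encoding.

```php
<?php
// "Grüße" as explicit UTF-8 bytes: 'ü' is C3 BC and 'ß' is C3 9F.
$text = "Gr\xC3\xBC\xC3\x9Fe";

// substr() counts bytes, so a length of 3 ends in the middle of 'ü'.
$byte_cut = substr($text, 0, 3);

// mb_substr() counts characters, so the result stays valid UTF-8.
$char_cut = mb_substr($text, 0, 3, 'UTF-8');

echo bin2hex($byte_cut), "\n";                   // 4772c3 (ends in a lone C3)
var_dump(mb_check_encoding($byte_cut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($char_cut, 'UTF-8')); // bool(true)
echo bin2hex($char_cut), "\n";                   // 4772c3bc
```

The byte-based cut ends in a lone `C3`, which matches the "\xC3" reported in the error message from the question.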
Hillcow