2

I'm working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from Facebook Open Graph API. The strings are supposed to be all UTF8 according to facebook. So i import the captions into the database and store them as utf8-unicode (i also tried utf8-bin)

But i always have the same error when trying to display the output XML feed, because one of the caption have a weird whitespace character

This page contains the following errors:

error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.

In the database (phpmyadmin) and in the page source code (using chrome), the problematic characters appear as empty square symbol.
Now if i copy and paste the problematic character in an converter it gives me Hexadecimal 000B

What's the easiest way to fix this ? I'd also like to understand in the first place, why Facebook Graph API is giving me non-utf8 characters when it's not supposed to

Failed attemps:

  • utf8_encode() isn't working because the rest of the strings are UTF8 valid.
  • I also tried multiple different ways of stripping out all non-utf8 characters, but it doesn't filter out this specific character. Same when trying to filter out all non-latin.
  • htmlentities() htmlspecialchars() or the same isn't encoding the problematic characters
  • charactericonv(mb_detect_encoding()) will not detect the string as invalid utf8
  • str_replace() or preg_replace() is of no help, if i try to copy and paste the character in Visual Studio Code, nothing is pasted, not even a whitespace
  • str_replace("\0", "", ) ...nope
Rick James
  • 135,179
  • 13
  • 127
  • 222
comdex
  • 319
  • 3
  • 9
  • Have you tried [utf8mb4](http://stackoverflow.com/questions/367711/what-is-the-best-collation-to-use-for-mysql-with-php)? The comments in the topic I refer to explain why some utf-8 collation mechanisms don't produce utf-8 results. – Pyromonk Apr 21 '17 at 04:00
  • Testing it right now, no changes so far. Am i supposed to re-import the strings in the database or it will automatically convert them to the new encoding when switching from utf8 unicode to general_ci ? – comdex Apr 21 '17 at 04:05
  • 1
    [You will have to re-insert the data](http://stackoverflow.com/questions/5575491/what-will-happen-to-existing-data-if-i-change-the-collation-of-a-column-in-mysql), unfortunately. – Pyromonk Apr 21 '17 at 04:08
  • oops, just saw the edit to your post. I'm not familiar with utf8mb4, should i test with utf8_mb4_bin or utf8_mb4_general_ci or utf8_mb4_unicode_ci or ... ? – comdex Apr 21 '17 at 04:09
  • 1
    I apologise for that. I would personally choose unicode_ci. [This topic might be helpful with regard to choosing the proper settings](http://stackoverflow.com/questions/14329314/what-mysql-collation-is-best-for-accepting-all-unicode-characters). – Pyromonk Apr 21 '17 at 04:13
  • ok so i switched to utf8_mb4_unicode_ci , cleared database, re-imported again, but the string still has the symbol :( https://i.gyazo.com/06523fc39813b13af76f4414cefbec5c.png – comdex Apr 21 '17 at 04:41
  • I apologise I made you go through all that trouble. What line return character are you using? If it's \n, can you substitute that with \r\n and see if that helps? What are the numbers inside the square? **Upd**: [it's the line tabulation character](http://www.fileformat.info/info/unicode/char/000B/index.htm). – Pyromonk Apr 21 '17 at 04:48
  • `str_replace("\v", "", $str);` should solve the issue. I should've asked you for a screenshot initially instead of making you commit changes that take quite some time. – Pyromonk Apr 21 '17 at 05:06
  • `str_replace("\x0b", "", $str);`, to be more specific. [\v can have unnecessary side-effects](http://php.net/manual/en/regexp.reference.escape.php). – Pyromonk Apr 21 '17 at 05:14
  • 1
    Woohoo this code fixed it ! Thank you so much !!! I spent so much time trying to fix this issue before coming here so i really don't mind if i had to do unneeded changes... at least i learned something on utf8_mb4 :D – comdex Apr 21 '17 at 05:21
  • Yay! Please write down the steps it took you to solve the issue as an answer to your question when you get the time, submit it and mark your answer as accepted, so that other people can benefit from your solution in the future. – Pyromonk Apr 21 '17 at 05:24
  • Let me know if you would like me to post the answer instead. – Pyromonk Apr 21 '17 at 12:23
  • sure, i'll mark it as accepted answer. thanks again! – comdex Apr 21 '17 at 22:44
  • It sounds like the 0B is a red herring. Please find the original hex. – Rick James Apr 23 '17 at 16:37

2 Answers2

1

Here is a list of what we have found and/or worked through with the original poster:

We have checked the above and discovered that the initial problem was caused by vertical tabulation symbols creeping into the text fields. A good way to remove said symbols is by running $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. It's important to not replace \v, as that might not be desired.

Community
  • 1
  • 1
Pyromonk
  • 684
  • 1
  • 12
  • 27
1

If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are "BOM" encoded. Wikipedia on BOM .

At least come back with the various steps the data takes, so we can help with deducing the source of the problem.

Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the 'real' problem.

(using str_replace is just a bandaid)

Rick James
  • 135,179
  • 13
  • 127
  • 222