0

PHP: 7.2.5 Laravel: 7.25

We have a bug where a very small number of users are trying to insert copy with the '' character included. I'm assuming this is because of a copy and paste from a PDF, I have seen them before with line breaks. This produces the following error:

SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF4\x8F\xB0\x80</...' for column 'body' at row 1 (SQL: update `post` set `body` = <p></p>, `body_raw` = , `post`.`updated_at` = 2020-10-06 10:34:22 where `id` = 1)

Character '':

  • Decimal Character Codes: 56319, 56320
  • Hexadecimal Character Codes: 0xdbff, 0xdc00
  • HTML with named character references: &#xdbff; &#xdc00;

Looking at Google, a suggestion is that you could update the DB encoding from utf8 to utf8mb4. This is probably the optimal solution, but we have a large database and I'm uneasy amending the encoding (though this may be very safe). I'm concerned about possible data loss/corruption.

As this issue is only appearing on this 1 character in our bug system, and its 100% not required, I'm inclined to just remove it before saving it in the database, to create the minimum changes.

I'm inclined to do the following:

str_replace("","", $post);

But if I paste the character '' into any of my code editors it disappears (I assuming utf8 encoding). What would the best way to accomplish this?

alexmcfarlane
  • 1,016
  • 2
  • 13
  • 33
  • whats your database collation? I think its not `utf8mb4_unicode_ci` – STA Oct 06 '20 at 10:58
  • How is `\xF4\x8F\xB0\x80` the same as `0xdbff, 0xdc00`? – 04FS Oct 06 '20 at 11:00
  • 1
    _“But if I paste the character '' into any of my code editors it disappears”_ - as an alternative, you can use `chr` to create the correct byte sequence from the (decimal) integer byte values. – 04FS Oct 06 '20 at 11:05
  • @sta the collation of the body field is 'utf8_unicode_ci' – alexmcfarlane Oct 06 '20 at 11:55
  • @04FS From the MySQL error statement, I copied the character out '' and pasted it into this website, http://www.mauvecloud.net/charsets/CharCodeFinder.html. If I post that same character from the frontend I get the same error. – alexmcfarlane Oct 06 '20 at 11:57
  • on your `config/database.php` you can try with make `'strict' => true,` to `'strict' => false`, – STA Oct 06 '20 at 12:09
  • @04FS That is the kind of solution I was thinking, but I'm struggling to get it to work. If I go; dd($post->body); = "

    " then to test for str_replace. dd(strpos($post->body, chr(56320))); = false
    – alexmcfarlane Oct 06 '20 at 12:14
  • @sta Thanks, but for this solution, I'm looking to try and just remove the character before getting to the database. – alexmcfarlane Oct 06 '20 at 12:16
  • then use `preg_replace` instead of `str_replace`, ex : `preg_replace('/[^A-Za-z0-9\-]/', '', $post);` it will remove all special character from your string – STA Oct 06 '20 at 12:21
  • 1
    Make a debug output of `urlencode($post->body)`, the percent encoding in that place will help check what byte values you are actually dealing with. – 04FS Oct 06 '20 at 12:30
  • @sta thanks again, but I just want to remove that 1 special character. – alexmcfarlane Oct 06 '20 at 12:38
  • Its working here https://rextester.com/VCIDNY41598 – STA Oct 06 '20 at 13:03
  • 1
    @sta that's exactly what I'm trying to accomplish, but if you copy your code into either PhpStorm or Sublime text the special character disappears. I've found a solution, written up below. But thanks for your help. – alexmcfarlane Oct 06 '20 at 13:14

1 Answers1

0

With great help from @04FS (thanks). I have found a solution. As mentioned, I think the database utf8 to utf8mb4 fix is probably the best route here. But as not to amend the database, here is the solution I have found.

The main confusing issue here is with the character "". As I can not enter it into my text editors it was hard to work with. So I relied on 3rd party sites to encode it. One suggestion was to use char() to be able to write and match the character. But on 2 different websites, the character code came out both char(111) and char(244). With char(244) I was able to use str_replace, but it only created a partial replacement and broke the SQL query.

@04FS mentioned trying urlencode() which gave me '%F4%8F%B0%80' for that character. This matches the database error. So the following solution works correctly:

  private function removeSpecialCharacters($str) {
      $str = str_replace(urldecode('%F4%8F%B0%80'), '', $str);
      return $str;
  }
alexmcfarlane
  • 1,016
  • 2
  • 13
  • 33