
I have a $string in PHP. It doesn't matter where this comes from (it's from incoming e-mails); the important part is that sometimes, it's not valid UTF-8 according to PostgreSQL, but is valid according to PHP.

I explicitly set both mb_internal_encoding('UTF-8') and mb_regex_encoding('UTF-8'). I explicitly set client_encoding to 'UTF8' (yes, it wants it without the '-') when making the PostgreSQL database connection. I have verified over and over again that the PG database itself uses UTF8. In short: everything on my system uses UTF-8 encoding.

Details: PHP 7.4.1. PG 11.5. Windows 10. (The same thing has happened for years and years for many versions of PHP/PG/Windows.)

Before trying to INSERT a record containing $string, I make the following integrity/safety check to avoid errors:

function string_is_valid_UTF8($string)
{
    // mb_check_encoding() already returns a bool, so return it directly.
    return mb_check_encoding($string, 'UTF-8');
}

if (string_is_valid_UTF8($string)) {
    // Proceed to INSERT it into the database since PHP says it's valid UTF-8 data.
}

Occasionally -- NOT every time! -- PostgreSQL barks at this, even though it has been checked by PHP to be valid UTF-8. It spits out/logs this error:

pg_query_params(): Query failed: ERROR:  invalid byte sequence for encoding "UTF8"

I don't get it. The only explanation I can see is that PostgreSQL and PHP have different ideas of what is valid UTF-8. This problem has bugged me for years and I just never seem to get it resolved. Again and again, sometimes with weeks or months in between, some external data coming into my system causes this issue. In spite of my check!
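
One concrete way the two can disagree, offered as an assumption about the cause rather than a confirmed diagnosis: the NUL byte (0x00) is a valid UTF-8 code point, so `mb_check_encoding()` accepts it, but PostgreSQL text values cannot contain it. A minimal sketch of a stricter check follows. The helper name `string_is_safe_for_pg_text` is hypothetical, and `preg_match('//u', ...)` is used only as a second, independent UTF-8 validator; neither is part of the original code.

```php
<?php
// Hypothetical helper (not from the original post): a stricter pre-INSERT check.
function string_is_safe_for_pg_text(string $string): bool
{
    // First opinion: mbstring's UTF-8 validation, as in the original check.
    if (!mb_check_encoding($string, 'UTF-8')) {
        return false;
    }

    // Second opinion: with the /u modifier, preg_match() returns false
    // (not 0) when the subject is not valid UTF-8.
    if (preg_match('//u', $string) !== 1) {
        return false;
    }

    // Assumption about the cause: NUL is valid UTF-8, but PostgreSQL
    // text values cannot store it, so reject it explicitly.
    return strpos($string, "\0") === false;
}
```

If this check still lets a failing string through, the mismatch lies elsewhere, and the raw bytes of that string are needed to find it.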

Is there something I can tell PostgreSQL to make it handle this differently? I don't want that error to be logged. It's really, really annoying.

At this point, I'm utterly baffled as to how this can happen. Is PHP or PostgreSQL in the wrong? Considering how many times I've dealt with this and tried to solve it by a zillion different methods, it doesn't seem reasonable that it's me doing something wrong at this point.

  • Output the query and execute in postgres, does it work? Maybe the string is getting cut. – user3783243 Jan 10 '20 at 17:17
  • One of the comments on [the manual page](https://www.php.net/manual/en/function.mb-check-encoding.php#89286) for `mb_check_encoding()` might shed some light on the issue: _"This function does not check for bad byte sequence(s), it only checks if the byte stream is valid."_ – M. Eriksson Jan 10 '20 at 17:31
  • @MagnusEriksson I must be missing something. Okay, so that function is not properly coded. But then which one *is*? How do I actually check if the string is *actually* valid UTF-8? That is, no "bad byte sequences". The answer to this is repeatedly ignored by any threads I find when searching. –  Jan 10 '20 at 18:55
  • You wrote: 'NOT every time!'. Are you saying this happens occasionally when given identical input data? Or do you mean that you occasionally get input data where this happens? If so, what is that input data? Make it reproducible. That's the only way to find out what is going on. – clamp Jan 10 '20 at 23:30
  • @clamp Naturally, I mean that it happens "occasionally" as in with different input data. As buggy and weird as computers are, they simply don't answer differently with the same logic and input! (Unless it's made on purpose to be randomized, of course.) As for the exact input data, I don't save that, so I can't show it, but *why* would you need it? It's just invalid UTF-8, and that's what I want to detect. Why does nobody seem to understand what I'm asking? –  Jan 10 '20 at 23:49
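
The reproducibility request in the comments could be met without storing whole e-mails. The following is a sketch, assuming it is acceptable to log a hex dump of the first bytes of a string whenever the INSERT fails. The helper name `describe_bytes` and the 64-byte default are hypothetical.

```php
<?php
// Hypothetical helper: summarize a string's raw bytes so that a failing
// input can be reproduced and inspected later.
function describe_bytes(string $string, int $maxBytes = 64): string
{
    // substr() and strlen() work on bytes here, not characters,
    // which is what matters for an encoding problem.
    $head = substr($string, 0, $maxBytes);

    return sprintf('len=%d hex=%s', strlen($string), bin2hex($head));
}
```

Calling `error_log(describe_bytes($string))` in the branch where `pg_query_params()` returns false would record the bytes PostgreSQL rejected, which can then be compared against what `mb_check_encoding()` accepted.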
