
I'm receiving e-mails via IMAP with PHP.

Before trying to INSERT each new incoming e-mail message into my database, I make a basic check that the body text (both the plaintext and HTML versions, if both exist) is valid UTF-8; if it is not, I drop the message and skip all further processing. I do so with the following code, which I settled on after spending countless hours searching online and experimenting myself, literally for years:

function string_is_valid_UTF8($string)
{
    // mb_check_encoding() already returns a boolean, so return it directly.
    return mb_check_encoding($string, 'UTF-8');
}

Occasionally, this check doesn't seem to be enough: an e-mail slips through to the PHP code that INSERTs it into the PostgreSQL database table, and this error is logged:

pg_query_params(): Query failed: ERROR:  invalid byte sequence for encoding "UTF8": 0xa0:
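
For reference, the byte in that error, 0xa0, is the non-breaking space in ISO-8859-1 and Windows-1252. In UTF-8 it is only valid as a continuation byte (for example in the two-byte sequence 0xC2 0xA0), so a stray 0xa0 is malformed. The sketch below is not taken from the question: the helper names are hypothetical, and it only uses the built-ins mb_check_encoding() and preg_match() with the /u modifier as two independent checks. If both reject a stray 0xa0 on your PHP build, one possible explanation (an assumption, not something the post confirms) is that the value being checked is not byte-for-byte the value that is later passed to pg_query_params().

```php
<?php
// Hypothetical helper names, introduced only for this illustration.

function is_valid_utf8_mb(string $s): bool
{
    // Same check as string_is_valid_UTF8() in the question.
    return mb_check_encoding($s, 'UTF-8');
}

function is_valid_utf8_pcre(string $s): bool
{
    // With the /u modifier, preg_match() returns false (not 0 or 1)
    // when the subject is not well-formed UTF-8.
    return preg_match('//u', $s) === 1;
}

$utf8_nbsp  = "Hello\xC2\xA0world"; // non-breaking space encoded as UTF-8
$stray_a0   = "Hello\xA0world";     // the same character as a single Latin-1 byte

var_dump(is_valid_utf8_mb($utf8_nbsp));   // bool(true)
var_dump(is_valid_utf8_pcre($utf8_nbsp)); // bool(true)
var_dump(is_valid_utf8_mb($stray_a0));    // bool(false)
var_dump(is_valid_utf8_pcre($stray_a0));  // bool(false)
```

Running both checks side by side is only a cross-check; it does not prove where the bad bytes enter the pipeline.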

No matter what checks I make beforehand, some always slip through, logging that error. Again and again...

What exactly do I need to do to make sure it never happens?! What is wrong with the code I have? Why does PHP say it's valid UTF-8 when PostgreSQL says it isn't?! How is that even possible?

The latest e-mail, which prompted me to ask about this again, was a garbled spam message that only had an HTML part. It contains malformed UTF-8 somewhere. Of course, it doesn't matter what it contains, or what parsed it out like that. What matters is that PHP sees it as "OK" and PG sees it as "wrong", so that damn error is logged instead of the whole e-mail being silently ignored, as I want.

What am I doing wrong? This has been torturing me for a very long time now and I need to get it resolved once and for all!

  • Are you sure this isn't just a duplicate of: https://stackoverflow.com/questions/4867272/invalid-byte-sequence-for-encoding-utf8 and you haven't set the encoding of the DB to UTF-8? – Mike Guelfi Jan 10 '20 at 04:17
  • Yes, I am sure that it isn't a duplicate of that question. As I always am, because the "duplicates" are never what I ask about. –  Jan 10 '20 at 04:22
  • Yes, I am extremely sure that everything is using UTF-8: PHP, the DB connection, and the PG database. –  Jan 10 '20 at 04:23
  • @InsultExchange Well, if you are so certain, then the error cannot occur, and your problem is gone. – Laurenz Albe Jan 10 '20 at 07:08
  • @LaurenzAlbe Clearly it *can* occur, since it did? I don't follow your logic. –  Jan 10 '20 at 07:26
  • @InsultExchange Right. The logical conclusion is that you should not be so certain that everything involved is proper UTF-8, because obviously it isn't. To debug this, get a hold of the string that caused the problem and examine its bytes. Try feeding it to your function. – Laurenz Albe Jan 10 '20 at 07:35
  • @LaurenzAlbe I said I was sure because I am sure. All of it uses UTF-8. *You* are convinced that this must be the only possible error, and that it's my fault, and that I'm wrong about it since it doesn't fit into your idea of what the only possible explanation is. As for "try feeding it to your function", I don't understand what you mean. That's what I already did when the e-mail was received. That's when it was said to be valid UTF-8 by PHP but still wasn't. –  Jan 10 '20 at 07:40
  • And what do you mean by "examine its bytes"? Why? I already know that it's invalid UTF-8 since PG rejected it, so what do I need to do to make PHP properly detect it as such? What use is there in analyzing a broken string? I just need the code to be able to properly detect it as broken or valid -- it doesn't matter how exactly it is broken. –  Jan 10 '20 at 07:42
  • @InsultExchange There you go. Once you have identified the broken string, you can come up with a minimal sample of PHP code that reproduces the problem. Then you can ask a better question that does not involve PostgreSQL at all and has a good chance to get a good answer (if you don't solve the problem yourself along the way). This technique is known as "debugging" in the trade. – Laurenz Albe Jan 10 '20 at 07:49
  • What made you conclude that it's invalid UTF-8 and that the PHP function is in error? It could also be that it's _valid_ UTF-8 and that PostgreSQL is mistaken. Only by examining the suspect data manually can you tell which. – Mr Lister Jan 10 '20 at 12:46
  • The fact that PG says it's invalid is what made me conclude it. Even if PG is "mistaken", then what is the solution? Is PG pre-alpha software from 1992, or software that has been decades in the making by 2020? I mean, how can such an issue exist at all? And I don't keep the rejected e-mails, so I can't analyze them anyway. And I can't find any examples of broken UTF-8 strings online to test with, of course. Nothing works properly and I'm completely out of things to try at this point. –  Jan 10 '20 at 16:31
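
The debugging step suggested in the comments (get hold of the offending string and examine its bytes) could be sketched roughly as follows. This is a hedged illustration, not code from the thread: find_invalid_utf8_params() is a hypothetical helper, and the sample parameter array is made up. The idea is to validate every string parameter immediately before it would be handed to pg_query_params(), and to log a hex dump of anything that fails, so the bad bytes are captured even though the e-mail itself is not kept.

```php
<?php
// Hypothetical helper: returns a map of parameter index => hex dump
// for every string parameter that is not well-formed UTF-8.
function find_invalid_utf8_params(array $params): array
{
    $bad = [];
    foreach ($params as $index => $value) {
        // Only strings can carry malformed byte sequences; skip null, ints, etc.
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[$index] = bin2hex($value);
        }
    }
    return $bad;
}

// Made-up example parameters standing in for the values bound to the INSERT.
$params = ['A subject line', "HTML body with a stray byte \xA0 here", null, 42];

$bad = find_invalid_utf8_params($params);
foreach ($bad as $index => $hex) {
    // Logging the raw bytes as hex keeps the log itself free of malformed text.
    error_log("Parameter $index is not valid UTF-8; bytes: $hex");
}

if ($bad === []) {
    // Only at this point would the real code call pg_query_params().
}
```

Checking the full parameter list rather than just the two body parts matters if other fields (subject, sender name, headers) are inserted by the same query; whether that applies here is an assumption, since the INSERT itself is not shown.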
