
I'm trying to encode strings like Hügelkultur in PHP to H&#xFC;gelkultur.

I've tried things like `htmlentities(str)` and `htmlentities(str, ENT_XML1)`, but they leave the string unchanged. `urlencode(str)` gave me H%C3%BCgelkultur, but that isn't what I'm trying to get.

What function should I use? Does that type of encoding have a name?

GFL
  • Is there a file involved, or other source? Taking a look at some of your other questions, seen one being db-related; is this related to that also? – Funk Forty Niner Aug 28 '19 at 23:46
  • Because if you specify your encoding properly you don't have to do this at all. https://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Sammitch Aug 29 '19 at 00:11
  • @FunkFortyNiner it is related. I was able to get it out of the SQLite database in one format but not in the exact way I need it to be. I was hoping to use PHP which is more versatile to modify the file from the database to get where it needs to be. – GFL Aug 29 '19 at 02:48

1 Answer


There's no built-in for this because you only have this problem if you're doing other, more important things incorrectly and this just papers over them.

See: UTF-8 all the way through (https://stackoverflow.com/questions/279170/utf-8-all-the-way-through)

But if you're committed to not actually fixing that and making your application more difficult to maintain, you could use the following to encode UTF-8 codepoints above 127 as HTML entities:

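// Requires the mbstring extension; mb_ord() needs PHP 7.2+ and mb_str_split() needs PHP 7.4+.
function force_utf8_entities($input) {
    return implode('', array_map(
        function($a){
            // A character that takes more than one byte in UTF-8 is outside ASCII.
            if( strlen($a) > 1 ) {
                // Emit its code point as a hexadecimal numeric entity, e.g. &#xFC;
                return sprintf("&#x%X;", mb_ord($a));
            }
            return $a;
        },
        // Split into whole characters rather than individual bytes.
        mb_str_split($input)
    ));
}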

$input = "Hügelkultur";
var_dump(
    force_utf8_entities($input)
);
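For PHP versions that lack `mb_str_split()`, roughly the same result can be had with a regex callback. This is only a sketch: the function name `force_utf8_entities_compat` is made up for illustration, it assumes the input is valid UTF-8, and it still needs the mbstring extension for `mb_convert_encoding()`.

```php
<?php
// Hypothetical variant of force_utf8_entities() that avoids mb_str_split() and mb_ord().
function force_utf8_entities_compat($input) {
    $out = preg_replace_callback(
        // The /u flag makes PCRE read $input as UTF-8, so this class matches
        // one whole non-ASCII character at a time, not single bytes.
        '/[^\x00-\x7F]/u',
        function ($m) {
            // Re-encode the character as 4-byte big-endian UCS-4 and read
            // those bytes as an unsigned integer: that is the code point.
            $parts = unpack('N', mb_convert_encoding($m[0], 'UCS-4BE', 'UTF-8'));
            return sprintf('&#x%X;', $parts[1]);
        },
        $input
    );
    // preg_replace_callback() returns null on invalid UTF-8; fall back to the input.
    return $out === null ? $input : $out;
}

echo force_utf8_entities_compat("H\xC3\xBCgelkultur"), "\n"; // H&#xFC;gelkultur
```

The input literal spells ü as its two UTF-8 bytes so the example does not depend on how the source file itself is saved.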

It's also worth noting that there's no such thing as "non-lower ASCII", as every byte with a value above 127 is entirely at the mercy of the declared encoding. UTF-8, ISO8859-X, and MS cpXXXX encodings will all hotly disagree about what those bytes represent on the screen.

This is where the term "7-bit safe" comes from, because no matter how badly you muck up your encodings in transit, you can be reasonably sure that bytes below 128 will make it through.

edit

"Extended ASCII" still is not a thing.

If you display a byte above 127, the symbol presented on the screen will differ depending on the encoding it is interpreted as. People with Western European alphabets are somewhat coddled because our funny accented letters tend to be the defaults [ISO8859-1 and cp1252], but when you switch to Eastern European charsets [ISO8859-5 and cp1251] the byte FC shows up as ќ (ISO8859-5) or ь (cp1251) instead of ü.
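A small sketch of that disagreement, assuming the mbstring extension is loaded and a UTF-8 terminal: the same single byte FC is decoded under three different declared encodings. The variable names are just for illustration.

```php
<?php
$byte = "\xFC"; // one raw byte, value 252

// Interpret the same byte under three different single-byte encodings
// and convert the result to UTF-8 for display.
$latin1  = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');
$cyrIso  = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-5');
$cyr1251 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1251');

echo "ISO-8859-1:   $latin1\n";  // ü
echo "ISO-8859-5:   $cyrIso\n";  // ќ
echo "Windows-1251: $cyr1251\n"; // ь
```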

It's worth noting that the FC in &#xFC; is not a byte value; it is the Unicode code point, before any encoding is applied. Again, users of Western European alphabets are spoiled, and frequently confused, by the overlap in the code point space. U+00FC encoded as UTF-8 is the literal two-byte sequence C3 BC, hence your urlencode() output.
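The code point versus encoded bytes distinction can be checked directly. A minimal sketch, assuming PHP 7.2+ with mbstring (for `mb_ord()`) and using the `"\u{...}"` escape so the result does not depend on the source file's encoding:

```php
<?php
$char = "\u{FC}"; // ü, written as a code point escape (PHP 7.0+)

$codePoint = mb_ord($char, 'UTF-8');       // the abstract code point
$utf8Hex   = strtoupper(bin2hex($char));   // the bytes actually stored
$urlForm   = urlencode($char);             // those same bytes, percent-encoded

printf("code point:  U+%04X\n", $codePoint); // U+00FC
printf("UTF-8 bytes: %s\n", $utf8Hex);       // C3BC
printf("urlencode:   %s\n", $urlForm);       // %C3%BC
```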

Really, the truth is that there's no such thing as "ASCII" at all. It's just that most non-Asian encodings tend to agree that it's easier to leave the traditional first 128 byte values the same everywhere so as not to freak out the English people.

Sammitch
  • Just thought I'd say that I liked what you submitted. – Funk Forty Niner Aug 29 '19 at 00:51
  • Senpai has noticed me. (๑•́ω•̀๑) – Sammitch Aug 29 '19 at 00:58
  • Thank you! I was searching for the term extended ASCII but it wasn't coming to mind. I'm trying to wrap my head around the code. I'm not able to use mb_str_split() (undefined function). I see it's multibyte. That's going to be pretty much required for what I need, isn't it? The formatting mismatch is being driven by data from two different sources (openzim and sqlite) that need to be matched up for a different destination (graphviz). If it was my code I could fix the problems there, but I'm limited to the output I can get. – GFL Aug 29 '19 at 03:08
  • `mb_str_split()` is available only since PHP 7.4 – Dharman Nov 10 '19 at 12:01