
The character B can have either of these encodings:

HEX code = 0x42 (real letter) or 0x412 (fake letter)

DEC code = 66 (real letter) or 1042 (fake letter)

HTML with named char ref = B (real letter) or &Vcy; (fake letter)

Java string = B (real letter) or \u0412 (fake letter)

When I fetch content from a remote URL with cURL, both variants are displayed as the letter B on macOS, but the character may not actually be the Latin letter B. I use this online tool to check whether a letter is the real one or not.
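
The same check can probably be done in PHP instead of an online tool. The following is only a rough sketch: it assumes the content is valid UTF-8 and that the `iconv` extension is available, and `listNonAscii` is a hypothetical helper name, not an existing PHP function.

```php
<?php
// Hypothetical helper: list every distinct non-ASCII character in a UTF-8
// string together with its Unicode code point.
function listNonAscii(string $text): array
{
    $found = [];
    // The /u modifier makes PCRE treat the subject as UTF-8,
    // so each match is one whole character rather than a single byte.
    preg_match_all('/[^\x00-\x7F]/u', $text, $matches);
    foreach (array_unique($matches[0]) as $char) {
        // Convert the character to big-endian UTF-32 (4 bytes) and read it
        // as an unsigned 32-bit integer to get the code point.
        $codePoint = unpack('N', iconv('UTF-8', 'UTF-32BE', $char))[1];
        $found[$char] = sprintf('U+%04X', $codePoint);
    }
    return $found;
}

// "\u{0412}" is the Cyrillic letter that looks like a Latin B.
print_r(listNonAscii("\u{0412}250C")); // reports the first letter as U+0412
```

A string typed with the Latin B gives an empty array, so any non-empty result points at characters worth inspecting.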

This code fixes it for one letter:

$content = str_replace("В", "B", $content);

But how can I do the same in PHP for all the other illegal characters?

kostya572
  • Are "illegal/fake" characters any outside of ASCII? – user3783243 Jun 18 '19 at 01:56
  • That's not a "fake" character. It's a Cyrillic one that is pronounced as "V" but happens to look exactly like the Latin B character. You can't magically convert these ... if you want only Latin letters, then you might want to restrict your character set to ASCII (or a similar one that doesn't include all alphabets on the planet). – Narf Jun 18 '19 at 02:04
  • @user3783243 Yes, I think the fake chars are outside of ASCII. But is there any command that can transliterate them? When I search an Excel file for the product SKU code "B250C", it can't be found because of the incorrect first character B. If I retype "B250C" from the keyboard, Excel finds the line. – kostya572 Jun 18 '19 at 02:05
  • @Narf Wow, yes you are right! I forgot that the Russian language has the same letter) – kostya572 Jun 18 '19 at 02:07
  • Are all the characters you want to convert known? An uppercase Greek beta has the same issue. A `B` is not the same as a beta; they have very different meanings in science. – user3783243 Jun 18 '19 at 02:08
  • @user3783243 yes, they're known. Here are the letters that look the same in Cyrillic and Latin: Е, Н, У, Х, В, А, О, С, М, Т – kostya572 Jun 18 '19 at 02:12
  • Put them in an array and use `str_replace`. – user3783243 Jun 18 '19 at 02:14
  • @user3783243 you or Narf could answer the question and I will accept it. – kostya572 Jun 18 '19 at 02:17
  • Possible duplicate of [Cyrillic transliteration in PHP](https://stackoverflow.com/questions/7461406/cyrillic-transliteration-in-php) – user3783243 Jun 18 '19 at 02:25
  • You are talking about Unicode confusables. https://www.unicode.org/Public/security/latest/confusables.txt But why do you say that characters are "illegal?" – Tom Blodget Jun 18 '19 at 08:29
  • @TomBlodget I wrote it that way to make the question easier to understand. You are right, it's a valid UTF-8 character (outside ASCII) that can be confused with another one. – kostya572 Jun 18 '19 at 12:42
  • You forgot Р and К; also, depending on the font used, the lowercase versions of Н and И can be confused for lowercase versions of N and U respectively, and З (pronounced like a "Z") looks a lot like the number 3. Plus, I think Greek had a character that looked exactly like a semicolon, and there surely are many more - again, you can't rely on a blacklist. Transliteration, on the other hand, would convert В to V, Х to H, etc. Sorry about not writing this as an answer ... I don't feel like writing it out properly right now; up for grabs. – Narf Jun 18 '19 at 12:54
  • @Narf I agree with you. I also forgot about the Ukrainian character "і", whose uppercase form looks the same as "I". – kostya572 Jun 18 '19 at 13:02
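
The suggestion in the comments (put the known letters in an array and replace them) could be sketched roughly as below. The map covers only the uppercase Cyrillic letters named in this thread and is not a complete confusables table. `CYRILLIC_LOOKALIKES`, `replaceLookalikes` and `hasNonAscii` are hypothetical names chosen for illustration. The sketch uses `strtr` with an array rather than `str_replace`, which should behave the same for single-character keys. Since a fixed list cannot be relied on to catch everything, the second helper flags any non-ASCII bytes that remain.

```php
<?php
// Illustrative map: uppercase Cyrillic letter => Latin letter it resembles.
// Only the letters mentioned in the thread are listed; this is an assumption
// about what needs replacing, not an exhaustive list.
const CYRILLIC_LOOKALIKES = [
    "\u{0410}" => 'A', // Cyrillic A
    "\u{0412}" => 'B', // Cyrillic Ve
    "\u{0415}" => 'E', // Cyrillic Ie
    "\u{041A}" => 'K', // Cyrillic Ka
    "\u{041C}" => 'M', // Cyrillic Em
    "\u{041D}" => 'H', // Cyrillic En
    "\u{041E}" => 'O', // Cyrillic O
    "\u{0420}" => 'P', // Cyrillic Er
    "\u{0421}" => 'C', // Cyrillic Es
    "\u{0422}" => 'T', // Cyrillic Te
    "\u{0423}" => 'Y', // Cyrillic U
    "\u{0425}" => 'X', // Cyrillic Ha
    "\u{0406}" => 'I', // Ukrainian/Belarusian I
];

// Replace every mapped Cyrillic letter with its Latin lookalike.
function replaceLookalikes(string $text): string
{
    // strtr() with an array substitutes all keys in one pass
    // and never re-scans text it has already replaced.
    return strtr($text, CYRILLIC_LOOKALIKES);
}

// True if the string still contains any byte outside the ASCII range.
function hasNonAscii(string $text): bool
{
    return preg_match('/[^\x00-\x7F]/', $text) === 1;
}

// SKU whose first and last letters are Cyrillic lookalikes.
$sku = replaceLookalikes("\u{0412}250\u{0421}");
var_dump($sku === 'B250C');   // bool(true)
var_dump(hasNonAscii($sku));  // bool(false)
```

Note that this replaces by appearance, not by sound. Transliteration, as mentioned above, would instead turn В into V and Х into H, so it is not suitable for repairing SKU codes.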
