
I'm using Google Translate to translate some text.

Sometimes Google Translate inserts non-printable characters into the translated text.

For example, go to this page: https://www.google.com/search?client=ubuntu&channel=fs&q=traduttore&ie=utf-8&oe=utf-8

Choose Italian to English and translate "leone marino".

The result will be:

sea ​​lion
   ^ there are two extra non-printable characters here, immediately before the "l"

You can test it by pasting the text anywhere it can be edited (a text editor, a text field on any web page, or even the browser's address bar). When you move through it with the arrow keys, the cursor stops two extra times next to the space character.

Leaving aside why these characters are inserted, how can I remove all of these non-printable characters using a regex in PHP and/or in Sublime Text?

Also, how can I see the Unicode code points of these characters?

user2342558
  • `preg_replace('~\p{Cf}+~u', '', $s)` to remove all [other format Unicode chars](http://www.fileformat.info/info/unicode/category/Cf/list.htm) or just `str_replace("\u{200B}", "", $s)` – Wiktor Stribiżew Jul 03 '19 at 07:48
  • @WiktorStribiżew, thanks, it seems to work. Can you also help me with the question "how to see the unicode version of these characters"? – user2342558 Jul 03 '19 at 07:51

1 Answer


To remove all Unicode characters in the "Other, format" (Cf) category, you may use:

$s = preg_replace('~\p{Cf}+~u', '', $s);

Since you only need to remove zero-width spaces (U+200B), you may just use:

$s = str_replace("\u{200B}", "", $s);
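To make the difference between the two calls concrete, here is a minimal sketch, assuming PHP 7.0+ (for the `\u{...}` escape) and a made-up sample string. The function names `stripFormatChars` and `stripZeroWidthSpace` are hypothetical wrappers introduced for illustration only; they are not part of PHP.

```php
<?php
// Hypothetical helper: remove every character in the Unicode "Cf" (format) category.
// Falls back to the input if preg_replace() fails (e.g. on invalid UTF-8).
function stripFormatChars(string $s): string
{
    return preg_replace('~\p{Cf}+~u', '', $s) ?? $s;
}

// Hypothetical helper: remove only U+200B ZERO WIDTH SPACE.
function stripZeroWidthSpace(string $s): string
{
    return str_replace("\u{200B}", '', $s);
}

// Made-up sample: two zero-width spaces plus one U+200E LEFT-TO-RIGHT MARK,
// which also belongs to the Cf category.
$sample = "sea \u{200B}\u{200B}li\u{200E}on";

var_dump(stripFormatChars($sample) === 'sea lion');             // bool(true)
var_dump(stripZeroWidthSpace($sample) === "sea li\u{200E}on");  // bool(true)
```

The `\p{Cf}` version is the broader net: it also catches other invisible format characters that may appear in translated text, whereas `str_replace` leaves everything except U+200B untouched.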

I use https://r12a.github.io/app-conversion/ (no affiliation) to check for hidden characters in strings.

The following PHP code converts a string to its \uXXXX representation, so you can quickly see the Unicode code points of any characters outside printable ASCII:

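// Two U+200B characters sit between the space and the "l" (written as escapes, PHP 7.0+).
$input = "sea \u{200B}\u{200B}lion";
// Replace every character outside printable ASCII (space to ~) with its JSON \uXXXX escape.
echo preg_replace_callback('#[^ -~]#u', function ($m) {
    // json_encode() wraps the escape in double quotes; substr() strips them.
    return substr(json_encode($m[0]), 1, -1);
}, $input);
// => sea \u200b\u200blion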
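If you would rather get a list of the offending code points than an escaped copy of the whole string, something along these lines may help. It is only a sketch: it assumes PHP 7.2+ with the mbstring extension (for `mb_ord`), and `hiddenCodePoints` is a hypothetical name, not a built-in function.

```php
<?php
// Hypothetical helper: return the code points of all characters outside
// printable ASCII (space to ~) as "U+XXXX" labels, in order of appearance.
function hiddenCodePoints(string $s): array
{
    // preg_match_all() returns 0 when nothing matches and false on error
    // (e.g. invalid UTF-8); both cases yield an empty list here.
    if (!preg_match_all('/[^ -~]/u', $s, $matches)) {
        return [];
    }

    return array_map(static function (string $ch): string {
        // mb_ord() gives the numeric code point of a single UTF-8 character.
        return sprintf('U+%04X', mb_ord($ch, 'UTF-8'));
    }, $matches[0]);
}

print_r(hiddenCodePoints("sea \u{200B}\u{200B}lion"));
// Array
// (
//     [0] => U+200B
//     [1] => U+200B
// )
```

The `U+XXXX` labels can be looked up directly in any Unicode character reference.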
Wiktor Stribiżew
  • FYI: https://3v4l.org/JRtrP ([modified code from this post](https://stackoverflow.com/questions/40139833/convert-unicode-symbols-to-uxxxx-not-using-json-encode)) can help to find these chars in PHP. – Wiktor Stribiżew Jul 03 '19 at 08:13
  • Thanks for the code. I'll use http://phpfiddle.org, which lets you paste text containing non-printable chars and shows them as red dots. – user2342558 Jul 03 '19 at 08:28