
I am new to the Google Translate API, and during testing we noticed that some translations (I have not been able to find a pattern yet) contain \u200b characters in the response. That causes a lot of issues, and above all it does not seem to serve any purpose or make any sense. A simple example:

https://www.googleapis.com/language/translate/v2?key=YOURKEY&source=NL&target=EN&q=Hergeneer%20verkopen

returns:

{
 "data": {
  "translations": [
   {
    "translatedText": "Sell \u200b\u200bHerge Down"
   }
  ]
 }
}

Our software stumbles over these \u200b strings/characters and I have not found a way to prevent them or get rid of them.

Cœur
Peter
  • Possible duplicate of [What's HTML character code 8203?](https://stackoverflow.com/questions/2973698/whats-html-character-code-8203) – Cœur Aug 06 '18 at 12:44

1 Answer


Please read the documentation of the JSON format: https://json.org/

> A string is a sequence of zero or more Unicode characters.
> A char is either any Unicode character except " or \ or control-character,
> [...]
> or it is \u followed by four hex-digits.

Your response falls into this last case: \u followed by four hex digits is an escape sequence for a single Unicode character, here Unicode Character 'ZERO WIDTH SPACE' (U+200B). It even has its own Wikipedia page, "Zero-width space", and its own Stack Overflow question: [What's HTML character code 8203?](https://stackoverflow.com/questions/2973698/whats-html-character-code-8203).
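To make that concrete, here is a small Python sketch (Python is my choice for illustration; the question does not say which language you use). It feeds the response body from the question to a standard JSON parser and shows that each \u200b escape decodes to one invisible character. The variable names are just for this example.

```python
import json
import unicodedata

# Response body copied from the question. The raw string keeps "\u200b"
# as six literal characters, exactly as they travel over the wire.
raw = r'{"data": {"translations": [{"translatedText": "Sell \u200b\u200bHerge Down"}]}}'

# Any conforming JSON parser turns each \uXXXX escape into one character.
text = json.loads(raw)["data"]["translations"][0]["translatedText"]

zwsp_count = text.count("\u200b")          # occurrences of U+200B
zwsp_name = unicodedata.name("\u200b")     # official Unicode name
zwsp_category = unicodedata.category("\u200b")  # "Cf" = format character

print(repr(text), zwsp_count, zwsp_name, zwsp_category)
```

So if your software sees a literal backslash followed by u200b, it is probably handling the raw JSON text instead of the parsed string.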

Now, there are plenty of Unicode characters with special behaviors, and this is one of them: an invisible one, among others. So you need to be aware of how Unicode works, and you should sanitize input and output from third-party APIs (and from user input as well).

Just define the set of characters that you actually want to support, and be sure to strip or filter out everything else. For instance, if you want to support NL and EN, you could strip whatever is outside the Latin script in Unicode.
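A minimal Python sketch of such a filter, under my own assumptions: the function names `is_allowed` and `sanitize` are hypothetical, and since the standard library does not expose the Unicode Script property, "Latin" is approximated by checking for the word LATIN in the character's Unicode name. Adjust the allowed set to what your application really needs.

```python
import unicodedata

def is_allowed(ch: str) -> bool:
    """Hypothetical allow-list: Latin letters, digits, punctuation, symbols, spaces."""
    if ch in "\n\t":
        return True
    category = unicodedata.category(ch)
    if category.startswith("L"):
        # Approximation of "Latin script": stdlib has no script lookup.
        return "LATIN" in unicodedata.name(ch, "")
    # N = numbers, P = punctuation, S = symbols, Zs = ordinary spaces.
    # U+200B is category Cf (format), so it is rejected here.
    return category[0] in "NPS" or category == "Zs"

def sanitize(text: str) -> str:
    # NFC first, so "e" + combining accent becomes one Latin letter
    # instead of a letter plus a mark that the filter would drop.
    normalized = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in normalized if is_allowed(ch))

cleaned = sanitize("Sell \u200b\u200bHerge Down")
print(repr(cleaned))
```

If an allow-list is too aggressive for your data, a narrower alternative is to remove only the one character you are seeing, with `text.replace("\u200b", "")`.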

Stripping the U+200B that you're encountering and other undesirable characters may save you from potential surprises like with:

Cœur
  • It's probably not safe to say "we never want U+200B anywhere", but it would be convenient to have a process for removing it selectively. There are a few characters like this which are invisible and/or hard to notice if you are inattentive, and which are just a consequence of having humans do the work. Sometimes they copy/paste stuff from somewhere; sometimes they type a different character than they are supposed to because they don't know the difference. – tripleee Aug 06 '18 at 13:34
  • Thanks for the heads-up; I'll try to come up with a workaround. :-} – tripleee Aug 06 '18 at 16:45