
I have a strange problem. The text I get from Google Cloud Vision contains non-English characters where the source image has plain English ones. This is a mistake by the Google Cloud Vision OCR.

For example, I get a word like this: Héllo

Notice that é is a non-English character.

I want to convert it into plain "Hello" so I can process the word.

I'm not looking for a full programming answer, just for ways to approach this.

Any hint would be useful.

Thanks!

1 Answer


If Apache Commons Lang is an option for you, you could use its StringUtils class. The stripAccents method should suit your needs. From the source code you can see that it relies on java.text.Normalizer, so you could also look into that.
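If you would rather not pull in a dependency, here is a minimal sketch of the same idea using only java.text.Normalizer. The class name AccentStripper and its stripAccents helper are made up for this illustration; this is not the Commons implementation, just the general technique of decomposing the text and dropping the combining marks.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration; not part of any library.
class AccentStripper {

    // Returns the input with combining accent marks removed, or null for null input.
    static String stripAccents(String input) {
        if (input == null) {
            return null;
        }
        // NFD decomposition splits "é" into "e" followed by U+0301 (combining acute accent).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks; removing them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripAccents("Héllo")); // prints "Hello"
    }
}
```

Note that this only handles characters that decompose into a base letter plus a mark. Letters such as "ø" or "ł" have no such decomposition and pass through unchanged, so they would need an explicit mapping if your OCR output contains them.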

vox
Thanks a lot! I didn't know the term "strip accents"; I found the answer after searching for it. Here is the answer for Python: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string – Abhishek Deshkars Jun 10 '20 at 05:54