I have domain names being submitted with characters like \u8236, but every time it is something else. How can I safely remove all the bad characters without knowing which ones are there?
Mark Rotteveel

realPro
-
Why not encode the url before submission? – Jamshaid K. Feb 21 '21 at 09:53
-
Does this answer your question? [How to recognize if a string contains unicode chars?](https://stackoverflow.com/questions/4459571/how-to-recognize-if-a-string-contains-unicode-chars) – Jamshaid K. Feb 21 '21 at 09:57
-
`I have domain names being submitted` Why are people submitting domain names to you? To what end? – mjwills Feb 21 '21 at 11:40
-
If you are getting an HTTP request, you should have a separate page for each language. The language should be in the HTTP header. Once you know the language, you can apply an encoding associated with that language. – jdweng Feb 21 '21 at 14:18
-
You should read about localized domains. Characters from various character sets should be allowed (there are also localized top-level domains). But before transmitting them, you should translate them into "ASCII-like" domain names (and back again to display them to users). Check how browsers allow such non-ASCII domain names. – Giacomo Catenazzi Feb 22 '21 at 08:10
-
@GiacomoCatenazzi Thank you. Yes, I already understand that localized domains are a major obstacle that prevents me from simply removing all Unicode characters. – realPro Feb 23 '21 at 21:22
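
One way to combine the two points raised in the comments (keep legitimate localized characters, but translate to an "ASCII-like" form before use) is sketched below in Python. It is only an illustration, not a complete validator. The function names `strip_invisible` and `to_ascii_domain` are hypothetical, and the choice of which Unicode general categories count as "bad" is an assumption. Note that if `\u8236` in the question is a decimal code point, it would correspond to U+202C, a bidirectional formatting character in category `Cf`; that reading is also an assumption. The conversion uses Python's built-in `idna` codec, which follows the older IDNA 2003 rules, so results can differ from what current browsers do.

```python
import unicodedata

# Assumption for this sketch: characters in these general categories are never
# wanted in a submitted domain name.
#   Cc = control, Cf = format (e.g. bidi marks such as U+202C),
#   Cs = surrogate, Co = private use, Cn = unassigned
# Caveat: Cf also contains U+200C/U+200D, which some scripts use legitimately.
STRIP_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}


def strip_invisible(text: str) -> str:
    """Drop characters whose Unicode general category is in STRIP_CATEGORIES."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in STRIP_CATEGORIES
    )


def to_ascii_domain(domain: str) -> str:
    """Clean a submitted domain, then convert it to its ASCII (Punycode) form.

    Raises UnicodeError if the cleaned name is still not a valid IDNA 2003 name
    (for example, an empty or over-long label).
    """
    cleaned = strip_invisible(domain.strip())
    return cleaned.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_domain("exam\u202cple.com"))      # example.com
    print(to_ascii_domain("b\u00fccher.example"))    # xn--bcher-kva.example
```

The point of filtering by category rather than by "is it ASCII" is that letters from other scripts survive, while invisible formatting and control characters are dropped regardless of which specific ones show up. Anything the codec still rejects raises `UnicodeError`, which can be treated as "reject this submission".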