I need to validate that a string (in a Django form) contains only "sensible" characters before sending it to an API that won't accept characters such as emojis. It's somewhat unclear which characters the API rejects, but I assume it will accept all "sensible" characters that can be typed on a normal keyboard.

That includes !"#€%&/() etc. and, for example, Swedish characters like åäö, but I want to raise a validation error when characters such as emojis are in the string.

Thanks for any help solving this.

saturnusringar
  • It is not emoji, right? – vahdet May 15 '19 at 13:06
  • You're right, it's probably called emoji. I'll edit the question. – saturnusringar May 15 '19 at 13:08
  • "chars like emoticons etc" is not well enough defined to be able to ask a computer to filter them. You need to figure out what exact rule you want to apply. – Louis Saglio May 15 '19 at 13:09
  • I guess it would work if I could filter the chars and allow every char that's on this list: http://www.fileformat.info/info/charset/UTF-8/list.htm (is it possible to filter by Encoded Byte somehow?). I just tested with some random, obscure characters from that list and the API accepted them. – saturnusringar May 15 '19 at 13:24
  • You want them to be UTF-8 characters which is an ascii string as far as I know. https://stackoverflow.com/questions/196345/how-to-check-if-a-string-in-python-is-in-ascii possible duplicate of that one. – Billy Ferguson May 15 '19 at 14:41
  • The linked answer did not work very well. is_ascii returns False for Swedish chars like åäö. – saturnusringar May 15 '19 at 14:57
  • then modify the `is_ascii()` function to check for latin characters: `ord(c) < 0xcaaf` – dirkgroten May 15 '19 at 16:24
  • oops I meant `ord(c) < 0x02af` – dirkgroten May 15 '19 at 16:31
  • @dirkgroten it seems to work quite well with my (simple, manual) tests now. These chars didn't pass the test: £∞≈, but I can live with that. They won't be entered here, so they are not "sensible". Thanks for the help. – saturnusringar May 15 '19 at 21:03
  • £∞≈ ... comes from the chars that I tested which are easily entered on my keyboard. – saturnusringar May 15 '19 at 21:11
  • You can also exclude some other ranges of course by adding more tests for `ord(c)`, just look at the UTF-8 list mentioned above. – dirkgroten May 16 '19 at 06:10
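
The approach from the comments can be sketched roughly as below. This is only a minimal sketch, not a confirmed description of what the API accepts: the `0x02af` bound comes from dirkgroten's comment, while the names `MAX_CODE_POINT`, `EXTRA_ALLOWED`, `disallowed_chars` and `is_sensible` are made up for illustration. The allow-list entry for `€` is an assumption, added because `€` (U+20AC) lies above the bound but appears among the characters the question wants to accept.

```python
# Bound taken from the comment thread: accept characters with ord(c) < 0x02af,
# which covers ASCII, Latin-1 and the Latin Extended / IPA blocks.
MAX_CODE_POINT = 0x02AF

# Hypothetical allow-list for characters above the bound (an assumption,
# not something the API is known to accept). Extend as needed.
EXTRA_ALLOWED = {"€"}


def disallowed_chars(value):
    """Return, in order, the characters of value that fail the check."""
    return [
        c for c in value
        if ord(c) >= MAX_CODE_POINT and c not in EXTRA_ALLOWED
    ]


def is_sensible(value):
    """True if every character is below the bound or explicitly allowed."""
    return not disallowed_chars(value)
```

In a Django form, something like this could be called from the field's `clean_<fieldname>()` method, raising `forms.ValidationError` when `disallowed_chars()` returns a non-empty list. Returning the offending characters makes it possible to name them in the error message.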
