0

This is a customer database, the "Name" field is a customer input. The green one is as it should be, regular font. We found a few instances of these special characters, and I don't know if they can harm the DB in any way or cause API issues later so they need to be sanitized.

What are these special characters, and is there any regex we can use on a WHERE clause to find all entries?

Name column on SQL database, highlighted green is the correct one, all other 4 are special characters we may need to sanitize.

NAME:




Amauri (correct)

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
recacon
  • 73
  • 6
  • 3
    welcome to the unicode world – Iłya Bursov Oct 18 '22 at 15:22
  • Can you view the data as hex and see if indeed there are special characters in there? – Gary_W Oct 18 '22 at 15:26
  • 1
    As SQL Server doesn't natively support Regex, then you might be hard pressed. As for what the characters are, some fonts have multiple type faces built into them and it appears that some of your users have decided that they wanted to use those characters when choosing their name. *Why* they wanted to do that... Well, you didn't stop them, would be the obtuse answer. – Thom A Oct 18 '22 at 15:26
  • @Larnu "Font" and "type face" are generally synonymous, so "some fonts have multiple type faces built into them" makes no sense. What's actually happening here is that some fonts have *letter-like symbols built into them*, intended for use in contexts like mathematical formulas. It's really no different from writing your name as "L4RNU", or "ARN". – IMSoP Oct 18 '22 at 15:31

1 Answers1

0

That isn't really a font, that is UNICODE.

I bet the datatype in the field is something like NVARCHAR().

That N stands for UNICODE support.

Here is a good introduction on the difference between VARCHAR and NVARCHAR:

https://www.sqlshack.com/sql-varchar-data-type-deep-dive/

So to answer your question: You don't need to sanitize that data, unless some applications using the database cannot handle UNICODE. In that case it might be better to update those applications to something with UNICODE support.

Erwin Moller
  • 2,375
  • 14
  • 22
  • Thank you! Using this solution, I was able to find all Non-ASCII names: https://stackoverflow.com/questions/29819200/detect-unicode-characters-that-are-not-ascii-in-table – recacon Oct 18 '22 at 17:57