I am building a web app that allows users to import CSV files into the database. It parses those uploads into string values. I had a case where I uploaded two different files that had the string "New Hartford". When I ran a SELECT statement to get all instances of this string, I noticed that what appears to be an identical string value results in pulling up different rows in the query results:
SELECT * from divisions WHERE name = 'New Hartford';
SELECT * from divisions WHERE name = 'New Hartford';
Upon further investigation I found that it was the whitespace character between "New" and "Hartford". Apparently there is something different about one of them. I assume it is a character encoding issue, a control character, or perhaps a different form of a whitespace character. Two questions: [1] How can I detect these variations either before or after I insert the data? [2] How can I normalize white-space characters efficiently and elegantly before or during insert? (I'd ideally like a quick function in the insert statement itself, I think.)
For extra context, I am using MySQL Server 8.0, with the following charsets and collations:
SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
[result]
character_set_client utf8
character_set_connection utf8
character_set_database utf8mb4
character_set_filesystem binary
character_set_results utf8
character_set_server utf8mb4
character_set_system utf8
collation_connection utf8_general_ci
collation_database utf8mb4_0900_ai_ci
collation_server utf8mb4_0900_ai_ci