0

I am building a web app that allows users to import CSV files into the database. It parses those uploads into string values. I had a case where I uploaded two different files that had the string "New Hartford". When I ran a SELECT statement to get all instances of this string, I noticed that what appears to be an identical string value results in pulling up different rows in the query results:

SELECT * from divisions WHERE name = 'New Hartford'; 

SELECT * from divisions WHERE name = 'New Hartford';

Upon further investigation I found that it was the whitespace character between "New" and "Hartford". Apparently there is something different about one of them. I assume it is a character encoding issue, a control character, or perhaps a different form of a whitespace character. Two questions: [1] How can I detect these variations either before or after I insert the data? [2] How can I normalize white-space characters efficiently and elegantly before or during insert? (I'd ideally like a quick function in the insert statement itself, I think.)

For extra context, I am using MySQL Server 8.0, with the following charsets and collations:

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
[result]
character_set_client        utf8
character_set_connection    utf8
character_set_database      utf8mb4
character_set_filesystem    binary
character_set_results       utf8
character_set_server        utf8mb4
character_set_system        utf8
collation_connection        utf8_general_ci
collation_database          utf8mb4_0900_ai_ci
collation_server            utf8mb4_0900_ai_ci
Adam Friedman
  • 520
  • 6
  • 20

0 Answers0