0

I have a program written in C#.net that downloads our Amazon.com orders and stores them in our local databases.

I ran into an issue where a customer who purchased a product entered a utf8 character (℅) - (\xe2\x84\x85) into an address. This seems like a pretty reasonable thing to do, but my program choked when it ran across this order until I put in the following fix.

                        //get the Address2 subnode
                    XmlNode Address2Node = singleOrder.SelectSingleNode("ShippingAddress/AddressLine2");
                    if (Address2Node != null)
                    {
                        GlobalClass.Address2 = Address2Node.InnerXml;
                        //** c/o Unicode character messed up program.
                            if (GlobalClass.Address2.Contains("℅"))

                            {
                                GlobalClass.Address2 = GlobalClass.Address2.Replace("℅", "c/o");
                            // Console.WriteLine(GlobalClass.Address2.Substring(0,1));
                            }
                        GlobalClass.Address2 = GlobalClass.Address2.Replace("'", "''");
                    }
                    else
                    {
                        GlobalClass.Address2 = "";
                    }

Obviously, this will only work in this one field and with this one utf8 character. Without this fix, when I tried to use Mysql to insert it, I received an error message which basically amounted to there being an error in my Mysql statement and by the time that it was sent to MySQL, it was basically saying to INSERT a record with a string like '\xE2\x84\x85..." plus the rest of the string.

Clearly I have no control over what Amazon is going to allow in the shipping address fields, so I need to account for any odd characters that may come through but I have no idea how to do that. I had hoped that just allowing for utf8 in my connection string (charset=utf8;) would fix it, but that didn't do anything - still had the same error. Perhaps my Google skills are lacking, but I can't seem to find a way to allow for any odd character that may come my way and I don't want to have to wait until someone types it to fix the error.

UPDATE:

What about sending "SET NAMES utf8" as a query? This is sort of out of my MySQL knowledge and I don't want to mess anything up, but would this work? And if so, would all programs that I have that use this database need to send that same query?

UPDATE 2: For those who keep asking for the exception error message, it is:

'MySql.Data.MySqlClient.MySqlException' occurred in MySql.Data.dll Additional information: Incorrect string value: '\xE2\x84\x85 Yo...' for column 'ShipAddressLine2' at row 1

UPDATE 3: From this discussion: SET NAMES utf8 in MySQL? I tried sending "SET NAMES 'cp1250'" and I was surprised to see that this allowed the insert to go through with the ℅ character in there. I gather that perhaps if before I retrieve the info that I send "SET CHARSET 'utf8'" as a query before another MySQL query to retrieve it that perhaps I will get the correct character? I'm encouraged that it went through my program by sending the "SET NAMES 'cp1250'" query, but I want to know what encoding set to use (CP1250 is Eastern European and while we have customers from around the globe, most of our customers are in the United States) and make sure this is sound practice before I go changing all my programs to include this. Anybody?

aynber
  • 22,380
  • 8
  • 50
  • 63
Alan Denke
  • 67
  • 9
  • `my program choked` - That really doesn't tell us much about what the problem was... Was an exception thrown? Was there an error message? – Matthew Watson Sep 10 '18 at 20:56
  • UTF-8 is not a character set, it's an encoding. The character you are talking about is a Unicode "Care Of" symbol (U+2105). I'm assuming it gets encoded as you say when rendered as UTF-8. Have you looked a the MySQL docs to see what characters are allowed/not allowed. In the past, I've sanitized strings before by having a mapping of string to string (for example "℅" to "c/o") and using looping through the mapping StringBuilder.Replace. But, that's tedious and slow. – Flydog57 Sep 10 '18 at 21:00
  • An exception was thrown. There was an error message mentioned above, although I unfortunately didn't write it down and can't recreate it. It was saying that I had an error in my MySQL statement, which was an INSERT statement and somehow it related to this shipping address field which had '℅...' in it. MySQL didn't like the ℅ character and the actual error message that displayed showed it as if I was trying to insert 'xE2\x84\x85...'. – Alan Denke Sep 10 '18 at 21:11
  • And my question isn't really about the error. I fixed that. My question is about whether there is a reasonable way to make strings that have characters like U+2105 in them safe for insertion in MySQL databases. I'm assuming this is a common problem for people who have to take data from the internet that they have no control over and inserting it in their own databases. – Alan Denke Sep 10 '18 at 21:31
  • Are you building the Insert statement by string concatenation with the values versus using a parametrized statement? – TnTinMn Sep 10 '18 at 21:32
  • I guess I'm not using a parametrized statement as I just have the statement typed out with variables rather than using comm.Parameters.Add to add the values to be inserted. I gather that this can make it subject MySQL injection (although I don't understand how or how it's avoided by using parameters). I'll happily change my programs to add a paramatrized solution, but does this have anything at all to addressing my question? Is there same sort of validation that takes place when you add the values as parameters which would prevent my error? – Alan Denke Sep 10 '18 at 21:56
  • With the minimal info of "an error in my Mysql statement", I assume that the parser could not deal with the character. Using parameters should avoid the parsing issue and just feed the data in (at least that is my simplified understanding of them). – TnTinMn Sep 10 '18 at 22:06
  • I added the error statement above, which is: 'MySql.Data.MySqlClient.MySqlException' occurred in MySql.Data.dll Additional information: Incorrect string value: '\xE2\x84\x85 Yo...' for column 'ShipAddressLine2' at row 1 I tried to INSERT a "℅" character using parameters and it generates the exact same error. I will still go ahead and start using parameters because that seems to be the right thing to do, but they DO NOT do any filtering or extra processing that will make my problem go away. I'd still like to know if there's a smart way to catch any and all odd characters that may come up – Alan Denke Sep 11 '18 at 23:37
  • Put this in the connection string: `id=my_user;password=my_password;database=some_db123;charset=utf8;` – Rick James Sep 20 '18 at 19:10
  • I actually tried that solution already (adding "charset=utf8" to the connection string), and it had no effect whatsoever - the same error resulted. – Alan Denke Sep 24 '18 at 20:15
  • And if id DID work, aren't you supposed to use utf8mb4 now? I believe that there are problems with utf8 in MySQL and they recommend everyone use utf8mb4 now. – Alan Denke Sep 24 '18 at 20:29

1 Answers1

0

In case someone else has this issue, I first managed to avoid the error by sending the MySQL Command: SET NAMES 'latin1' to the server before storing data. This allows any of the utf8 characters to be stored without causing a MySQL error (I tested it with several odd characters). This, however, stored the utf8 characters in a cryptic format, so I am going with a better answer below:

In my current solution, I edited the MySQL table and changed the character set for the relevant column that might receive utf8 data. I changed the column's character set to utf8mb4 and the column's collation to utf8mb4_general_ci. This allowed the data to be stored properly so the utf8 characters are correct.

In addition, when setting the connection string, I added charset=utf8mb4;.

string MyConString = "SERVER=*****;" + "DATABASE=******;" + "UID=********;" + "PASSWORD=*********;" + "charset = utf8mb4;" ;

although, as far as I can tell, it save the content to the field the same whether I include the charset= parameter or not.

Alan Denke
  • 67
  • 9
  • That is an awful solution, and it will lead to strange things later. The one nasty I know of is that MySQL will incorrectly compare strings. – Rick James Sep 20 '18 at 19:10
  • Can you tell me why setting the character set to the default would cause MySQL to incorrectly compare strings? Can you give me an example that I can verify? – Alan Denke Sep 24 '18 at 20:13
  • Are you assuming that I have a utf-8 database? My databases use the default, which is latin1 (or latin1_swedish_ci). – Alan Denke Sep 24 '18 at 20:28
  • Since `E28485`` is a UTF-8 encoding, not latin1 nor cp1250 nor anything else, you really should store it into a `CHARACTER SET utf8` or `utf8mb4` column. Sure, latin1 (or `BINARY`) would "work" for storing, but it is deceptive to anyone reading your code. – Rick James Sep 24 '18 at 22:31
  • That was helpful, thank you. I changed the column to utf8mb4 and that also avoids the error and stores the data correctly. I have changed my answer above. – Alan Denke Sep 24 '18 at 23:39