I have been tasked with migrating a Microsoft SQL Server 2005 database to MySQL 5.6 (these are both database servers runnig locally) and would really appreciate some help.
-MSSQL source database has latin1 collation (so has ISO 8859-1 character set right?) but doesn't have any char/varchar fields (any string field is nvarchar/nchar) so all this data should be using the UCS-2 character set.
-MySQL target database wants the character set UTF-8
I decided to use the database migration toolkit in the latest version of the MySQL workbench. at first it worked fine and migrated everything as expected. But I have been totally tripped up upon encountering UCS-2 surrogate pair characters in the MSSQL database.
The migration toolkit copytable program did not provide a very useful error message: "Error during charset conversion of wstring: No error". It also did not provide any field/row information on the problem-causing data and would fail within chunks of 100 rows. So after searching through the 100 rows after the last successful insert I found that the issue seemed to be caused by two UCS-2 characters in one of the nvarchar fields. They are listed as surrogate pairs in the UCS-2 character set. They were specifically the characters DBC0 and DC83 (I got this by looking at the binary data for the field and comparing byte pairs (little endian) with data that was being migrated successfully).
When this surrogate pair was removed from the MSSQL database the row was migrated successfully to MySQL.
Here is the problem:
I have tried to search for these characters in a test MSSQL table (this chartest table is just various test strings an nvarchar field) to prepare a replacement script and keep getting strange results... I must be doing something incorrectly.
Searching for
SELECT * FROM chartest WHERE text LIKE NCHAR(0xdc83)
Will return any surrogate pair character (whether or not it uses DC83), but obviously, only if it is the only character (or part of the pair) in that field. This isn't a big deal since I would like to remove any instance of these anyway (I dont like to remove data like this but I think we can afford it).
Searching for
SELECT * FROM chartest WHERE text LIKE '%' + (NCHAR(0xdc83))+ '%'
Will return every row! Regardless of whether it even has a unicode character present in the field let alone the DC83 character. Is there a better way to find and replace these characters? Or something else I should try?
I have also tried setting the target databse, table, and field character set to UCS-2 but it seems as though it does not make a difference.
I should also mention that this migration is using live data (~50GB database!) while one of the sites that feeds it is taken offline so any solutions to this need to have a quick running time...
I would appreciate any suggestions very much! Please let me know if there is any information I have left out.