I'm tasked to read in some file data, do stuff with it, then save this data to a database.
Recently, we've started noticing 'weird' characters pass through the system. It looks like it's a classic case of Unicode characters which we (the dev's) failed to anticipate but of course, should have.
We're not to sure of the proper way we need to handle this data so it finally makes it into the database correctly.
This is the high level process:
- Data arrives via
FTP
in files. (xml
files, btw. e.g.hi.xml
,foo.xml
, etc) - We read the file, split the data into class instances (per element in the XML element array)
- Store each instance to some
Azure queue
. - Another process pops the queue message then reads/cleans/does stuff
- Finally, save this data to our
Sql Server
.
We cannot control the input source - that's handled by third parties.
So - in a .NET environment, what are the main pieces of information we need to make sure we're handling. Like .. pseudo-code stuff..
- When we read the xml file data, we must make sure we read it somehow read it with Unicode support?
- Do we make the safe assumption that this should be all UTF-8? (tip: our input source people would probably have NO idea what Unicode is ... let alone UTF-8, 16, etc). Also, not all xml declarations (we received) have ANY
encoding="UTF-8/16/<anything>"
defined in their xml header/line 1. - How do we properly 'save' this data (once we have it in a class instance) to a queue. Right now, we're serializing the class to a JSON string so we can easily deserialize it back in the other process/next step
- DB fields need to be
NVARCHAR
(vsVARCHAR
)
I am researching the main points we need to cater for (with respect to .NET) so we can read->pass along->finally write some Unicode data from a file->queue->db without losing or munging any of this.
Aside: If we receive bad data, sure, we can't do much with it. e.g. the upstream data source incorrectly serializes their data to a file before they FTP it to us.
ASSUMPTION: The sample file above shows some weird chars. This file is viewed in Visual Studio. The assumption is that those characters are Unicode and that is how VS, displays them. Versus... that's not Unicode but a file that saved the source Unicode, incorrectly.
References (as proof that I attempted to some reading beforehand):
- Related SO question I created to confirm if the source data is good/bad: Is this a valid UTF8 character in this xml file?
- This Unicode char exists in our sample file => What is this character? â\u0080\u0099
- From TheMan himself: http://csharpindepth.com/Articles/General/Unicode.aspx
- Possible nuget package that can do all the heavy lifting: https://github.com/neosmart/unicode.net
- Beginners intro: https://www.codeproject.com/Articles/885262/Reading-and-writing-Unicode-data-in-NET