
I'm tasked with reading in some file data, doing stuff with it, then saving it to a database.

Recently, we've started noticing 'weird' characters passing through the system. It looks like a classic case of Unicode characters which we (the devs) failed to anticipate but, of course, should have.

[Screenshot: one of the received files opened in Visual Studio, showing the unexpected characters]

We're not too sure of the proper way to handle this data so it finally makes it into the database correctly.

This is the high level process:

  • Data arrives via FTP in files (XML files, btw, e.g. hi.xml, foo.xml, etc.)
  • We read the file, split the data into class instances (per element in the XML element array)
  • Store each instance to some Azure queue.
  • Another process pops the queue message then reads/cleans/does stuff
  • Finally, save this data to our SQL Server.

We cannot control the input source - that's handled by third parties.

So - in a .NET environment, what are the main things we need to make sure we're handling? Pseudo-code level stuff:

  • When we read the XML file data, we must make sure we read it with Unicode support?
  • Do we make the safe assumption that this should all be UTF-8? (Tip: our input source people would probably have NO idea what Unicode is, let alone UTF-8, UTF-16, etc.) Also, not all XML declarations we've received have ANY encoding="UTF-8/16/<anything>" defined in their XML header/line 1.
  • How do we properly 'save' this data (once we have it in a class instance) to a queue? Right now, we're serializing the class to a JSON string so we can easily deserialize it back in the other process/next step.
  • DB fields need to be NVARCHAR (vs. VARCHAR)?

I am researching the main points we need to cater for (with respect to .NET) so we can read -> pass along -> finally write some Unicode data from a file -> queue -> DB without losing or munging any of it.

Aside: if we receive bad data, sure, we can't do much with it, e.g. if the upstream data source incorrectly serializes their data to a file before they FTP it to us.

ASSUMPTION: The sample file above shows some weird chars. The file is viewed in Visual Studio. The assumption is that those characters are Unicode and that is how VS displays them, versus the alternative: that's not Unicode, but a file that saved the source Unicode incorrectly.

References (as proof that I attempted some reading beforehand):

  • Have you tried adding an additional "cleanup" XML step before the "We read the file, split the data into class instances (per element in the XML element array)" step? There is a method called "XmlConvert.IsXmlChar [https://msdn.microsoft.com/en-us/library/system.xml.xmlconvert.isxmlchar(v=vs.110).aspx]". Have you seen that? That might help. – Kosala W Sep 05 '17 at 04:31
  • Yep, we do things like that too. But we thought an XML char is not the same as a Unicode/UTF-8 char... ?? – Pure.Krome Sep 05 '17 at 04:33
  • As far as I know, the above function is used to make sure XML conforms to section 2.2 of https://www.w3.org/TR/2006/REC-xml-20060816/#charsets. It says "All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1". – Kosala W Sep 05 '17 at 04:39
  • Link (in first comment) doesn't work btw? – Pure.Krome Sep 05 '17 at 04:45
  • The msdn link https://msdn.microsoft.com/en-us/library/system.xml.xmlconvert.isxmlchar(v=vs.110).aspx – Kosala W Sep 05 '17 at 05:45
  • Note: Unless there are extraordinary restrictions on the document content, you always need Unicode support in that all characters are Unicode, regardless of the document encoding. Numeric character entity references represent Unicode codepoints (e.g. &#x1F6B2; for 🚲). – Tom Blodget Sep 05 '17 at 17:21

1 Answer


The solution is the right choice of encoding. The encoding is the format used when binary data is transformed into a string. You have to pay particular attention whenever the data leaves one environment and enters the next. You have to make sure that no data is lost in any of these transformations (e.g. if there is an ASCII step anywhere in the chain, you will lose information). Information that is lost cannot be recovered.
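To make that kind of loss concrete (a small sketch; the sample string is arbitrary), round-tripping a string through ASCII destroys every non-ASCII character:

using System.Text;

// Round-tripping through ASCII replaces non-ASCII characters with '?',
// and there is no way to get the originals back afterwards.
string original = "Café déjà vu";
byte[] bytes = Encoding.ASCII.GetBytes(original);
string damaged = Encoding.ASCII.GetString(bytes); // "Caf? d?j? vu"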

You have the following transitions you have to take care of and which you have to handle separately:

  1. The transformation from XML to a .NET string.
  2. (Maybe) a transformation from .NET to the queue.
  3. (Maybe) a transformation from the queue to the worker process.
  4. The transformation from the worker process to the DB.

I assume that you have transformations 2 and 3 under control (if everything is .NET, there is no issue at all here). The key issues are 1 and 4 (but I'll cover 2 as well, since the third part of your question refers to it).

Reading the XML into .NET

According to the XML standard, the encoding of an XML file can be declared in the XML declaration, determined by a byte order mark, or known from an external source (e.g. an HTTP header). Section 4.3.3 of the specification says:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8.

That means if the documents you receive have neither an XML declaration (which the standard allows) nor a BOM, they have to be assumed to be UTF-8. Of course, in practice standards are not always followed, so whether the data you receive is in the correct format depends on how strictly your source conforms to the standard (which relates to your second question). You should clear that up with them if you can. Depending on how much flexibility you have in rejecting invalid data, you can also decide to follow the XML standard (that's what standards are for) and make it the source's responsibility to ensure the data is correct.

If you use the .NET classes in the System.Xml and/or System.Xml.Linq namespaces, especially an XmlReader created directly on the binary source (without transforming it to a string first), the XML declaration and BOM are evaluated automatically and the XML is read with the correct encoding. Creating an XML reader directly from the stream is simple:

using System.IO;
using System.Xml;
// Create a stream from your data, depending on the source (e.g. a FileStream)
using (Stream inputStream = File.OpenRead("hi.xml"))
using (XmlReader reader = XmlReader.Create(inputStream))
{
    // Use the reader; the encoding is detected from the BOM/XML declaration
}

This approach provides full Unicode support (which should answer your first question).

Saving the data to the queue

If you are happy with your solution of creating JSON, you shouldn't change it. The encoding used for storing the data has nothing to do with the high-level format you are using (plain text, JSON, XML, etc.). As long as you stay within .NET, strings are Unicode (UTF-16 internally) and you don't lose any data. However, if there is any serialization/deserialization in the process chain, make sure matching, lossless encodings are used on both ends (which would be UTF-8 or UTF-16, whichever is more efficient in your scenario).
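As an illustration (a minimal sketch, assuming Newtonsoft.Json; the Record class and its Name property are placeholders for your own types), encode the JSON string as UTF-8 bytes before queuing it, and decode with the same encoding in the worker:

using System.Text;
using Newtonsoft.Json;

public class Record
{
    public string Name { get; set; }
}

// Producer: serialize to a JSON string, then explicitly encode as UTF-8 bytes.
var record = new Record { Name = "Pavlović" };
string json = JsonConvert.SerializeObject(record);
byte[] payload = Encoding.UTF8.GetBytes(json);

// Consumer: decode with the SAME encoding, then deserialize.
string received = Encoding.UTF8.GetString(payload);
Record roundTripped = JsonConvert.DeserializeObject<Record>(received);

As long as both sides agree on UTF-8, no characters can be lost in transit.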

Saving the data to the DB

Depending on your source data, you can either choose a collation that can store any character you might encounter (of course you need to know that first) and use VARCHAR or you can use NVARCHAR, which will be able to store any possible Unicode text. The former uses less storage space, but might come with some data loss in case some unexpected characters are found in your input data. The latter can store anything, but needs double the space.
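As a sketch (assuming System.Data.SqlClient, a hypothetical dbo.Records table with an NVARCHAR column, and a connectionString from your configuration), passing the value as an explicitly typed NVarChar parameter keeps the full Unicode string end to end:

using System.Data;
using System.Data.SqlClient;

// Hypothetical table: CREATE TABLE dbo.Records (Name NVARCHAR(200));
using (var connection = new SqlConnection(connectionString))
using (var command = new SqlCommand(
    "INSERT INTO dbo.Records (Name) VALUES (@name)", connection))
{
    // Typing the parameter as NVarChar avoids an implicit conversion
    // to the column's code page, which could munge characters.
    command.Parameters.Add("@name", SqlDbType.NVarChar, 200).Value = record.Name;
    connection.Open();
    command.ExecuteNonQuery();
}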
