En Dash and Start of Guarded Area characters

Question

I'm trying to figure out why the same source data gives me two different output strings depending on the method I use to get it.

I have two CSV files containing data from QuickBooks. One was created using QuickBooks' built-in reporting functionality and the other was created by using a data-access API that uses the QuickBooks SDK. In both of these CSV files, there is a text column which I should be able to use as a key to relate the data in said files.

However, there is one particular character in one particular line that the two files can't seem to agree on:

In QuickBooks, the character has the visual appearance of a dash.
In the CSV created directly by QuickBooks, the character is exported as an en dash (U+2013, decimal code 8211).
But the SDK-based API reads it from QuickBooks as the "Start of Guarded Area" character (U+0096, decimal code 150).

This causes a problem because my code thinks the two strings are different (which they technically are, but shouldn't be) and therefore fails to match them. I'm convinced there must be some kind of encoding error somewhere along the line, but I can't find any link between the two characters.

I don't expect someone to be able to figure out exactly what's going on, since we don't have access to what QuickBooks or the API are doing behind-the-scenes. But I'm hoping someone can give me some idea as to why this character is being mis-translated.

score 3 · Accepted Answer · answered Oct 29 '18 at 15:04

rouckas's answer reminded me that I did actually solve this problem. He's mostly right, but the problem had nothing to do with a web browser so I thought I'd provide exactly what I did to fix things.

As far as I can tell, QuickBooks actually stores and outputs its data using windows-1252 (which is the encoding used when exporting to a text file from QB). But when the data is read through the SDK-based API, somewhere along the line the windows-1252 byte values get incorrectly interpreted as Unicode code points (either by the QB SDK, the third-party API or the .NET Framework itself; I have no way of knowing).

This works most of the time because the character codes from 0 to 127 (which include all the letters in the English alphabet) are the same in the two encodings. But from 128 to 159 the two schemes diverge: windows-1252 assigns printable characters there, while Unicode has control characters. So 150 in windows-1252 means "en dash", but in Unicode it means "Start of Guarded Area". (From 160 to 255 the two agree again.)
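The mismatch can be reproduced outside QuickBooks. The following is only a sketch in Python (not the language used in this thread), assuming that the faulty step amounts to treating each byte value as a code point, which is what decoding as latin-1 does:

```python
# The single byte that windows-1252 uses for an en dash.
raw = bytes([150])  # 0x96

# Decoded with the correct code page, byte 150 becomes U+2013 (en dash).
correct = raw.decode("cp1252")

# Treating the byte value as a code point directly (equivalent to decoding
# as latin-1) yields U+0096, the "Start of Guarded Area" control character.
misread = raw.decode("latin-1")

print(f"correct: U+{ord(correct):04X}")  # correct: U+2013
print(f"misread: U+{ord(misread):04X}")  # misread: U+0096

# List every byte value whose windows-1252 meaning differs from the
# Unicode code point with the same number.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace") != chr(b)
]
print(differing[0], differing[-1])  # 128 159
```

Only the 32 values from 128 to 159 differ, which is why ordinary English text survives the mistake and only characters such as dashes and curly quotes are affected.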

To correct for this I used the following code:

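' Input is the String read through the SDK-based API.
' Code page 1252 is requested explicitly: Encoding.Default depends on the
' machine's ANSI code page, and on .NET Core / .NET 5+ it is UTF-8 (there,
' code page 1252 must first be registered via CodePagesEncodingProvider).
Dim Cp1252 = System.Text.Encoding.GetEncoding(1252)
Dim Builder As New System.Text.StringBuilder(Input)
For i = 0 To Builder.Length - 1
    Dim n = AscW(Builder(i))

    ' Codes 128 to 255 are windows-1252 bytes that were mistaken for code points.
    If n > 127 AndAlso n < 256 Then
        Dim b As Byte = CByte(n)
        Builder(i) = Cp1252.GetChars({b})(0)
    End If
Next

Return Builder.ToString()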

This gets the character code of each character (using AscW). If the code is from 128 to 255 inclusive (255 being the last code in windows-1252), it is treated as a windows-1252 byte and decoded to the correct Unicode character.
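To check the effect of that loop on the key-matching problem, here is a rough Python transcription of the same idea. The function name `repair_cp1252` and the sample key strings are made up for illustration. One difference from .NET is an assumption worth noting: Python's cp1252 codec leaves five byte values unassigned, so those characters are passed through unchanged.

```python
def repair_cp1252(text: str) -> str:
    """Re-decode code points 128-255 as windows-1252 bytes."""
    out = []
    for ch in text:
        n = ord(ch)
        if 127 < n < 256:
            try:
                ch = bytes([n]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no windows-1252
                # meaning in Python's codec; keep the character as it is.
                pass
        out.append(ch)
    return "".join(out)


from_sdk = "Parts \u0096 Misc"     # hypothetical key as read through the API
from_export = "Parts \u2013 Misc"  # the same key from the CSV export

print(from_sdk == from_export)                 # False
print(repair_cp1252(from_sdk) == from_export)  # True
```

Characters that are already correct, such as a genuine U+2013, have codes above 255 and are left alone, so applying the repair to both files before comparing keys should be harmless.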

score 1 · Answer 2 · answered Oct 24 '18 at 11:26

The problem is that they are (probably) encoding the en dash as U+0096 internally. That value corresponds to the Windows-1252 byte (0x96) for the en dash, but in Unicode it actually represents the "Start of Guarded Area" control character.

For backward-compatibility reasons, web browsers convert this character to U+2013 when displaying a webpage.

So there are two problems: the wrong encoding on the QuickBooks side, and the confusing behavior of the browser, which converts the character from windows-1252 to Unicode.
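One concrete form of that browser behavior is the HTML rule for numeric character references, where references to codes 128 to 159 are remapped to their windows-1252 meanings. Python's standard `html.unescape` follows the same HTML5 rules, so it can serve as a small illustration (a sketch, not a statement about what any particular browser did in this case):

```python
import html

# A numeric character reference to code 150 nominally names a C1 control
# character, but HTML parsing rules remap it to the windows-1252 meaning.
decoded = html.unescape("&#150;")
print(f"U+{ord(decoded):04X}")  # U+2013

# A direct chr() call applies no such remapping.
print(f"U+{ord(chr(150)):04X}")  # U+0096
```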

There are several related questions concerning this issue:
