I am pulling some data from a SQL Server database and writing it to a text file and, for the most part, the process is working as intended. There is one issue that I've been unable to resolve. Apostrophes are showing up as: ’.

Here is the code writing to the file:

using (var writer = new StreamWriter(filePath, false))
{
    foreach (var textLine in dataList)
    {
        writer.WriteLine(textLine);
    }
}

I have tried using Encoding.Default and Encoding.UTF8 on the text, but that didn't make a difference.

I'm opening up the files in Notepad, Notepad++, and UltraEdit.

Can anyone help me identify this problem?

Ron Saylor

2 Answers

Are you sure you're trying to store a real apostrophe (character code 39) and not one of the smart-quote characters? https://en.wikipedia.org/wiki/Quotation_mark_glyphs

Michael Gunter
  • I'm not. The text that is written to the file can be entered in many ways (typed, copy/pasted, etc). In some cases, the apostrophe shows up fine and in others, I get the characters shown in the question above. – Ron Saylor Jan 09 '14 at 19:40
  • If someone is copy-pasting, especially if doing so from Word, chances are they're pasting a smart quote. You may need to pre-process the input to convert these, if necessary. Or just accept the fact that you may need to store wide characters. – Michael Gunter Jan 09 '14 at 19:42
  • @RonS It looks like you are getting curly apostrophes: http://stackoverflow.com/a/2477480/424129 – 15ee8f99-57ff-4f92-890c-b56153 Jan 09 '14 at 19:42
  • @EdPlunkett I am. I pulled up the specific note to confirm. Is this something I can fix prior to writing it to the file? – Ron Saylor Jan 09 '14 at 20:09
  • @RonS The simple, ugly way is to replace both left and right curly apostrophes with seven-bit ASCII straight apostrophes in the text before writing it. I'd do the same with curly quotes as well. The alternative would be using the debugger to figure out exactly what you're getting from the DB (UTF16, maybe?) and make sure that's what you're writing to the text file, with the correct byte-order mark at the beginning of the text file. Conceivably, you might be getting UTF-8 data, but the string class may *think* it's ANSI? – 15ee8f99-57ff-4f92-890c-b56153 Jan 09 '14 at 20:17
  • I may stick with the ugly and simple way for now. I am already using a method or two to clean different "aspects" of the data I am pulling back, so this could be added. It's not the best solution, may even be considered a Band Aid, but for what I am trying to accomplish, it might work. – Ron Saylor Jan 09 '14 at 20:25
  • @RonS We did that replace gimmick in production code last month and cried all the way to the bank. If you can be reasonably confident you're not going to be getting anything but the usual pasted-from-Word stuff, it's a livable compromise. – 15ee8f99-57ff-4f92-890c-b56153 Jan 09 '14 at 20:58
  • My main concern is whether or not other "smart-quote characters" will pop up in the future, but this is the only one that showed up in nearly 3,000 rows of data. – Ron Saylor Jan 09 '14 at 21:03
  • @RonS I'm guessing there won't be [anything new in that department any time soon](http://en.wikipedia.org/wiki/Quotation_mark_glyphs). Unless MS does something much crazier than usual in a new version of Word. – 15ee8f99-57ff-4f92-890c-b56153 Jan 10 '14 at 14:08
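The replacement approach discussed in the comments above could look roughly like the sketch below. The class and method names (`SmartQuoteCleaner`, `ToAsciiQuotes`) are made up for illustration, and the sketch assumes that only the four common curly quote characters need handling; anything else passes through unchanged.

```csharp
using System;

public static class SmartQuoteCleaner
{
    // Hypothetical helper: maps the curly quotes that Word typically inserts
    // to their plain ASCII equivalents. All other characters are left as-is.
    public static string ToAsciiQuotes(string text)
    {
        if (text == null) return null;

        return text
            .Replace('\u2018', '\'')  // left single quotation mark
            .Replace('\u2019', '\'')  // right single quotation mark (curly apostrophe)
            .Replace('\u201C', '"')   // left double quotation mark
            .Replace('\u201D', '"');  // right double quotation mark
    }
}
```

In the loop from the question, this would be called as `writer.WriteLine(SmartQuoteCleaner.ToAsciiQuotes(textLine));`.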
’ is the UTF-8 byte sequence for the character ’ (U+2019, right single quotation mark), displayed as ANSI characters using the Windows-1252 code page.
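One way to check this is to look at the bytes that UTF-8 produces for the curly apostrophe U+2019. The sketch below is only an illustration (the class and method names are hypothetical); it prints the bytes as hex. In Windows-1252, 0xE2 is â, 0x80 is € and 0x99 is ™, which is exactly the three-character sequence from the question.

```csharp
using System;
using System.Text;

public static class ApostropheBytes
{
    // Returns the UTF-8 bytes of the given text as a hyphen-separated hex string.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

`ApostropheBytes.Utf8Hex("\u2019")` returns `E2-80-99`, while a plain ASCII apostrophe gives the single byte `27`.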

UltraEdit should have no problem detecting on opening that the created text file is encoded in UTF-8, and should display it correctly.

See my answer at "bad character encoding after xsl 1.0 transform" for details on how UltraEdit auto-detects UTF-8 encoding, and on what you can do to open a UTF-8 encoded file if auto-detection is disabled in the configuration (Advanced - Configuration - File Handling - Unicode/UTF-8 detection) or fails because the first UTF-8 character is not within the first 64 KB.

You could help text editors detect the UTF-8 encoding by writing the 3 bytes 0xEF 0xBB 0xBF (displayed as the ANSI string ï»¿) at the start of the file, before writing the lines of the data list. 0xEF 0xBB 0xBF is the byte order mark (BOM) for a UTF-8 encoded file; text editors recognize it but do not display it.
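In C#, the BOM does not have to be written by hand. As far as I know, the two-argument `StreamWriter` constructor used in the question writes UTF-8 without a BOM, so one option is to pass an encoding that emits it. The following is a minimal sketch of that idea; `BomFileWriter` and `WriteLines` are hypothetical names, and `filePath` and `dataList` stand in for the variables from the question.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class BomFileWriter
{
    // Writes each line as UTF-8 and asks the encoding to emit the
    // 3-byte identifier (EF BB BF) at the start of the file.
    public static void WriteLines(string filePath, IEnumerable<string> dataList)
    {
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);

        // append: false overwrites the file, so the BOM lands at offset 0.
        using (var writer = new StreamWriter(filePath, false, utf8WithBom))
        {
            foreach (var textLine in dataList)
            {
                writer.WriteLine(textLine);
            }
        }
    }
}
```

The curly apostrophe is still stored as E2 80 99; the BOM only tells the editor to decode the file as UTF-8 instead of guessing.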

The character ’ is also available in the Windows-1252 code page (hexadecimal value 0x92), so it could also be stored in the text file by converting from UTF-8 to ANSI. But the data list may also contain Unicode characters that are not available in the system code page, so it is better to create the file as a UTF-8 encoded text file rather than as an ANSI text file.

Mofi