
I have an ASP.NET 4.0 web application, written in C#, and one of the things it does is write a file to the file system based on data pulled from a SQL Server 2012 query. Sometimes the user input contains French characters, frequently cut and pasted from Microsoft Word and therefore in ANSI encoding. The file my web application creates is then loaded into another program by code outside of my control, i.e., not in my web application. The problem is that this second program requires UTF-8 encoding. I've written code to convert my program's output to UTF-8, but it's still not loading correctly, so I think I'm doing something wrong. Here's my code:

protected void writeToClientFile(DataSet ClientGenl, DataSet ClientBus, DataSet ClientBill)
    {
        FileStream fileStream = null;
        string fileName = "ClientTest.txt";
        string pathName = ConfigurationManager.AppSettings["EliteFilePath"];
        try
        {
            using (new KLClassLibrary.Impersonator(proxyaccount, domain, password))
            {
                fileStream = OpenASAP(pathName + fileName, 10);
                using (TextWriter tw = new StreamWriter(fileStream))
                {
                    foreach (DataRow rowGeneral in ClientGenl.Tables[0].Rows)
                    {
                        string fileTextGeneral = "CLNUM:" + rowGeneral["clnum"].ToString().toEliteInput();
                        byte[] originalBytes = Encoding.Default.GetBytes(fileTextGeneral);
                        byte[] convertedBytes = Encoding.Convert(Encoding.Default, Encoding.UTF8, originalBytes);
                        char[] convertedChars = new char[Encoding.UTF8.GetCharCount(convertedBytes, 0, convertedBytes.Length)];
                        Encoding.UTF8.GetChars(convertedBytes, 0, convertedBytes.Length, convertedChars, 0);
                        string convertedString = new string(convertedChars);
                        tw.WriteLine(convertedString);
                    }
                }
            }
        }
        catch (Exception)
        {
            throw;   // rethrow without resetting the stack trace
        }
        finally
        {
            if (fileStream != null)
                fileStream.Dispose();
        }
    }
FileStream OpenASAP(string path, int maxTries)
    {
        FileStream fs = null;
        bool lastResult = false;
        int count = 0;
        while (!lastResult && count < maxTries)
        {
            lastResult = TryOpen(path, out fs);
            if (!lastResult)
                Thread.Sleep(100);   // back off only when the open attempt failed
            count++;
        }
        if (!lastResult)   // testing count as well would wrongly throw on a last-try success
        {
            throw new IOException("The file is still locked after " + maxTries + " attempts");
        }
        return fs;
    }
bool TryOpen(string FileWithPath, out FileStream fs)
    {
        try
        {
            fs = File.Open(FileWithPath, FileMode.Append, FileAccess.Write, FileShare.None);
            return true;
        }
        catch (Exception)
        {
            fs = null;
            return false;
        }
    }
Melanie
  • What kind of output does `rowGeneral["clnum"].ToString().toEliteInput()` actually _give_? Without knowing that, there is no way to solve this issue. – Nyerguds May 07 '18 at 16:42
  • Also note... the Windows clipboard on _all_ modern Windows versions gives Unicode text, and if you are using C# to fetch it, this is even irrelevant since you just get it as a `String` object automatically. And don't use `Encoding.Default`; it does not do what you seem to think it does. It does not mean "ANSI". It means "use the local encoding of whatever language happens to be configured on this PC", which may go _very_ wrong if you ever run it on something that happens to _not_ be set to US-English. – Nyerguds May 07 '18 at 16:45
  • toEliteInput is an Extension Method that returns a string. Is there a way to accurately determine what its encoding is? Thanks. – Melanie May 07 '18 at 20:47
  • Yes, but I'm trying to determine what the actual _problem_ is, which requires seeing what your "wrong" output is. There are several ways in which encoded text can be corrupted, and the approach to fixing it depends on what you actually _get_. You never showed it. I just want an example of the kind of thing you get back from that query. Anyway, [this answer](https://stackoverflow.com/a/48460277/395685) should help you on your way when it comes to detecting encodings. – Nyerguds May 08 '18 at 10:00
  • Thanks @Nyerguds. When I output to a text file using the above code, I get something that looks like this: ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñѪ°®©¥ãÀ. This is what I've input and saved to the database. – Melanie May 08 '18 at 20:43
  • So, then... you get text exactly like you want? Then the only thing you need to do is convert that to UTF-8 bytes. There's no "from" conversion if your string contains the correct data (see the sketch after this thread). – Nyerguds May 08 '18 at 20:55
  • That output is very subjective though... you should view it in a text editor that can actually tell you what text encoding you're looking at. – Nyerguds May 08 '18 at 20:57
  • When I load ÇüéâäàåçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø×ƒáíóúñѪ°®©¥ãÀ into the 3rd party application, it displays as Çüéâäà åçêëèïîìÄÅÉæÆôöòûùÿÖÜø£Ø. When I look at my original string in Notepad++ it says it's already UTF-8 encoding. The person in charge of the 3rd party app swears it requires UTF-8 encoding. Is she just wrong about what encoding I need to use? – Melanie May 10 '18 at 16:29
  • if it displays _as_ that when _actually viewed_ as UTF-8 then you're double-encoding it. – Nyerguds May 10 '18 at 17:00
  • And yet if I don't encode it, the 3rd party app doesn't display it properly either. – Melanie May 11 '18 at 15:09
  • I'm going to go back to the 3rd party app's tech support. I think there's something here I'm missing. – Melanie May 11 '18 at 20:48
  • I just wanted to post here that we discovered, quite by accident, that what the 3rd party vendor had told us was wrong. The application does not accept UTF-8 encoding. It requires only plain ANSI, not extended characters. So there's really no way to get those characters into the app. All this work for nothing. Sorry to have wasted your time, but I appreciate your help. – Melanie May 25 '18 at 18:56
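
As Nyerguds notes in the thread above, a .NET string is already UTF-16 internally; if it holds the right characters, the only step left is choosing UTF-8 when the bytes are written. A minimal sketch of that direct write, assuming the query result is already correct (the path and sample value are placeholders):

using System.IO;
using System.Text;

// The StreamWriter's encoding decides what bytes land on disk,
// so no Encoding.Default / Encoding.Convert round-trip is needed.
string line = "CLNUM:ÇüéâäàåçêëèïîìÄÅÉ";   // stands in for rowGeneral["clnum"]

// new UTF8Encoding(false) writes UTF-8 without a byte order mark;
// pass true instead if the consuming program expects a BOM.
using (var writer = new StreamWriter(@"ClientTest.txt", false, new UTF8Encoding(false)))
{
    writer.WriteLine(line);
}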

1 Answer


You could derive the original encoding from StreamReader.CurrentEncoding, then convert it with Encoding.Convert(), specifying the derived encoding as the source and Encoding.UTF8 as the destination.

This way, you let the StreamReader decide which encoding fits the content of the source file.
It will probably be Unicode (UTF-16 in Windows), converted from the Default encoding.

string TextDestination = string.Empty;

using (FileStream FileOrigin = new FileStream(@"[SomeSourceFile]",
                                   FileMode.Open, FileAccess.Read, FileShare.None))
using (StreamReader orgReader = new StreamReader(FileOrigin))
{
    // Read first: CurrentEncoding is only reliable after the reader has
    // examined the stream (it detects a byte order mark on the first read).
    string originalText = orgReader.ReadToEnd();
    Encoding OriginalEncoding = orgReader.CurrentEncoding;
    byte[] OriginalBytes = OriginalEncoding.GetBytes(originalText);
    byte[] DestinationBytes = Encoding.Convert(OriginalEncoding, Encoding.UTF8, OriginalBytes, 0, OriginalBytes.Length);

    using (MemoryStream memstream = new MemoryStream(DestinationBytes, 0, DestinationBytes.Length))
    using (StreamReader destReader = new StreamReader(memstream, Encoding.UTF8))
    {
        TextDestination = destReader.ReadToEnd();
    }
}
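
Once TextDestination holds the converted text, writing it back out as UTF-8 is a one-liner; the destination path here is a placeholder (new UTF8Encoding(false) omits the byte order mark, use Encoding.UTF8 instead if a BOM is wanted):

File.WriteAllText(@"[SomeDestinationFile]", TextDestination, new UTF8Encoding(false));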

As a note: if the original text contained characters that did not fit the local encoding when it was acquired, and no specific encoding was used to preserve the source code-page mapping, the source text may already be compromised.

Jimi
  • I tried your suggestion, but it's still not giving me the correct output. One thing I didn't mention is that the data is being pulled from a SQL Server 2012 query and then manipulated by my code. I don't think this changes anything, but I'll also edit my original question to include this. – Melanie May 07 '18 at 15:57
  • @Nyerguds Well, everything is unreliable if you don't know the original encoding (and whether the original CodePage mapping was preserved). This code just tests the possibility of mapping a file (what I thought was a text file) to Unicode using the local CodePage. It's true that StreamReader.CurrentEncoding should be checked after the first read(), but if the text can be mapped as described, you'll get positive results anyway. If it can't, you'll have to guess or find out how that text was encoded when stored locally. – Jimi May 07 '18 at 21:43
  • @Melanie Do you know how the original text was stored in SQL Server? Using `varchar` or `nvarchar`? If it's `nvarchar`, it should be UCS2/UTF16, if it's `varchar`, it could be ISO-8859-1 (CP 1252). If you can do it, check the `COLLATIONPROPERTY`, and see what CodePage or LCID was used. – Jimi May 07 '18 at 21:50
  • @Jimi As far as I know, `StreamReader`'s encoding detection is _purely based on byte order marks_. It will not automatically detect a UTF encoding if there are no specific indicator bytes at the start of the text. – Nyerguds May 08 '18 at 09:57
  • @Nyerguds Correct. That's why here I'm not trying to detect a specific format, performing a partial reading to let the StreamReader identify an [improbable] BOM. I'm testing whether the source text can be translated to Unicode preserving the original codepage mappings. If it can, the text (when acquired) has been mapped correctly or the local one is compatible. If it cannot, the original text has already been transcoded to a, hopefully, wider format or the original format is lost. Since now I know it comes from a SQL DB, I suggested directly testing the results with its standard encodings. – Jimi May 08 '18 at 10:29
  • @Jimi - the collation of the database is SQL_Latin1_General_CP1_CI_AS and the data is stored in various varchar fields. Thanks for your help. – Melanie May 08 '18 at 20:22
  • @Melanie That is ISO-8859-1 (Latin1). Let me know whether using that encoder directly fixed the problem. I'm interested in the results, too. – Jimi May 09 '18 at 02:27
  • @Jimi, when I look at the encodings available through the C# Encoding object, I don't see anything like ISO-8859-1 or Latin1. Which Encoding property does that correspond to? – Melanie May 10 '18 at 16:35
  • @Melanie Use the CodePage: `Encoding.GetEncoding(28591)` or CodePage 1252. See this MSDN document: [Code Page Identifiers](https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx). – Jimi May 10 '18 at 16:41
  • @Melanie You can also use the literals: `Encoding encoding = Encoding.GetEncoding("iso-8859-1");` – Jimi May 10 '18 at 16:51
  • @Jimi, thank you so much for this information. Unfortunately, setting the original encoding to code page 1252 doesn't help. At this point, I'm going back to the 3rd party app's tech support; I think there's something here that I'm missing. I'll keep you posted, though, so you know what the resolution is (if there ever is one). – Melanie May 11 '18 at 20:38
  • @Melanie Well, the actual ISO-8859-1 CodePage is 28591; 1252 is Windows-1252 (Windows-specific). Anyway, I'll be glad to hear how it goes :) – Jimi May 11 '18 at 20:46
  • I just wanted to post here that we discovered, quite by accident, that what the 3rd party vendor had told us was wrong. The application does not accept UTF-8 encoding. It requires only plain ANSI, not extended characters. So there's really no way to get those characters into the app. All this work for nothing. Sorry to have wasted your time, but I appreciate your help. – Melanie May 25 '18 at 18:56
  • @Melanie ANSI or ASCII? `Iso-8859-1` is an ANSI encoding standard. It can hold all Latin1 area characters plus more. – Jimi May 25 '18 at 19:02
  • I believe it's ANSI, but it seems to be limited to the "standard" English characters. It's hard to say, since it's a third party app that the vendor no longer supports, so it's a black box. – Melanie May 29 '18 at 14:09
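
For completeness: given the eventual finding that the target application only accepts plain ASCII-range characters, the extended characters cannot survive the trip at all; the closest workable option is to substitute them deliberately rather than let the application mangle them. A minimal sketch using an encoder replacement fallback, assuming '?' is an acceptable stand-in (path and sample value are placeholders):

using System.IO;
using System.Text;

// Encode as US-ASCII with a replacement fallback: every character outside
// the 7-bit range (é, ç, à, ...) becomes the fallback string instead of
// throwing an EncoderFallbackException.
Encoding asciiWithFallback = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback("?"),   // placeholder replacement choice
    new DecoderReplacementFallback("?"));

string source = "CLNUM:Çüéâä";                        // sample value with extended characters
byte[] bytes = asciiWithFallback.GetBytes(source);    // yields "CLNUM:?????"
File.WriteAllBytes(@"ClientTest.txt", bytes);         // path is a placeholder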