Parsing what is supposed to be a tab-delimited file in C#

Question

B"H

I have a file that should be tab-delimited. Excel opens it without a problem, but when I try File.ReadAllText() I can't get a decent representation. The best I can do is with UTF8, which returns most of the data, but the first line is all messed up and some tabs in the rest of the document are missing.

Here is the first line when read using UTF8: �\u0010\b\u0004c\u0004\0\0�\u0006�\u0003\0\0\0\0!�A\u0004\0\0\0\0\0\0\0\0\u0001\0\0\0ID\0\0\0\0\0\0C\0\0\0\0\u0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0NAME\0\0\0\0\0\0\0C\0\0\0\0\u001e\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ADDR\0\0\0\0\0\0\0C\0\0\0\0(\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ADDRC\0\0\0\0\0\0C\0\0\0\0(\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0CITY\0\0\0\0\0\0\0C\0\0\0\0\u001e\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0STATE\0\0\0\0\0\0C\0\0\0\0\u0014\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0ZIP\0\0\0\0\0\0\0L\0\0\0\0\u0001\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r

And here are the first few bytes as displayed when opened in Notepad: õc ÁŸ !£A

Does anyone recognize that encoding?

StreamReader.CurrentEncoding only works for the standard encodings. These files are obviously not standard. — Rabbi, Aug 05 '16 at 18:07
@peter-duniho This question is not a duplicate. It is not even related to the question that you posted. That question asks how you would programmatically find the encoding from the small list of standard encodings. I don't need programmatic detection. I need help identifying this particular encoding. — Rabbi, Aug 05 '16 at 18:16
_" I don't need programmatic detection"_ -- then your question isn't even a programming question and doesn't belong on Stack Overflow. Try superuser.stackexchange.com instead. Or programmatically find out the encoding, per the marked duplicate (why do you care _how_ you find out the encoding, as long as you find it out?) — Peter Duniho, Aug 05 '16 at 19:00
All of the answers in that thread (rightfully) assume that you are trying to find the encoding from the limited list directly supported by the .NET libraries. I wouldn't mind finding it out programmatically if that is the solution people come up with, as long as the answer is not limited to those few encodings. That question is about how to do a SIMPLE find using code. My question is how to find an obscure encoding, whether the solution uses code or not. I then need to write some code to do the translation. — Rabbi, Aug 05 '16 at 19:14
@PeterDuniho Hi. This question is now edited so it no longer gives the impression of being a duplicate of [How can I detect the encoding/codepage of a text file](http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file). Can you remove the duplicate mark now? — , Aug 06 '16 at 11:13
This is clearly not a text file and so it's not about an encoding and not a duplicate. — H H, Aug 08 '16 at 06:14

score 1 · Answer 1 · 2016-08-06T07:30:09.913

First, let's check the possibility of an encoding-related problem, which is the bane of plain-text files. Use Microsoft Word or Notepad++ to discover the encoding by previewing the file under each available encoding in turn.

In Microsoft Word, go to the menu, "Options", "Advanced", "General" subsection, and put a check beside "Confirm file format conversion on open". Once done, click the OK button. Then open the file in Microsoft Word and preview each encoding until you find one that shows everything correctly.

Once you have found the encoding, use the StreamReader class of the .NET Framework to open the file with that encoding.
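As a minimal sketch of that last step, the following reads a tab-delimited file with an explicitly named encoding. The class name `TabFileReader`, the method `ReadRows`, and the example encoding name are illustrative assumptions; the thread never establishes which encoding, if any, applies to the asker's files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper, not part of the original answer.
public static class TabFileReader
{
    // Reads a tab-delimited text file using an explicitly chosen encoding.
    // encodingName is whatever name the preview step identified,
    // for example "iso-8859-1" or "utf-16".
    public static List<string[]> ReadRows(string path, string encodingName)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        var rows = new List<string[]>();

        // detectEncodingFromByteOrderMarks is false, so a byte order mark
        // in the file cannot override the chosen encoding.
        using (var reader = new StreamReader(path, encoding, false))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                rows.Add(line.Split('\t'));
            }
        }
        return rows;
    }
}
```

A call such as `TabFileReader.ReadRows(path, "iso-8859-1")` returns one string array per line. This only helps if the file really is text; it does nothing useful for a binary file.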

edited Aug 06 '16 at 07:30

answered Aug 05 '16 at 17:12

Thank you. Word and Notepad++ were great ideas. Neither of them could open the file correctly. Each one gives numerous options of encodings to try, but none of them display the file correctly. Now Excel does display the file fine. The issue is that I have a bunch of files like this, so I need to figure out what encoding it is so that I can programmatically read these files. And I couldn't find a place in Excel that would tell me what encoding it was using to open the file. – Rabbi Aug 05 '16 at 18:01
@Rabbi: That's certainly weird. It is possible that what you have is actually a binary file that Excel recognizes and is not a plain text file at all. You can try exporting them from Excel into an actual tab-separated file. Also, I can analyze one of those files for you, although you might not want to do that for privacy reasons. – Aug 06 '16 at 07:24
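One way to test the binary-file suspicion is to look at the raw bytes rather than decoded text. The sketch below is only a heuristic, and the names `BinaryProbe`, `HexHead`, and `LooksBinary` are hypothetical. NUL bytes are not proof on their own, since UTF-16 and UTF-32 text also contain them. In the dump from the question, though, the short names (ID, NAME, ADDR) padded with runs of `\0` look like fixed-width binary records, and the layout resembles a dBASE-style (.dbf) table header. That reading is an inference from the dump, not something confirmed in the thread.

```csharp
using System;
using System.IO;

// Hypothetical helper for inspecting the start of a file as raw bytes.
public static class BinaryProbe
{
    // Returns up to `count` leading bytes of the file as space-separated hex.
    public static string HexHead(string path, int count)
    {
        byte[] head = ReadHead(path, count);
        return BitConverter.ToString(head).Replace('-', ' ');
    }

    // Heuristic: text in a single-byte encoding or UTF-8 does not normally
    // contain NUL bytes, so a NUL near the start suggests binary data
    // (or UTF-16/UTF-32 text, which should be ruled out separately).
    public static bool LooksBinary(string path, int count)
    {
        byte[] head = ReadHead(path, count);
        return Array.IndexOf(head, (byte)0) >= 0;
    }

    private static byte[] ReadHead(string path, int count)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] buffer = new byte[count];
            int read = stream.Read(buffer, 0, count);
            // Trim the buffer when the file is shorter than requested.
            Array.Resize(ref buffer, read);
            return buffer;
        }
    }
}
```

Printing `BinaryProbe.HexHead(path, 32)` for one of the problem files and for a known-good tab-delimited file makes the difference visible without guessing at encodings.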

blaze_125 · Answer 2 · 2016-08-05T17:18:21.077

This way of getting the file encoding has been good to me so far.

http://weblog.west-wind.com/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader

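    using System.IO;
    using System.Text;

    /// <summary>
    /// Detects the byte order mark (BOM) of a file and returns
    /// an appropriate encoding for the file.
    /// Adapted from http://weblog.west-wind.com/posts/2007/Nov/28/Detecting-Text-Encoding-for-StreamReader
    /// </summary>
    /// <param name="srcFile">Path of the file to inspect.</param>
    /// <returns>The encoding indicated by the BOM, or Encoding.Default if there is none.</returns>
    public static Encoding GetFileEncoding(string srcFile)
    {
        // Read up to the first four bytes; a shorter file leaves the rest as zero.
        byte[] buffer = new byte[4];
        using (FileStream file = File.OpenRead(srcFile))
        {
            file.Read(buffer, 0, buffer.Length);
        }

        // Test the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
        // because the UTF-32 LE mark begins with the UTF-16 LE mark.
        if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
            return Encoding.UTF8;
        if (buffer[0] == 0xff && buffer[1] == 0xfe && buffer[2] == 0 && buffer[3] == 0)
            return Encoding.UTF32;                 // UTF-32 little-endian
        if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
            return new UTF32Encoding(true, true);  // UTF-32 big-endian
        if (buffer[0] == 0xff && buffer[1] == 0xfe)
            return Encoding.Unicode;               // UTF-16 little-endian
        if (buffer[0] == 0xfe && buffer[1] == 0xff)
            return Encoding.BigEndianUnicode;      // UTF-16 big-endian
        if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
            return Encoding.UTF7;                  // obsolete in .NET 5 and later

        // No BOM found: fall back to Encoding.Default (the ANSI code page on .NET Framework).
        return Encoding.Default;
    }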

I use it like this:

    // To read
    Encoding currentFileEnc = GetFileEncoding(TheFile);
    using (StreamReader sr = new StreamReader(TheFile, currentFileEnc))
    {
        // Blah blah blah
    }

    // To write back
    using (StreamWriter sw = new StreamWriter(TempFilePath, false, currentFileEnc))
    {
        // Blah blah blah
    }
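For comparison, StreamReader can do the byte order mark check itself when its `detectEncodingFromByteOrderMarks` argument is true. The sketch below wraps that in a hypothetical `BomProbe.DetectFromBom` helper. Like the function above, it only recognizes the Unicode encodings that announce themselves with a BOM, so it cannot identify a non-standard or binary format.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper around StreamReader's built-in BOM detection.
public static class BomProbe
{
    // Returns the encoding StreamReader infers from the file's byte order mark,
    // or `fallback` when the file has no BOM.
    public static Encoding DetectFromBom(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            // CurrentEncoding is only updated after the first read attempt.
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

The explicit `Peek()` matters: reading `CurrentEncoding` immediately after construction returns the fallback, because the reader has not yet looked at the file.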

Thank you. As I said in the question, these files are not in any of the standard encodings. I have tried all of the regulars, and I am not getting usable files. On the other hand, Excel opens them fine. I just need to know how to do that programmatically. Once I have identified this encoding I won't need to check it programmatically; I will just need to write (or find) a conversion function. — Rabbi, Aug 05 '16 at 18:04
The title of your question is "How can you find the encoding of a file c#" — blaze_125, Aug 05 '16 at 18:11
Yes I need a way to find the encoding of this particular file. It's not a standard encoding. Please read the body of the question. — Rabbi, Aug 05 '16 at 18:18
vEdit, and EditPlus are 2 text editors we use to "detect" encoding. But that's not c#. — blaze_125, Aug 05 '16 at 18:20
