6

I need to parse a utf8 encoded csv. After conversion i just saw that the problem is with the BOM () character at the beginging. I cannot create a csv avoiding the BOM with utf8 encoding as i need to parse it even if it is utf8 encoded.

Any one please tell me how can i remove the BOM () character from a csv using c#.net..

Update : I have added my code to read the csv headers since im getting the BOM at the beginning of the file.

 string CSVConnectionString = "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" + ConfigurationSettings.AppSettings["CSVFolder"].ToString() + ";Extensions=asc,csv,tab,txt;Persist Security Info=False;";

        using (OdbcConnection Connection = new OdbcConnection(CSVConnectionString))
        {
            List<string> CSVHeaders = new List<string>();

            string SelectQuery = string.Format(@"SELECT TOP 1 * FROM [{0}]", CSVFileName);

            OdbcCommand Command = new OdbcCommand(SelectQuery, Connection);

            Connection.Open();

            OdbcDataReader Reader = Command.ExecuteReader(System.Data.CommandBehavior.CloseConnection);

            int ColumnCount = Reader.FieldCount;

            for (int column = 0; column < ColumnCount; column++)
            {
                CSVHeaders.Add(Reader.GetName(column));
            }

            return CSVHeaders;
        }
Harun
  • 5,109
  • 4
  • 38
  • 60
  • 2
    You don't need to remove the BOM, you just need to read the file correctly. What does your code look like? – Jeff Mercado Jun 07 '11 at 05:38
  • @ Jeff, I have edited my question with the code – Harun Jun 07 '11 at 05:57
  • I couldn't read the file correctly because it was being read as part of an upload into Solr. I needed to emit the file without the BOM as in Simon's new UTF8Encoding(false) – Brian Leeming Sep 06 '12 at 15:12

3 Answers3

4

Actually, C# can read UTF-8 encoded files containing a BOM just fine. It's the broken CSV text driver you're using that's actually causing the problem. I'd recommend one of the other CSV reading solutions from this answer.

Community
  • 1
  • 1
Daniel Pryden
  • 59,486
  • 16
  • 97
  • 135
  • if you really want to use the csvtext driver in your code snippet, you may use a `FileReader` and `FileWriter` combination to create an `Path.GetTempFilename()` copy of the file without the BOMs – eFloh Jun 07 '11 at 06:54
  • This works great... Thanks for your answer. If i were not done till now i would definitely used this. I can't re-run my full test cases after using the new external dll.. – Harun Jun 07 '11 at 07:38
  • @Harun: Not sure what you mean by "new external DLL"... you mean that you can't add new references to this project? In that case, perhaps you can use the [Microsoft.VisualBasic.FileIO.TextFieldParser](http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx), since it's part of the .NET Framework. – Daniel Pryden Jun 07 '11 at 15:36
4

Here is a function that does this:

    public static void SaveAsUTF8WithoutByteOrderMark(string fileName)
    {
        SaveAsUTF8WithoutByteOrderMark(fileName, null);
    }

    public static void SaveAsUTF8WithoutByteOrderMark(string fileName, Encoding encoding)
    {
        if (fileName == null)
            throw new ArgumentNullException("fileName");

        if (encoding == null)
        {
            encoding = Encoding.Default;
        }

        File.WriteAllText(fileName, File.ReadAllText(fileName, encoding), new UTF8Encoding(false));
    }
Simon Mourier
  • 132,049
  • 21
  • 248
  • 298
  • Does this suitable if the file is not encoded or encoded with some other? I mean, should i use this only for utf8 encoded files. – Harun Jun 07 '11 at 06:55
  • I wouldn't recommend this way for big files. – svick Jun 07 '11 at 06:57
  • @Harun - No, this function works only on UTF8 input files, or if the input file contains only ASCII characters. – Simon Mourier Jun 07 '11 at 07:02
  • @Simon, So i need to treat the files with out utf8 encoding seperately, right? – Harun Jun 07 '11 at 07:06
  • 1
    @Harun - yes, and to ease that, I have added an overload to be able to specify the input file encoding. By default, it's the Default encoding (ANSI), but if the input file is UTF8, you can do SaveAsUTF8WithoutByteOrderMark(file, Encoding.UTF8) – Simon Mourier Jun 07 '11 at 07:11
  • @svick - please post your answer – Simon Mourier Jun 07 '11 at 07:11
  • @simon, Ok.. thanks for the answer. Could you please suggest the other encodings that should be taken in to consideration in case of a csv file? Now i just encountered the problem with utf8. – Harun Jun 07 '11 at 07:21
  • @Simon, I can't point out disadvantages of an answer unless I write a better one? – svick Jun 07 '11 at 07:36
  • @svick - the term "recommend" certainly means 1) there are problems with this solution, and 2) you envisage another solution that fixes it, so you could share it with us. – Simon Mourier Jun 07 '11 at 08:20
  • @simon, Ok.. thanks for the answer. Could you please suggest the other encodings that should be taken in to consideration in case of a csv file? Now i just encountered the problem with utf8. – Harun Jun 07 '11 at 08:24
2

Instead of changing horses (use another .csv driver) or help the given horse by pulling the wagon yourself (change the encoding), you should tell the horse (the standard ODBC Text driver) what it needs to know to do the job by adding a schema.ini file:

[withbomgood.txt]
Format=TabDelimited
ColNameHeader=True
CharacterSet=65001
Col1=FrsColümn CHAR

to define the format of withbomgood.txt:

FrsColümn
whätever

which is an exact copy of withbombad.txt; both files have a BOM:

FrsColümn
whätever

If you now call a slightly modified copy

static void Harun00(string CSVFileName)
{
    string CSVFilePath = @"E:\trials\SoTrials\answers\6260911\data";
    string CSVConnectionString = 
        "Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=" + 
        CSVFilePath +
        ";Extensions=asc,csv,tab,txt;Persist Security Info=False;";

    using (OdbcConnection Connection = new OdbcConnection(CSVConnectionString))
    {
        List<string> CSVHeaders = new List<string>();

        string SelectQuery = string.Format(@"SELECT TOP 1 * FROM [{0}]", CSVFileName);

        OdbcCommand Command = new OdbcCommand(SelectQuery, Connection);

        Connection.Open();

        OdbcDataReader Reader = Command.ExecuteReader(System.Data.CommandBehavior.CloseConnection);

        int ColumnCount = Reader.FieldCount;

        for (int column = 0; column < ColumnCount; column++)
        {
            CSVHeaders.Add(Reader.GetName(column));
        }

        Console.WriteLine(CSVHeaders[0]);
    }
}

of your code twice:

static void Main(string[] args)
{
    Harun00("withbombad.txt");
    Harun00("withbomgood.txt");
}

you get:

FrsColümn
FrsColümn
Press any key to continue . . .

which proves that the driver will read an UTF8 with BOM file correctly and without any further ADO if you follow the rule: define your csv tables in a schema.ini file.

Ekkehard.Horner
  • 38,498
  • 2
  • 45
  • 96
  • Although cryptic, this answer actually worked for me. To clarify, I specified all columns using the `Col#={ColumnName} {Type}` format, except that for the first column I specified `CHAR` for the Type, even though in my case it was really a `Long`. Reference [http://msdn.microsoft.com/en-us/library/windows/desktop/ms709353(v=vs.85).aspx]MSDN. – Vic Oct 23 '12 at 23:29