
I am using the CsvHelper library to write to a CSV file. However, some of my CSV files use varying encodings, causing garbled characters to appear in the written data. Below is a code sample showing how I currently handle this in my project. How could I solve it?

    public async Task WriteAsync<T>(string path, T record)
    {
        bool containsNewLines = ContainsNewLines(path);
        using (var stream = File.Open(path, FileMode.Append))
        using (var writer = new StreamWriter(stream, Encoding.UTF8))
        using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            if (!containsNewLines)
            {
                await csv.NextRecordAsync();
            }
            csv.WriteRecord(record);
            await csv.NextRecordAsync();
        }
    }

    private bool ContainsNewLines(string filePath)
    {
        using (var reader = new StreamReader(filePath))
        {
            string content = reader.ReadToEnd();
            return content.EndsWith(Environment.NewLine);
        }
    }
  • _"However, some of my CSV files have varying encodings"_ - how is that? You only seem to be writing UTF8? – Fildor Aug 07 '23 at 14:16
  • Your code specifies UTF8, not varying encodings. Are you trying to *append* to existing files? In that case *the application code* must guess what the correct encoding is and specify that instead of UTF8. Except for Unicode files, there's no way to tell what encoding a file uses unless you read as much content as possible (possibly the entire file) until you find a byte that's invalid in one encoding or another. If the entire file contains English letters except the very last, you'll have to read the entire file to the end – Panagiotis Kanavos Aug 07 '23 at 14:16
  • You are also aware that _for each call of your `WriteAsync`_ you read the whole file into memory? When the only thing you want to know is if the last character is a newline? – Fildor Aug 07 '23 at 14:20
  • The real answer is to *not* use multiple encodings, but ensure everything is UTF8 (or UTF16, which contains an encoding identifier). Encoding-detection packages check for invalid bytes *and* for character distribution to guess the correct encoding for a file. Multiple encodings may map the same byte to different characters, so even if you find one encoding that doesn't generate errors, you can't be certain it's the correct one – Panagiotis Kanavos Aug 07 '23 at 14:20
  • You can make an attempt to figure out the encoding: https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding. However, to be certain, you are going to have to ask the person uploading the file what the encoding of the file is. – David Specht Aug 07 '23 at 14:39

1 Answer


The problem is that there is no fail-safe way to determine the encoding of a file. As Panagiotis Kanavos states in the comments, the best answer is to either require that all files use a single encoding, say UTF8, or else have the creator of each file tell you its encoding.

That said, it is possible to take a guess at the encoding. Here is a modification of Berthier Lemieux's answer for detecting file encoding. The method reads the whole file and either determines the encoding from the byte order mark or assumes UTF8. If the reader throws a decoding exception while reading as UTF8, the method falls back to your preferred ANSI encoding.

public Encoding DetectFileEncoding(string fileName, Encoding defaultEncoding)
{
    // A UTF8 encoding that throws instead of silently substituting
    // replacement characters, so invalid byte sequences are detected.
    var utf8Verifier = Encoding.GetEncoding("utf-8",
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileName, utf8Verifier,
            detectEncodingFromByteOrderMarks: true, bufferSize: 1024))
    {
        try
        {
            while (!reader.EndOfStream)
            {
                _ = reader.ReadLine();
            }
            return reader.CurrentEncoding;
        }
        catch (DecoderFallbackException)
        {
            // Failed to decode the file using the BOM/UTF8;
            // fall back to the default ANSI encoding.
            return defaultEncoding;
        }
    }
}

You can then use DetectFileEncoding to choose the encoding for the StreamWriter. If your files are unlikely to be Latin1 (ISO-8859-1), substitute whichever default encoding works best for you.

public async Task WriteAsync<T>(string path, T record)
{
    bool containsNewLines = ContainsNewLines(path);
    Encoding fileEncoding = DetectFileEncoding(path, Encoding.Latin1);

    using (var stream = File.Open(path, FileMode.Append))
    using (var writer = new StreamWriter(stream, fileEncoding))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        if (!containsNewLines)
        {
            await csv.NextRecordAsync();
        }
        csv.WriteRecord(record);
        await csv.NextRecordAsync();
    }
}
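As Fildor points out in the comments, ContainsNewLines reads the entire file into memory just to check whether it ends with a newline. A possible sketch of a cheaper check that seeks to the end of the file and inspects only the trailing byte (the name EndsWithNewLine is mine, and it assumes an ASCII-compatible encoding such as UTF8 or Latin1; a UTF16 file would need a two-byte check):

```csharp
using System.IO;

public static class NewLineCheck
{
    // Checks only the last byte of the file instead of reading it all.
    public static bool EndsWithNewLine(string filePath)
    {
        using (var stream = File.OpenRead(filePath))
        {
            if (stream.Length == 0)
            {
                // An empty file needs no separating newline.
                return true;
            }
            stream.Seek(-1, SeekOrigin.End);
            // '\n' is the final byte of both "\n" and "\r\n".
            return stream.ReadByte() == '\n';
        }
    }
}
```

This trades the full read for a single seek, which matters if WriteAsync is called once per record on a large file.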
David Specht