0

I've already looked at this and it did not solve my issue https://stackoverflow.com/questions/51079664/c-sharp-error-with-exceldatareader

I've tried building a method that reads an XLS file and converts it to string[] But I get an error when trying to run it: ExcelDataReader.Exceptions.HeaderException: Invalid file signature.

I have tried running it with XLSX and it works fine

The files that I am using have worked before

Note. I have run the same method that worked with XLS before, so I'm confused as to why this error is occurring.(using ExcelDataReader Version 3.6.0)

Here is the code:


private static List<string[]> GetExcelRecords(string path, bool hasHeaders)
        {
            var records = new List<string[]>();
            using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
            {
                using (var reader = ExcelReaderFactory.CreateReader(stream))
                {
                    var sheetFile = reader.AsDataSet().Tables[0];

                    for (int i = 0; i < sheetFile.Rows.Count; i++)
                    {
                        var record = sheetFile.Rows[i];
                        if (hasHeaders)
                        {
                            hasHeaders = false;
                            continue;
                        }
                        var row = record.ItemArray.Select(o => o.ToString()).ToArray();
                        records.Add(row);
                    }
                }
            }
            return records;
        }

The exception occurs on line 4

I have tried using ExcelReaderFactory.CreateBinaryReader and ExcelReaderFactory.CreateOpenXlmReader

Daedalus
  • 1
  • 4
  • 1
    The xls format was deprecated more than 15 years ago. Lots of newer libraries no longer support it. – Joel Coehoorn Dec 02 '22 at 18:58
  • @JoelCoehoorn I have used this same method, library, and version previously and it worked without exception. – Daedalus Dec 02 '22 at 19:05
  • There are variations of .xls. A file could also be defective. You could look into the repo at https://github.com/ExcelDataReader/ExcelDataReader. There are a number of conditions where you'll find `throw new HeaderException(Errors.ErrorHeaderSignature);`. One of those conditions is true. – Scott Hannen Dec 02 '22 at 20:54

2 Answers2

1

This is more for fun (so community wiki), but here's how I'd write it:

private static IEnumerable<string[]> GetExcelRecords(string path, int headerRowsToSkip = 0)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while(headerRowsToSkip-- > 0 && reader.Read()) {} //intentionally empty
        while(reader.Read())
        {
             object[] temp = new object[reader.FieldCount];
             reader.GetValues(temp);
             yield return temp.Select(o => o.ToString()).ToArray();
        }
    }
}

I'm not happy with it yet, specifically this part:

object[] temp = new object[reader.FieldCount];
reader.GetValues(temp);
yield return temp.Select(o => o.ToString()).ToArray();

The problem is we end up with three copies of the data in each iteration: the copy included in the reader, the copy in the object[], and the copy in the string[].

It's also worth mentioning that, thanks to cultural/internationalization issues, converting back and forth between strings and date/numeric values is among the slowest things we do in a computer on a regular basis. If the ExcelDataReader is respecting type information in the sheet and giving you numeric and date values, it is usually a mistake that costs a LOT more performance than you realize to convert everything to string.

We can't avoid copying the data once out of the reader, so we're gonna need two copies no matter what. With that in mind, I think this might be better:

object[] temp = new object[reader.FieldCount];
reader.GetValues(temp);
yield return temp;

Then we also need to change the method signature. This leaves us returning a set ofobject[]s, which is still less than ideal.

The other option is even more transformational: use generics and functional programming concepts to create an actual typed object, instead of just an array:

private static IEnumerable<T> GetExcelRecords<T>(string path, Func<IDataRecord, T> transform, int headerRowsToSkip = 0)
{
    using (var stream = File.Open(path, FileMode.Open, FileAccess.Read))
    using (var reader = ExcelReaderFactory.CreateReader(stream))
    {
        while(headerRowsToSkip-- > 0 && reader.Read()) {} //intentionally empty
        while(reader.Read())
        {
             yield return transform(reader);
        }
    }
}

This punts the hard work to a lambda expression. To explain how to use it, let's say you have a sheet with one header row and columns (in order) for an integer ID, a Date, a decimal price, a double quantity, and a string description. You might call the method like this:

var rows = GetExcelRecords("file path", row => 
    {
       return (row.GetInt32(0), row.GetDateTime(1), row.GetDecimal(2), row.GetDouble(3), row.GetString(4) );
    }, 1);

That used a ValueTuple to avoid declaring a class, while still preserving the original data types and avoiding two (2!) array allocations for every single row. Since we're using IEnumerable<> instead of List<> it also means we only need to keep one row in memory at a time.

I think we've ended up in a good place. The trick I'm using to read past the header rows is maybe a little too clever — relying on boolean short circuit ordering, an operator precedence subtlety, and an empty while loop all in the same line — but I like symmetry here.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
-1

The problem appears that my laptop is corrupting XLS files (I'm unsure why) The error isn't due to the file type, but instead the fact that my files are corrupted.

Daedalus
  • 1
  • 4
  • I had fun playing with this code in my CW answer. You should consider adding the final example as an additional overload option on top of what you're still doing. It was typed directly into the reply window, so may still have a minor syntax issue somewhere, but I'm confident it can perform significantly better while also adding features and perhaps even simplifying surrounding code by avoiding needing to reparse values. – Joel Coehoorn Dec 02 '22 at 21:41
  • @JoelCoehoorn Thank you for the tips. I think I might just try that. I think it'll be good practice to try more performant solutions as you mentioned. – Daedalus Dec 02 '22 at 21:51