
I'm using CsvHelper to import a CSV like so. It's being done line by line to deal with type conversion errors: I can then import the rows that worked and report the ones that failed. It's exceedingly fast when there are no errors, about 1 second for 3000k lines. But if there are lots of errors, and let's say all 3000k lines have type conversion issues, the time to complete is 60 seconds. Any advice on speeding up error handling?

csv.Read();
csv.ReadHeader();
while (await csv.ReadAsync())
{
    try
    {
        var record = csv.GetRecord<CollectionQueue>();
    }
    catch (TypeConverterException conversionEx)
    {
        var errorMessage = $"Error converting value '{conversionEx.Text}' to type '{conversionEx.MemberMapData.Type.Name}' for field '{conversionEx.MemberMapData.Names[0]}'";
    }
}
Paul McTigue
  • Throwing exceptions is known to be somewhat expensive, see [How expensive are exceptions in C#?](https://stackoverflow.com/q/891217) and [Exceptions and Performance](https://learn.microsoft.com/en-us/dotnet/standard/design-guidelines/exceptions-and-performance). The suggested alternative is to use a `TryParse()` method, but CsvHelper does not seem to have one currently, see [Question : best practise reading csv and displaying detailed errors #1276](https://github.com/JoshClose/CsvHelper/issues/1276). – dbc Dec 03 '22 at 16:21
  • Is there any way you could do a "quick reject" that can quickly determine that a row is malformed, that could apply to most of the 3000k bad lines? One simple quick reject would be to check the number of values, is that often the problem? If not, you may need to `TryParse()` each value manually. – dbc Dec 03 '22 at 16:38
  • Yes, using exceptions to get data is best avoided if speed is a concern, but the error context data in CsvHelper is very good. Something like a `TryGetRecord` that returns a Result containing a value on success and a context with the error details might be a nice feature to add – Paul McTigue Dec 03 '22 at 23:29
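The `TryGetRecord` pattern floated in the comments does not exist in CsvHelper; a minimal sketch of such an extension method could look like the following. Note that the method name and `out` signature are hypothetical, and this only packages the try/catch — it does not avoid the exception cost on failing rows:

```csharp
using CsvHelper;
using CsvHelper.TypeConversion;

public static class CsvReaderExtensions
{
    // Hypothetical TryGetRecord: wraps GetRecord<T>() in a try/catch and
    // surfaces the error message instead of throwing to the caller.
    // The exception is still thrown and caught internally, so this is a
    // convenience for the calling code, not a performance fix.
    public static bool TryGetRecord<T>(this CsvReader csv, out T record, out string error)
    {
        try
        {
            record = csv.GetRecord<T>();
            error = null;
            return true;
        }
        catch (TypeConverterException ex)
        {
            record = default;
            error = $"Error converting value '{ex.Text}' to type " +
                    $"'{ex.MemberMapData.Type.Name}' for field '{ex.MemberMapData.Names[0]}'";
            return false;
        }
    }
}
```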

1 Answer


Using some of the ideas from @dbc, I have a couple of suggestions.

I noticed about a 30% increase in speed if I used Configuration.ReadingExceptionOccurred to catch the exception.

async void Main()
{
    List<Foo> records = new List<Foo>();
    
    for (int i = 0; i < 300000; i++)
    {
        var amount = (i + 1).ToString();
        
        if (i % 5 == 0)
        {
            amount = $"${i + 1}";
        }
        
        records.Add(new Foo {Id = i + 1, Amount = amount});
    }
    
    using (var writer = new StreamWriter(@"C:\Temp\TypeConverterException.csv"))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        csv.WriteRecords(records);
    }
    
    var errors = new List<string>();
    
    var config = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        ReadingExceptionOccurred = args =>
        {
            // Record type conversion errors and keep reading; rethrow anything else.
            if (args.Exception is TypeConverterException conversionEx)
            {
                errors.Add(
                    $"Error on line {conversionEx.Context.Parser.RawRow} converting value '{conversionEx.Text}' to type '{conversionEx.MemberMapData.Type.Name}' for field '{conversionEx.MemberMapData.Names[0]}'"
                );

                return false;
            }

            return true;
        }
    
    using (var reader = new StreamReader(@"C:\Temp\TypeConverterException.csv"))
    using (var csv = new CsvReader(reader, config))
    {
        var start = DateTime.Now;
        csv.Read();
        csv.ReadHeader();
        
        var barRecords = new List<Bar>();
        
        while (await csv.ReadAsync())
        {           
            var record = csv.GetRecord<Bar>();
            
            if (record != null)
                barRecords.Add(record);         
        }
        
        var end = DateTime.Now;
        
        var difference = end - start;
        
        difference.TotalSeconds.Dump();
    }
}

public class Foo
{
    public int Id { get; set; }
    public string Amount { get; set; }
}

public class Bar
{
    public int Id { get; set; }
    public decimal Amount { get; set; }
}

However, it was considerably faster to use another class to import the records, with string for the properties that could have conversion errors, and then manually do a TryParse() before converting to the final class that I wanted.

async void Main()
{
    List<Foo> records = new List<Foo>();
    
    for (int i = 0; i < 300000; i++)
    {
        var amount = (i + 1).ToString();
        
        if (i % 5 == 0)
        {
            amount = $"${i + 1}";
        }
        
        records.Add(new Foo {Id = i + 1, Amount = amount});
    }
    
    using (var writer = new StreamWriter(@"C:\Temp\TypeConverterException.csv"))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        csv.WriteRecords(records);
    }
    
    var errors = new List<string>();
    
    using (var reader = new StreamReader(@"C:\Temp\TypeConverterException.csv"))
    using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
    {
        var start = DateTime.Now;
        
        var fooRecords = csv.GetRecordsAsync<Foo>();
        var barRecords = new List<Bar>();
        
        await foreach (var record in fooRecords)
        {
            var canParse = decimal.TryParse(record.Amount, out decimal barAmount);
            
            if (canParse)
            {
                barRecords.Add(new Bar { Id = record.Id, Amount = barAmount});
            } 
            else
            {
                errors.Add($"Error converting value '{record.Amount}' to decimal for field Amount");
            }
        }

        var end = DateTime.Now;

        var difference = end - start;

        difference.TotalSeconds.Dump();
    }
}

public class Foo
{
    public int Id { get; set; }
    public string Amount { get; set; }
}

public class Bar
{
    public int Id { get; set; }
    public decimal Amount { get; set; }
}
David Specht
  • That crossed my mind too, using strings for every property and converting manually. I can use csv.TryGetField. That is fast but still requires the fields to be manually mapped and converted. But it's quick, it works, and it will give a user enough of a clue as to where things are going wrong – Paul McTigue Dec 03 '22 at 23:44
  • I liked your idea of `csv.TryGetField`. You could just mark the fields with `[Ignore]` that could have a conversion issue. Use `csv.GetRecord` to get the rest of the fields and then you would just need to use `csv.TryGetField` for those that were ignored. However, I just tried it with `csv.TryGetField` and it was only slightly faster than my first example with `Configuration.ReadingExceptionOccurred`. – David Specht Dec 04 '22 at 04:17
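The `[Ignore]` plus `csv.TryGetField` approach from the last comment might be sketched as follows, reusing the `Bar` shape from the answer. This is a sketch, not tested against the answer's benchmark; it assumes CsvHelper's `Ignore` attribute and `TryGetField<T>(string name, out T field)` overload:

```csharp
using System.Globalization;
using CsvHelper;
using CsvHelper.Configuration.Attributes;

// Let CsvHelper map the safe fields automatically, mark the risky one
// with [Ignore], then fetch it manually with TryGetField.
public class BarStaging
{
    public int Id { get; set; }

    [Ignore] // skip automatic conversion for the field that may fail
    public decimal Amount { get; set; }
}

// Inside the read loop:
// while (await csv.ReadAsync())
// {
//     var record = csv.GetRecord<BarStaging>();   // maps Id only
//
//     if (csv.TryGetField<decimal>("Amount", out var amount))
//         record.Amount = amount;
//     else
//         errors.Add($"Error on line {csv.Parser.RawRow} converting field 'Amount'");
// }
```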