
I'm reading a CSV file and the records are recorded as a string[]. I want to take each record and convert it into a custom object.

T GetMyObject<T>();

Currently I'm doing this through reflection, which is really slow. I'm testing with a 515 MB file containing several million records. It takes under 10 seconds to parse. It takes under 20 seconds to create the custom objects using manual conversions with Convert.ToSomeType, but around 4 minutes to do the conversion to the objects through reflection.

What is a good way to handle this automatically?

It seems a lot of time is spent in the PropertyInfo.SetValue method. I tried caching the property's setter MethodInfo and invoking that instead, but it was actually slower.

I have also tried converting that into a delegate, as the great Jon Skeet suggested in "Improving performance reflection, what alternatives should I consider", but the problem is that I don't know the property type ahead of time. I'm able to get the delegate:

var myObject = Activator.CreateInstance<T>();
foreach( var property in typeof( T ).GetProperties() )
{
    var d = Delegate.CreateDelegate( typeof( Action<,> )
    .MakeGenericType( typeof( T ), property.PropertyType ), property.GetSetMethod() );
}

The problem here is I can't cast the delegate into a concrete type like Action<T, int>, because the property type of int isn't known ahead of time.

Josh Close

2 Answers


The first thing I'd say is write some sample code manually that tells you what the absolute best case you can expect is - see if your current code is worth fixing.
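
As a rough illustration of such a hand-written baseline, the sketch below maps one string[] record to an object with direct, statically typed conversions. The `Person` type, its column order, and the `ManualBaseline` class are hypothetical names invented for this example; they are not from the question.

```csharp
using System;
using System.Globalization;

// Hypothetical record type, used only for this illustration.
class Person
{
    public int Id { get; set; }
    public DateTime DateOfBirth { get; set; }
    public string Name { get; set; }
}

static class ManualBaseline
{
    // Hand-written mapping: every conversion is a direct call with no reflection.
    // Assumes the columns arrive in the order Id, DateOfBirth, Name.
    public static Person Map(string[] record)
    {
        return new Person
        {
            Id = int.Parse(record[0], CultureInfo.InvariantCulture),
            DateOfBirth = DateTime.Parse(record[1], CultureInfo.InvariantCulture),
            Name = record[2]
        };
    }
}
```

Timing a loop over `ManualBaseline.Map` for every parsed record gives the figure that any automatic approach should be compared against.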

If you are using PropertyInfo.SetValue etc., then absolutely you can make it quicker, even with just object - HyperDescriptor might be a good start (it is significantly faster than raw reflection, but without making the code any more complicated).

For optimal performance, dynamic IL methods are the way to go (precompiled once); in 2.0/3.0, maybe DynamicMethod, but in 3.5 I'd favor Expression (with Compile()). Let me know if you want more detail?
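
A minimal sketch of the Expression route for a single property, assuming the value arrives boxed as object so that the property type does not need to be known at compile time. The `SetterFactory` and `Customer` names are hypothetical and exist only for this example.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical type, used only for this illustration.
class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
}

static class SetterFactory
{
    // Builds (target, value) => target.set_Property((PropertyType)value) once.
    // The compiled delegate can be cached and reused for every row.
    // Assumes the property has a public setter and that T is a reference type.
    public static Action<T, object> Create<T>(PropertyInfo property)
    {
        ParameterExpression target = Expression.Parameter(typeof(T), "target");
        ParameterExpression value = Expression.Parameter(typeof(object), "value");

        // Calling the setter method directly works on .NET 3.5,
        // where Expression.Assign is not available.
        MethodCallExpression body = Expression.Call(
            target,
            property.GetSetMethod(),
            Expression.Convert(value, property.PropertyType));

        return Expression.Lambda<Action<T, object>>(body, target, value).Compile();
    }
}
```

For example, `SetterFactory.Create<Customer>(typeof(Customer).GetProperty("Id"))` returns a delegate that can be called as `setId(customer, 7)`. The value is still boxed on each call, so this sits between raw reflection and fully typed generated code.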


Here is an implementation using Expression and CsvReader that uses the column headers to provide the mapping (it invents some data along the same lines). It returns IEnumerable<T> to avoid having to buffer the data, since you seem to have quite a lot of it:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using LumenWorks.Framework.IO.Csv;
class Entity
{
    public string Name { get; set; }
    public DateTime DateOfBirth { get; set; }
    public int Id { get; set; }

}
static class Program {

    static void Main()
    {
        string path = "data.csv";
        InventData(path);

        int count = 0;
        foreach (Entity obj in Read<Entity>(path))
        {
            count++;
        }
        Console.WriteLine(count);
    }
    static IEnumerable<T> Read<T>(string path)
        where T : class, new()
    {
        using (TextReader source = File.OpenText(path))
        using (CsvReader reader = new CsvReader(source,true,delimiter)) {

            string[] headers = reader.GetFieldHeaders();
            Type type = typeof(T);
            List<MemberBinding> bindings = new List<MemberBinding>();
            ParameterExpression param = Expression.Parameter(typeof(CsvReader), "row");
            MethodInfo method = typeof(CsvReader).GetProperty("Item",new [] {typeof(int)}).GetGetMethod();
            Expression invariantCulture = Expression.Constant(
                CultureInfo.InvariantCulture, typeof(IFormatProvider));
            for(int i = 0 ; i < headers.Length ; i++) {
                MemberInfo member = type.GetMember(headers[i]).Single();
                Type finalType;
                switch (member.MemberType)
                {
                    case MemberTypes.Field: finalType = ((FieldInfo)member).FieldType; break;
                    case MemberTypes.Property: finalType = ((PropertyInfo)member).PropertyType; break;
                    default: throw new NotSupportedException();
                }
                // row[i]: read the i-th field of the current record as a string
                Expression val = Expression.Call(
                    param, method, Expression.Constant(i, typeof(int)));
                if (finalType != typeof(string))
                {
                    // call the static finalType.Parse(string, IFormatProvider)
                    val = Expression.Call(
                        finalType, "Parse", null, val, invariantCulture);
                }
                bindings.Add(Expression.Bind(member, val));
            }

            // new T { Member1 = ..., Member2 = ..., ... }
            Expression body = Expression.MemberInit(
                Expression.New(type), bindings);

            // compile once, then reuse the delegate for every row
            Func<CsvReader, T> func = Expression.Lambda<Func<CsvReader, T>>(body, param).Compile();
            while (reader.ReadNextRecord()) {
                yield return func(reader);
            }
        }
    }
    const char delimiter = '\t';
    static void InventData(string path)
    {
        Random rand = new Random(123456);
        using (TextWriter dest = File.CreateText(path))
        {
            dest.WriteLine("Id" + delimiter + "DateOfBirth" + delimiter + "Name");
            for (int i = 0; i < 10000; i++)
            {
                dest.Write(rand.Next(5000000));
                dest.Write(delimiter);
                dest.Write(new DateTime(
                    rand.Next(1960, 2010),
                    rand.Next(1, 13),
                    rand.Next(1, 28)).ToString(CultureInfo.InvariantCulture));
                dest.Write(delimiter);
                dest.Write("Fred");
                dest.WriteLine();
            }
            dest.Close();
        }
    }
}

Second version (see comments) that uses TypeConverter rather than Parse:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using LumenWorks.Framework.IO.Csv;
class Entity
{
    public string Name { get; set; }
    public DateTime DateOfBirth { get; set; }
    public int Id { get; set; }

}
static class Program
{

    static void Main()
    {
        string path = "data.csv";
        InventData(path);

        int count = 0;
        foreach (Entity obj in Read<Entity>(path))
        {
            count++;
        }
        Console.WriteLine(count);
    }
    static IEnumerable<T> Read<T>(string path)
        where T : class, new()
    {
        using (TextReader source = File.OpenText(path))
        using (CsvReader reader = new CsvReader(source, true, delimiter))
        {

            string[] headers = reader.GetFieldHeaders();
            Type type = typeof(T);
            List<MemberBinding> bindings = new List<MemberBinding>();
            ParameterExpression param = Expression.Parameter(typeof(CsvReader), "row");
            MethodInfo method = typeof(CsvReader).GetProperty("Item", new[] { typeof(int) }).GetGetMethod();

            var converters = new Dictionary<Type, ConstantExpression>();
            for (int i = 0; i < headers.Length; i++)
            {
                MemberInfo member = type.GetMember(headers[i]).Single();
                Type finalType;
                switch (member.MemberType)
                {
                    case MemberTypes.Field: finalType = ((FieldInfo)member).FieldType; break;
                    case MemberTypes.Property: finalType = ((PropertyInfo)member).PropertyType; break;
                    default: throw new NotSupportedException();
                }
                Expression val = Expression.Call(
                    param, method, Expression.Constant(i, typeof(int)));
                if (finalType != typeof(string))
                {
                    // one shared TypeConverter constant per target type
                    ConstantExpression converter;
                    if (!converters.TryGetValue(finalType, out converter))
                    {
                        converter = Expression.Constant(TypeDescriptor.GetConverter(finalType));
                        converters.Add(finalType, converter);
                    }
                    // ConvertFromInvariantString returns object, so cast the result to the member type
                    val = Expression.Convert(Expression.Call(converter, "ConvertFromInvariantString", null, val),
                        finalType);
                }
                bindings.Add(Expression.Bind(member, val));
            }

            Expression body = Expression.MemberInit(
                Expression.New(type), bindings);

            Func<CsvReader, T> func = Expression.Lambda<Func<CsvReader, T>>(body, param).Compile();
            while (reader.ReadNextRecord())
            {
                yield return func(reader);
            }
        }
    }
    const char delimiter = '\t';
    static void InventData(string path)
    {
        Random rand = new Random(123456);
        using (TextWriter dest = File.CreateText(path))
        {
            dest.WriteLine("Id" + delimiter + "DateOfBirth" + delimiter + "Name");
            for (int i = 0; i < 10000; i++)
            {
                dest.Write(rand.Next(5000000));
                dest.Write(delimiter);
                dest.Write(new DateTime(
                    rand.Next(1960, 2010),
                    rand.Next(1, 13),
                    rand.Next(1, 28)).ToString(CultureInfo.InvariantCulture));
                dest.Write(delimiter);
                dest.Write("Fred");
                dest.WriteLine();
            }
            dest.Close();
        }
    }
}
Marc Gravell
  • You can't do assignment in expressions. – adrianm Jan 11 '10 at 20:00
  • For new objects you can (which is what we are doing here) - how do you think lambdas such as `new {Name = x.Name, Id = x.Id}` work. – Marc Gravell Jan 11 '10 at 20:18
  • I see how using DynamicMethod would work from Darin Dimitrov's link. How would you use Expression to do this? I can see creating a mapping file where you say what field is mapped to what property by doing Map( m => m.FirstName, "FirstName" ) like FluentNHibernate does. Is that what you were thinking, or something else? I'd really rather not create another file for this. If that's the case, using DynamicMethod would be better. – Josh Close Jan 11 '10 at 20:37
  • I did some sample code manually which I mentioned was under 20 seconds, so that would be my goal. – Josh Close Jan 11 '10 at 21:07
  • If you notice, there is lots of code there, but it only does the complex stuff *once*, compiling it into a `Func<,>` which it re-uses for the rows. – Marc Gravell Jan 11 '10 at 21:14
  • @Marc Gravell Yes, I see. I like it. A lot more readable than DynamicMethod emitting too. Doing this had very fast results also. I'm actually using TypeConverters to do the type conversion, which seems to be the slowdown now. I may need to re-think that portion. – Josh Close Jan 11 '10 at 22:03
  • I'm very familiar with TypeConverter; that would be fine if "good enough is", perhaps with HyperDescriptor (since both use boxing). But if you need the fastest possible it is better to bypass these (albeit minor) overheads, and use things like `Parse`. Or mix and match ;-p – Marc Gravell Jan 11 '10 at 22:21
  • Yeah, I'm going to try and eliminate it where I can. Currently I'm just using it for everything, just to get everything working properly. Obviously, if the type is a string, then no conversion is needed, which I'm currently not even handling. I'm actually looking for TypeConverterAttribute on the properties and if one is specified, use that. – Josh Close Jan 11 '10 at 23:57
  • Ok. I have it implemented and it's pretty fast. I'm using parse if parse is available, otherwise grabbing the default type converter for the type, and nothing if string. The only problem I'm having now is if the type is Guid. Doing TypeDescriptor.GetConverter( property.PropertyType ) to get the converter, then getting the method by typeConverter.GetType().GetMethod( "ConvertFrom", new[] { typeof( string ) } ), then passing that into Expression.Call( Expression.New( typeConverter.GetType() ), convertFromMethod, fieldExpression ). When binding this I get the error "Argument types do not match". – Josh Close Jan 13 '10 at 06:03
  • In this case I would use the Guid ctor that accepts a string: `ConstructorInfo guidCtor = typeof(Guid).GetConstructor(new[] {typeof(string)});`, and use (for `Guid`) `val = Expression.New(guidCtor, val);` – Marc Gravell Jan 13 '10 at 08:12
  • I found that no type converters work. They all have the same "Argument types do not match" issue. What is the proper way to create a type converter and call ConvertFrom on it using Expressions? – Josh Close Jan 13 '10 at 15:55
  • Added `TypeConverter` example (in this case *all* type-converter, but you could mix and match easily enough) – Marc Gravell Jan 13 '10 at 19:31
  • Thanks a lot Marc! Is there a good place to learn Expressions? The comments on MSDN and intellisense don't really help explaining what things do. You seem to know them pretty well, which may just be from using them. :P – Josh Close Jan 13 '10 at 20:07
  • I've blogged about it a bit, including tricks for learning them: http://marcgravell.blogspot.com/search/label/expression (read bottom up) - but apparently the MSDN documentation for 4.0 is better. – Marc Gravell Jan 13 '10 at 20:37
  • @Marc Gravell You can view my implementation here http://github.com/JoshClose/CsvHelper/blob/master/src/CsvHelper/CsvReader.cs – Josh Close Jan 14 '10 at 05:14
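
The "mix and match" idea from the comments above can be sketched roughly as follows: no conversion for string, the Guid constructor for Guid, a static Parse(string, IFormatProvider) where the type has one, and a TypeConverter otherwise. The `ConversionBuilder` class and its `Build` method are hypothetical names for this example. One plausible reason for the "Argument types do not match" error is that the converter methods return object, so the sketch wraps that call in Expression.Convert, as the second version of the answer does.

```csharp
using System;
using System.ComponentModel;
using System.Globalization;
using System.Linq.Expressions;
using System.Reflection;

static class ConversionBuilder
{
    // Returns an expression that converts the string-typed 'val' to 'targetType'.
    public static Expression Build(Expression val, Type targetType)
    {
        // Strings need no conversion at all.
        if (targetType == typeof(string)) return val;

        // Guid: use the constructor that accepts a string.
        if (targetType == typeof(Guid))
        {
            ConstructorInfo guidCtor = typeof(Guid).GetConstructor(new[] { typeof(string) });
            return Expression.New(guidCtor, val);
        }

        // Prefer a static Parse(string, IFormatProvider) when the type has one.
        MethodInfo parse = targetType.GetMethod("Parse",
            BindingFlags.Public | BindingFlags.Static, null,
            new[] { typeof(string), typeof(IFormatProvider) }, null);
        if (parse != null)
        {
            return Expression.Call(parse, val,
                Expression.Constant(CultureInfo.InvariantCulture, typeof(IFormatProvider)));
        }

        // Fallback: the type's default TypeConverter. The call returns object,
        // so the result must be converted to the target type before binding.
        MethodInfo convert = typeof(TypeConverter).GetMethod(
            "ConvertFromInvariantString", new[] { typeof(string) });
        Expression converter = Expression.Constant(
            TypeDescriptor.GetConverter(targetType), typeof(TypeConverter));
        return Expression.Convert(Expression.Call(converter, convert, val), targetType);
    }
}
```

In the Read<T> method above, the body of the `if (finalType != typeof(string))` block could then be replaced by a call such as `val = ConversionBuilder.Build(val, finalType);`.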

You should make a DynamicMethod or an expression tree and build statically typed code at runtime.

This will incur a rather large setup cost, but no per-object overhead at all.
However, it's somewhat difficult to do, and will result in complicated code that is difficult to debug.
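
A minimal sketch of the DynamicMethod variant, assuming a reference type with a public setter and a value passed as object. The `EmittedSetter` and `Product` names are hypothetical and exist only for this example.

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

// Hypothetical type, used only for this illustration.
class Product
{
    public int Quantity { get; set; }
    public string Title { get; set; }
}

static class EmittedSetter
{
    // Emits IL equivalent to: (target, value) => target.Property = (PropertyType)value
    public static Action<T, object> Create<T>(PropertyInfo property) where T : class
    {
        MethodInfo setter = property.GetSetMethod();

        // Associate the method with T's module and skip visibility checks
        // so that non-public types can be used as well.
        var method = new DynamicMethod(
            "Set_" + property.Name,
            typeof(void),
            new[] { typeof(T), typeof(object) },
            typeof(T).Module,
            true);

        ILGenerator il = method.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);                           // push the target instance
        il.Emit(OpCodes.Ldarg_1);                           // push the boxed value
        il.Emit(OpCodes.Unbox_Any, property.PropertyType);  // unbox value types, cast reference types
        il.Emit(OpCodes.Callvirt, setter);                  // call the property setter
        il.Emit(OpCodes.Ret);

        return (Action<T, object>)method.CreateDelegate(typeof(Action<T, object>));
    }
}
```

As with the expression-tree approach, the setup cost is paid once per property; the resulting delegate should be cached and reused for every record.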

SLaks