Here's a rather overengineered but really fun generic method, where you give attributes to your class properties to match them to CSV column headers:
The first step is to parse your CSV. There are a variety of methods out there, but my favourite is the TextFieldParser, which can be found in the Microsoft.VisualBasic.FileIO namespace. The advantage of using this is that it's 100% native; all you need to do is add a reference to Microsoft.VisualBasic.
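For reference, a helper along those lines might look like this. The SplitFile name and signature match the usage shown at the end of this answer, but the parser settings here (quote handling in particular) are my own assumptions for a straightforward delimited file, so treat this as a sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualBasic.FileIO;

public static class CsvSplitter
{
    // Reads the whole file into memory as a list of split lines.
    public static List<String[]> SplitFile(String path, Encoding encoding, Char delimiter)
    {
        List<String[]> rows = new List<String[]>();
        using (TextFieldParser parser = new TextFieldParser(path, encoding))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(delimiter.ToString());
            // Handles quoted fields that contain the delimiter.
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```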
Having done that, you have the data as a List<String[]>. Now things get interesting: we can create a custom attribute and add it to our class properties.
The attribute class:
using System;
using System.Text.RegularExpressions;

[AttributeUsage(AttributeTargets.Property)]
public sealed class CsvColumnAttribute : System.Attribute
{
    public String Name { get; private set; }
    public Regex ValidationRegex { get; private set; }

    public CsvColumnAttribute(String name)
        : this(name, null) { }

    public CsvColumnAttribute(String name, String validationRegex)
    {
        this.Name = name;
        this.ValidationRegex = new Regex(validationRegex ?? "^.*$");
    }
}
The data class:
public class AddressInfo
{
    [CsvColumnAttribute("SITE_ID", "^\\d+$")]
    public Int32 SiteId { get; set; }

    [CsvColumnAttribute("HOUSE", "^\\d+$")]
    public Int32 House { get; set; }

    [CsvColumnAttribute("STREET", "^[a-zA-Z0-9- ]+$")]
    public String Street { get; set; }

    [CsvColumnAttribute("CITY", "^[a-zA-Z0-9- ]+$")]
    public String City { get; set; }

    [CsvColumnAttribute("STATE", "^[a-zA-Z]{2}$")]
    public String State { get; set; }

    [CsvColumnAttribute("ZIP", "^\\d{1,5}$")]
    public Int32 Zip { get; set; }

    [CsvColumnAttribute("APARTMENT", "^\\d*$")]
    public Int32? Apartment { get; set; }
}
As you see, what I did here was link every property to a CSV column name, and give it a regex to validate the contents. For optional columns you can still supply a regex, but one that allows empty values, as shown for Apartment.
Now, to actually match the columns to the CSV headers, we need to get the properties of the AddressInfo class, check for each property whether it has a CsvColumnAttribute, and, if it does, match its name to the column headers of the CSV file data. Once we have that, we have a list of PropertyInfo objects, which can be used to dynamically fill in the properties of new objects created for all rows.
This method is completely generic: it allows the columns to appear in any order in the CSV file, and parsing will work for any class once you assign the CsvColumnAttribute to the properties you want filled in. It automatically validates the data, and you can handle failures however you want; in this code, all I do is skip invalid lines.
// Requires the System.Reflection, System.Text.RegularExpressions,
// System.ComponentModel and System.Linq namespaces.
public static List<T> ParseCsvInfo<T>(List<String[]> split) where T : new()
{
    // No template row, or only a template row but no data. Abort.
    if (split.Count < 2)
        return new List<T>();
    String[] templateRow = split[0];
    // Create a dictionary of column headers and their index in the file data.
    Dictionary<String, Int32> columnIndexing = new Dictionary<String, Int32>();
    for (Int32 i = 0; i < templateRow.Length; i++)
    {
        // ToUpperInvariant is optional, of course. You could have case-sensitive headers.
        String colHeader = templateRow[i].Trim().ToUpperInvariant();
        if (!columnIndexing.ContainsKey(colHeader))
            columnIndexing.Add(colHeader, i);
    }
    // Prepare the arrays of property parse info. We set the length
    // so the highest found column index exists in it.
    Int32 numCols = columnIndexing.Values.Max() + 1;
    // Actual property to fill in
    PropertyInfo[] properties = new PropertyInfo[numCols];
    // Regex to validate the string before parsing
    Regex[] propValidators = new Regex[numCols];
    // Type converters for automatic parsing
    TypeConverter[] propConverters = new TypeConverter[numCols];
    // Go over the properties of the given type, see which ones have a
    // CsvColumnAttribute, and put these in the list at their CSV index.
    foreach (PropertyInfo p in typeof(T).GetProperties())
    {
        Object[] attrs = p.GetCustomAttributes(true);
        foreach (Object attr in attrs)
        {
            CsvColumnAttribute csvAttr = attr as CsvColumnAttribute;
            if (csvAttr == null)
                continue;
            Int32 index;
            if (!columnIndexing.TryGetValue(csvAttr.Name.ToUpperInvariant(), out index))
            {
                // If no valid column is found, and the regex for this property
                // does not allow an empty value, then all lines are invalid.
                if (!csvAttr.ValidationRegex.IsMatch(String.Empty))
                    return new List<T>();
                // No valid column found: ignore this property.
                break;
            }
            properties[index] = p;
            propValidators[index] = csvAttr.ValidationRegex;
            // Automatic type converter. This function could be enhanced by giving a
            // list of custom converters as extra argument and checking those first.
            propConverters[index] = TypeDescriptor.GetConverter(p.PropertyType);
            break; // Only handle one CsvColumnAttribute per property.
        }
    }
    List<T> objList = new List<T>();
    // Start from 1, since the first line is the template with the column names.
    for (Int32 i = 1; i < split.Count; i++)
    {
        Boolean abortLine = false;
        String[] line = split[i];
        // Make a new object of the given type.
        T obj = new T();
        for (Int32 col = 0; col < properties.Length; col++)
        {
            // It is possible a line is not long enough to contain all columns.
            String curVal = col < line.Length ? line[col] : String.Empty;
            PropertyInfo prop = properties[col];
            // This can be null if the column was not found but wasn't required.
            if (prop == null)
                continue;
            // Check validity. Abort buildup of this object if not valid.
            Boolean valid = propValidators[col].IsMatch(curVal);
            if (!valid)
            {
                // Add logging here? We have the line and column index.
                abortLine = true;
                break;
            }
            // Automated parsing. Always use nullable types for nullable properties.
            Object value = propConverters[col].ConvertFromString(curVal);
            prop.SetValue(obj, value, null);
        }
        if (!abortLine)
            objList.Add(obj);
    }
    return objList;
}
To use this on your CSV file, simply do:
// the function using VB's TextFieldParser
List<String[]> splitData = SplitFile(datafile, new UTF8Encoding(false), ',');
// The above function, applied to the AddressInfo class
List<AddressInfo> addresses = ParseCsvInfo<AddressInfo>(splitData);
And that's it. Automatic parsing and validation, all through some added attributes on the class properties.
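As a quick sanity check with made-up data (and assuming the CsvColumnAttribute, AddressInfo and ParseCsvInfo code above is compiled together), you can also feed the parser in-memory rows directly:

```csharp
List<String[]> testData = new List<String[]>
{
    new String[] { "SITE_ID", "HOUSE", "STREET", "CITY", "STATE", "ZIP", "APARTMENT" },
    new String[] { "1", "42", "Main Street", "Springfield", "IL", "62701", "" },
    new String[] { "2", "bad", "Oak Avenue", "Columbus", "OH", "43004", "3" }
};
List<AddressInfo> addresses = ParseCsvInfo<AddressInfo>(testData);
// The second data row fails the HOUSE regex ("bad" is not numeric), so only
// the first row survives; its empty APARTMENT value is converted to null.
```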
Note, if splitting the data in advance would be too much of a performance hit for large files, that's not really a problem; the TextFieldParser works from a Stream wrapped in a TextReader, so instead of giving a List<String[]> you can just give a stream and do the CSV parsing on the fly inside the ParseCsvInfo function, reading one CSV line at a time directly from the TextFieldParser.
I didn't do that here because the original use case for which I wrote the reader to List<String[]> included automatic encoding detection, which required reading the whole file anyway.
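If you do want to go that route, a lazy line reader could look something like this (the names here are illustrative, not part of the code above); a streaming variant of ParseCsvInfo could then treat the first yielded row as the template row and process the rest in a foreach, keeping only one line in memory at a time:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvStreaming
{
    // Yields one split CSV line at a time instead of materializing
    // the whole file as a List<String[]>.
    public static IEnumerable<String[]> StreamCsvLines(TextReader reader, String delimiter)
    {
        using (TextFieldParser parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(delimiter);
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                yield return parser.ReadFields();
        }
    }
}
```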