4

I have a requirement where a user will upload a CSV file in the format below, containing around 1.8 to 2 million records:

SITE_ID,HOUSE,STREET,CITY,STATE,ZIP,APARTMENT  
44,545395,PORT ROYAL,CORPUS CHRISTI,TX,78418,2  
44,608646,TEXAS AVE,ODESSA,TX,79762,  
44,487460,EVERHART RD,CORPUS CHRISTI,TX,78413,  
44,275543,EDWARD GARY,SAN MARCOS,TX,78666,4  
44,136811,MAGNOLIA AVE,SAN ANTONIO,TX,78212  

What I have to do is first validate the file and then save it to the database only if it validates successfully with no errors. The validations I have to apply are different for each column. For example:

SITE_ID: integer only, required  
HOUSE: integer, required  
STREET: alphanumeric, required  
CITY: alphabetic only, required  
STATE: exactly 2 letters, required  
ZIP: 5 digits only, required  
APARTMENT: integer only, optional  

I need a generic way of applying these validations to their respective columns. What I have tried so far is converting the CSV file to a DataTable, with the plan to validate each cell through regex, but this doesn't seem like a generic or good solution to me. Can anyone help me in this regard and point me in the right direction?

Danny_ds
Qurat
  • I would create my own class with the properties of the CSV, and then custom attributes to specify the validation of each property (so I can change or extend them later), or validate them during the mapping method. – Nekeniehl Feb 14 '18 at 13:47
  • Can the following answer help you? https://stackoverflow.com/a/16608967/4222487 – FaizanHussainRabbani Feb 14 '18 at 14:00
  • If this file has fixed structure (same columns every time) - why do you need generic approach? – Evk Feb 14 '18 at 14:03
  • See my code at following posting : https://stackoverflow.com/questions/30129406/reading-data-from-csv-to-screen-output – jdweng Feb 14 '18 at 14:08
  • @Evk generic approach is useful if there are multiple types of CSV files, each with their own defined specs. My answer covers that completely. – Nyerguds Feb 15 '18 at 21:21

4 Answers

2

Here is one efficient method:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Data.OleDb;
using System.Text.RegularExpressions;
using System.IO;


namespace ConsoleApplication23
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.csv";
        static void Main(string[] args)
        {
            CSVReader csvReader = new CSVReader();
            DataSet ds = csvReader.ReadCSVFile(FILENAME, true);
            RegexCompare compare = new RegexCompare();
            DataTable errors = compare.Get_Error_Rows(ds.Tables[0]);
        }
    }
    class RegexCompare
    {
        public static Dictionary<string, RegexCompare> dict = new Dictionary<string, RegexCompare>() {
               { "SITE_ID", new RegexCompare() { columnName = "SITE_ID", pattern = @"[^\d]+", positiveNegative = false, required = true}},
               { "HOUSE", new RegexCompare() { columnName = "HOUSE", pattern = @"[^\d]+", positiveNegative = false, required = true}},
               { "STREET", new RegexCompare() { columnName = "STREET", pattern = @"[A-Za-z0-9 ]+", positiveNegative = true, required = true}},
               { "CITY", new RegexCompare() { columnName = "CITY", pattern = @"[A-Za-z ]+", positiveNegative = true, required = true}},
               { "STATE", new RegexCompare() { columnName = "STATE", pattern = @"[A-Za-z]{2}", positiveNegative = true, required = true}},
               { "ZIP", new RegexCompare() { columnName = "ZIP", pattern = @"\d{5}", positiveNegative = true, required = true}},
               { "APARTMENT", new RegexCompare() { columnName = "APARTMENT", pattern = @"\d*", positiveNegative = true, required = false}},
            };


        string columnName { get; set; }
        string pattern { get; set; }
        // true: the entire value must match the pattern; false: any match of the pattern marks the value invalid
        Boolean positiveNegative { get; set; }
        Boolean required { get; set; }

        public DataTable Get_Error_Rows(DataTable dt)
        {
            DataTable dtError = null;
            foreach (DataRow row in dt.AsEnumerable())
            {
                Boolean error = false;
                foreach (DataColumn col in dt.Columns)
                {
                    RegexCompare regexCompare = dict[col.ColumnName];
                    object colValue = row.Field<object>(col.ColumnName);
                    if (regexCompare.required)
                    {
                        // treat both null and empty/whitespace values as missing
                        if (colValue == null || colValue.ToString().Trim().Length == 0)
                        {
                            error = true;
                            break;
                        }
                    }
                    else
                    {
                        if (colValue == null)
                            continue;
                    }
                    string colValueStr = colValue.ToString();
                    Match match = Regex.Match(colValueStr, regexCompare.pattern);
                    if (regexCompare.positiveNegative)
                    {
                        if (!match.Success)
                        {
                            error = true;
                            break;
                        }
                        if (colValueStr.Length != match.Value.Length)
                        {
                            error = true;
                            break;
                        }
                    }
                    else
                    {
                        if (match.Success)
                        {
                            error = true;
                            break;
                        }
                    }

                }

                if(error)
                {
                    if (dtError == null) dtError = dt.Clone();
                    dtError.Rows.Add(row.ItemArray);
                }
            }
            return dtError;
        }
    }

    public class CSVReader
    {

        public DataSet ReadCSVFile(string fullPath, bool headerRow)
        {

            string path = fullPath.Substring(0, fullPath.LastIndexOf("\\") + 1);
            string filename = fullPath.Substring(fullPath.LastIndexOf("\\") + 1);
            DataSet ds = new DataSet();

            try
            {
                if (File.Exists(fullPath))
                {
                    // Note: the Jet provider is 32-bit only; on 64-bit processes use Microsoft.ACE.OLEDB.12.0 instead.
                    string ConStr = string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=\"Text;HDR={1};FMT=Delimited\"", path, headerRow ? "Yes" : "No");
                    string SQL = string.Format("SELECT * FROM [{0}]", filename);
                    OleDbDataAdapter adapter = new OleDbDataAdapter(SQL, ConStr);
                    adapter.Fill(ds, "TextFile");
                    ds.Tables[0].TableName = "Table1";

                    // OLE DB can return column names containing spaces; normalize them.
                    foreach (DataColumn col in ds.Tables["Table1"].Columns)
                    {
                        col.ColumnName = col.ColumnName.Replace(" ", "_");
                    }
                }
            }

            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
            return ds;
        }
    }

}
jdweng
  • So what is `CSVReader`? What namespace/package is it from? – Nyerguds Feb 15 '18 at 19:08
  • I got `System.Data` referenced and `System.Data.OleDb` in my `using`, but it doesn't know any such class. In fact the namespace doesn't contain anything not starting with "OleDb". Which .Net version is that? – Nyerguds Feb 15 '18 at 20:26
  • I just noticed some bugs that I fixed. I had to rearrange some of the If/Else statements. – jdweng Feb 15 '18 at 20:42
  • Try from menu : Project : Add Reference : Net: System.Data. If there is a System.Data.Oledb add that. Some versions of VS adds different number of libraries to keep memory usage low. Loading the Data library will add all the classes. – jdweng Feb 15 '18 at 20:45
  • Yeah, no, there's really nothing there. The class does not exist on MSDN either. Check its namespace, please. – Nyerguds Feb 15 '18 at 21:23
  • Here is the source at MSDN and is part of Net 4.7.1 : http://referencesource.microsoft.com/#System.Data/System/Data/OleDb/OleDbConnection.cs,2301726da0e82e9c. – jdweng Feb 15 '18 at 21:36
  • Why are you even linking that? **There is not a single mention of a CSVReader there.** Hover over the CSVReader in your code in Visual Studio, and see what namespace it is _really_ in. – Nyerguds Feb 15 '18 at 21:51
  • It is my own code based on OleDb. The class is shown in the posting. Didn't you even look at my code? – jdweng Feb 16 '18 at 00:33
  • Ohh, the CSVReader class is _in the code_. I see. Why did you say it was in OleDb then? :-\ – Nyerguds Feb 16 '18 at 00:42
2

Here's a rather overengineered but really fun generic method, where you give attributes to your class to match them to CSV column headers:

First step is to parse your CSV. There are a variety of methods out there, but my favourite is the TextFieldParser that can be found in the Microsoft.VisualBasic.FileIO namespace. The advantage of using this is that it's 100% native; all you need to do is add Microsoft.VisualBasic to the references.

Having done that, you have the data as List<String[]>. Now, things get interesting. See, now we can create a custom attribute and add it to our class properties:

The attribute class:

[AttributeUsage(AttributeTargets.Property)]
public sealed class CsvColumnAttribute : System.Attribute
{
    public String Name { get; private set; }
    public Regex ValidationRegex { get; private set; }

    public CsvColumnAttribute(String name) : this(name, null) { }

    public CsvColumnAttribute(String name, String validationRegex)
    {
        this.Name = name;
        this.ValidationRegex = new Regex(validationRegex ?? "^.*$");
    }
}

The data class:

public class AddressInfo
{
    [CsvColumnAttribute("SITE_ID", "^\\d+$")]
    public Int32 SiteId { get; set; }

    [CsvColumnAttribute("HOUSE", "^\\d+$")]
    public Int32 House { get; set; }

    [CsvColumnAttribute("STREET", "^[a-zA-Z0-9- ]+$")]
    public String Street { get; set; }

    [CsvColumnAttribute("CITY", "^[a-zA-Z0-9- ]+$")]
    public String City { get; set; }

    [CsvColumnAttribute("STATE", "^[a-zA-Z]{2}$")]
    public String State { get; set; }

    [CsvColumnAttribute("ZIP", "^\\d{1,5}$")]
    public Int32 Zip { get; set; }

    [CsvColumnAttribute("APARTMENT", "^\\d*$")]
    public Int32? Apartment { get; set; }
}

As you see, what I did here was link every property to a CSV column name, and give it a regex to validate the contents. On non-required stuff, you can still do regexes, but ones that allow empty values, as shown in the Apartment one.

Now, to actually match the columns to the CSV headers, we need to get the properties of the AddressInfo class, check each property for a CsvColumnAttribute, and, if it has one, match its name to the column headers of the CSV file data. Once we have that, we get a list of PropertyInfo objects, which can be used to dynamically fill in the properties of new objects created for each row.

This method is completely generic, allows giving the columns in any order in the CSV file, and parsing will work for any class once you assign the CsvColumnAttribute to the properties you want to fill in. It will automatically validate the data, and you can handle failures however you want. In this code, all I do is skip invalid lines, though.

public static List<T> ParseCsvInfo<T>(List<String[]> split) where T : new()
{
    // No template row, or only a template row but no data. Abort.
    if (split.Count < 2)
        return new List<T>();
    String[] templateRow = split[0];
    // Create a dictionary of rows and their index in the file data.
    Dictionary<String, Int32> columnIndexing = new Dictionary<String, Int32>();
    for (Int32 i = 0; i < templateRow.Length; i++)
    {
        // ToUpperInvariant is optional, of course. You could have case sensitive headers.
        String colHeader = templateRow[i].Trim().ToUpperInvariant();
        if (!columnIndexing.ContainsKey(colHeader))
            columnIndexing.Add(colHeader, i);
    }
    // Prepare the arrays of property parse info. We set the length
    // so the highest found column index exists in it.
    Int32 numCols = columnIndexing.Values.Max() + 1;
    // Actual property to fill in
    PropertyInfo[] properties = new PropertyInfo[numCols];
    // Regex to validate the string before parsing
    Regex[] propValidators = new Regex[numCols];
    // Type converters for automatic parsing
    TypeConverter[] propconverters = new TypeConverter[numCols];
    // go over the properties of the given type, see which ones have a
    // CsvColumnAttribute, and put these in the list at their CSV index.
    foreach (PropertyInfo p in typeof(T).GetProperties())
    {
        object[] attrs = p.GetCustomAttributes(true);
        foreach (Object attr in attrs)
        {
            CsvColumnAttribute csvAttr = attr as CsvColumnAttribute;
            if (csvAttr == null)
                continue;
            Int32 index;
            if (!columnIndexing.TryGetValue(csvAttr.Name.ToUpperInvariant(), out index))
            {
                // If no valid column is found, and the regex for this property
                // does not allow an empty value, then all lines are invalid.
                if (!csvAttr.ValidationRegex.IsMatch(String.Empty))
                    return new List<T>();
                // No valid column found: ignore this property.
                break;
            }
            properties[index] = p;
            propValidators[index] = csvAttr.ValidationRegex;
            // Automatic type converter. This function could be enhanced by giving a
            // list of custom converters as extra argument and checking those first.
            propconverters[index] = TypeDescriptor.GetConverter(p.PropertyType);
            break; // Only handle one CsvColumnAttribute per property.
        }
    }
    List<T> objList = new List<T>();
    // start from 1 since the first line is the template with the column names
    for (Int32 i = 1; i < split.Count; i++)
    {
        Boolean abortLine = false;
        String[] line = split[i];
        // make new object of the given type
        T obj = new T();
        for (Int32 col = 0; col < properties.Length; col++)
        {
            // It is possible a line is not long enough to contain all columns.
            String curVal = col < line.Length ? line[col] : String.Empty;
            PropertyInfo prop = properties[col];
            // this can be null if the column was not found but wasn't required.
            if (prop == null)
                continue;
            // check validity. Abort buildup of this object if not valid.
            Boolean valid = propValidators[col].IsMatch(curVal);
            if (!valid)
            {
                // Add logging here? We have the line and column index.
                abortLine = true;
                break;
            }
            // Automated parsing. Always use nullable types for nullable properties.
            Object value = propconverters[col].ConvertFromString(curVal);
            prop.SetValue(obj, value, null);
        }
        if (!abortLine)
            objList.Add(obj);
    }
    return objList;
}

To use on your CSV file, simply do

// the function using VB's TextFieldParser
List<String[]> splitData = SplitFile(datafile, new UTF8Encoding(false), ',');
// The above function, applied to the AddressInfo class
List<AddressInfo> addresses = ParseCsvInfo<AddressInfo>(splitData);
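The SplitFile helper referenced above isn't shown in the answer; a minimal sketch of what it could look like, built on TextFieldParser (the name SplitFile and its exact signature are assumptions inferred from the usage line), might be:

```csharp
// Hypothetical sketch of the SplitFile helper used above, based on
// Microsoft.VisualBasic.FileIO.TextFieldParser (add a reference to Microsoft.VisualBasic).
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.VisualBasic.FileIO;

public static class CsvSplitter
{
    public static List<String[]> SplitFile(String path, Encoding encoding, Char separator)
    {
        List<String[]> rows = new List<String[]>();
        using (TextFieldParser parser = new TextFieldParser(path, encoding))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(separator.ToString());
            parser.HasFieldsEnclosedInQuotes = true; // correctly handles quoted fields containing commas
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }
}
```

TextFieldParser also accepts a TextReader, so the same helper could wrap an in-memory string or a stream rather than a file path.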

And that's it. Automatic parsing and validation, all through some added attributes on the class properties.

Note, if splitting the data in advance would give too much of a performance hit for large data, that's not really a problem; the TextFieldParser works from a Stream wrapped in a TextReader, so instead of giving a List<String[]> you can just give a stream and do the csv parsing on the fly inside the ParseCsvInfo function, simply reading per CSV line directly from the TextFieldParser.

I didn't do that here because the original use case for csv reading for which I wrote the reader to List<String[]> included automatic encoding detection, which required reading the whole file anyway.

Nyerguds
1

I would suggest using a CSV library to read the file.
For example, you can use LumenWorksCsvReader: https://www.nuget.org/packages/LumenWorksCsvReader

Your approach with regex validation is actually OK. For example, you could create a "validation dictionary" and check every CSV value against its regex expression.

Then you can build a function that validates a CSV file with such a "validation dictionary".

See here:

string lsInput = @"SITE_ID,HOUSE,STREET,CITY,STATE,ZIP,APARTMENT
44,545395,PORT ROYAL,CORPUS CHRISTI,TX,78418,2
44,608646,TEXAS AVE,ODESSA,TX,79762,
44,487460,EVERHART RD,CORPUS CHRISTI,TX,78413,
44,275543,EDWARD GARY,SAN MARCOS,TX,78666,4
44,136811,MAGNOLIA AVE,SAN ANTONIO,TX,78212";

Dictionary<string, string> loValidations = new Dictionary<string, string>();
loValidations.Add("SITE_ID", @"^\d+$"); //it can only be an integer and it is required.
//....

bool lbValid = true;
using (CsvReader loCsvReader = new CsvReader(new StringReader(lsInput), true, ','))
{
    while (loCsvReader.ReadNextRecord())
    {
        foreach (var loValidationEntry in loValidations)
        {
            if (!Regex.IsMatch(loCsvReader[loValidationEntry.Key], loValidationEntry.Value))
            {
                lbValid = false;
                break;
            }
        }
        if (!lbValid)
            break;
    }
}
Console.WriteLine($"Valid: {lbValid}");
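The snippet above stops at the first failure; for a file this size you may want to report every failing cell instead. A sketch along the same lines, still assuming the LumenWorksCsvReader package (the CsvValidator name and the error-message format are illustrative):

```csharp
// Sketch: collect all validation failures (row, column, value) rather than
// aborting on the first one. Uses the LumenWorks CsvReader from the answer above.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;
using LumenWorks.Framework.IO.Csv;

public static class CsvValidator
{
    public static List<String> Validate(TextReader input, Dictionary<String, String> rules)
    {
        List<String> errors = new List<String>();
        using (CsvReader reader = new CsvReader(input, true, ','))
        {
            Int64 rowIndex = 0;
            while (reader.ReadNextRecord())
            {
                rowIndex++;
                foreach (KeyValuePair<String, String> rule in rules)
                {
                    String value = reader[rule.Key]; // access the field by header name
                    if (!Regex.IsMatch(value, rule.Value))
                        errors.Add(String.Format("Row {0}, column {1}: '{2}' is invalid",
                                                 rowIndex, rule.Key, value));
                }
            }
        }
        return errors;
    }
}
```

An empty result list means the file passed, so the save-to-database step can be gated on `errors.Count == 0`.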
PinBack
  • There is a CSV library natively available in .Net, actually. They just hid it pretty well, in the VB namespace. – Nyerguds Feb 26 '18 at 21:38
0

Here is another way to accomplish your needs using Cinchoo ETL - an open source file helper library.

First, define a POCO class with DataAnnotations validation attributes, as below:

public class Site
{
    [Required(ErrorMessage = "SiteID can't be null")]
    public int SiteID { get; set; }
    [Required]
    public int House { get; set; }
    [Required]
    public string Street { get; set; }
    [Required]
    [RegularExpression("^[a-zA-Z][a-zA-Z ]*$")]
    public string City { get; set; }
    [Required(ErrorMessage = "State is required")]
    [RegularExpression("^[A-Z][A-Z]$", ErrorMessage = "Incorrect state code.")]
    public string State { get; set; }
    [Required]
    [RegularExpression("^[0-9]{5}$")]
    public string Zip { get; set; }
    public int? Apartment { get; set; }
}

Then use this class with ChoCSVReader to load the file and check its validity using the Validate()/IsValid() methods, as below:

using (var p = new ChoCSVReader<Site>("*** YOUR CSV FILE PATH ***")
    .WithFirstLineHeader(true)
    )
{
    Exception ex;
    Console.WriteLine("IsValid: " + p.IsValid(out ex));
}

Hope it helps.

Disclaimer: I'm the author of this library.

Cinchoo