Create 2D Array from CSV and obtain Word Count of Specified Column

Question

I have a CSV file that looks like this:

,Location_Code,Location_Desc,Type_Code,Fault_type,Prod_Number,Model,Causer,Auditor,Prio,Capture_Date,Steer,Engine,Country,Current shift number,VIN,Comment,Shift,Year,Fault location C_Code,Fault location C_Desc,Fault type C_Code,Fault type C_Desc,Comment R,Baumuster Sales desc.,Baumuster Technical desc.,T24
0,09122,Engine,42,Poor fit,7117215,W205,Final 3,"Plant 1, WSA",0,2019-04-05,1,83,705,T1220190404T0092,55SWF8DB7KU316971,,A,2019,,,,,,C 300,205 E20 G,
1,09122,Engine,42,Poor fit,7117235,W205,Final 3,"Plant 1, WSA",0,2019-04-05,1,83,705,T1220190404T0122,55SWF8DB2KU316991,,A,2019,,,,,,C 300,205 E20 G,
2,09122,Transmission,42,Poor fit,7117237,W205,Final 3,"Plant 1, WSA",0,2019-04-05,1,83,705,T1220190404T0126,55SWF8DB6KU316993,,A,2019,,,,,,C 300,205 E20 G,

I want to write code that gets the word count of words of a selected column header after tokenizing the words of the selected column (in dictionary style key-value pairs). I also want to keep the word count sorted by value in descending order. eg.

Location_Desc

Engine: 2

Transmission: 1

This is the code I have so far:

            int colNumber;
            for(colNumber=0; colNumber<columns.Length; colNumber++)
            {
                if ( columns[colNumber].Equals(columnHeader))
                {
                    break;
                }
            }

            Debug.WriteLine("Column Number: " + colNumber);
            for(int i=0; i<inputCsv.Length; i++)
            {
                string[] row = inputCsv[i].Split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
                string column = row[colNumber];
                Debug.WriteLine(row.ToString());
            }

I was able to get the column header name via a for loop, but not only am I unable to ignore commas inside of quotations, I was unable to get the values from the column header (also known as a Series in Python's Pandas).

Help is much appreciated!

There are a lot of libraries to read CSV files. For example, [CsvHelper](https://joshclose.github.io/CsvHelper/). Consider using it to read CSV file. It will save you a lot of time. — Iliar Turdushev, May 06 '20 at 05:29
`I want to write code that gets the word count of words of a selected column header after tokenizing the words of the selected column (in dictionary style key-value pairs).` I have a question related to this part of you problem. What is expected result for column `Fault_type` in your sample: **1.** one pair `("Poor fit": 3)` or **2.** two pairs `("Poor": 3)` and `("fit": 3)`? — Iliar Turdushev, May 06 '20 at 05:33

RoadRunner · Answer 1 · 2020-05-07T06:46:40.943

I would probably store your counts in Dictionary<string, Dictionary<string , long>>, rather than a 2D array. Then you can access each columns counts much easier.

Using the CsvHelper NuGet package, we can create a class to model your CSV file. The only thing we have to be careful here is choosing the correct datatypes for your columns. The data types I've chosen may not be the best for your situation. You can also find the The API Documentation here.

public class CsvModel
{
    [Name("")]
    public string RowNumber { get; set; }
    [Name("Location_Code")]
    public string LocationCode { get; set; }
    [Name("Location_Desc")]
    public string LocationDesc { get; set; }
    [Name("Type_Code")]
    public long TypeCode { get; set; }
    [Name("Fault_type")]
    public string FaultType { get; set; }
    [Name("Prod_Number")]
    public long ProdNumber { get; set; }
    public string Model { get; set; }
    public string Causer { get; set; }
    public string Auditor { get; set; }
    public long Prio { get; set; }
    [Name("Capture_Date")]
    public DateTime CaptureDate { get; set; }
    public long Steer { get; set; }
    public long Engine { get; set; }
    public long Country { get; set; }
    [Name("Current shift number")]
    public string CurrentShiftNumber { get; set; }
    public string VIN { get; set; }
    public string Comment { get; set; }
    public string Shift { get; set; }
    public long Year { get; set; }
    [Name("Fault location C_Code")]
    public string FaultLocationCCode { get; set; }
    [Name("Fault location C_Desc")]
    public string FaultLocationCDesk { get; set; }
    [Name("Fault type C_Code")]
    public string FaultTypeCCode { get; set; }
    [Name("Fault type C_Desc")]
    public string FaultTypeCDesc { get; set; }
    [Name("Comment R")]
    public string CommentR { get; set; }
    [Name("Baumuster Sales desc.")]
    public string BaumusterSalesDesc { get; set; }
    [Name("Baumuster Technical desc.")]
    public string BaumusterTechnicalDesc { get; set; }
    public string T24 { get; set; }
}

Then we can read the records into a IEnumerable<CsvMode> with GetRecords<T>:

var path = "C:\\data.csv";

using var reader = new StreamReader(path);

using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);

var records = csv.GetRecords<CsvModel>();

Then use reflection to get the column counts into a Dictionary<string, Dictionary<string , long>>:

var recordCounts = new Dictionary<string, Dictionary<string, long>>();

foreach (var record in records)
{
    var properties = record.GetType().GetProperties();

    foreach (var property in properties)
    {
        var propertyName = property.Name;
        if (!recordCounts.ContainsKey(propertyName))
        {
            recordCounts.Add(propertyName, new Dictionary<string, long>());
        }

        var propertyValue = property.GetValue(record, null);
        var propertyKey = propertyValue.ToString();
        if (propertyValue != null && !string.IsNullOrEmpty(propertyKey))
        {
            var count = recordCounts[propertyName].GetValueOrDefault(propertyKey, 0) + 1;
            recordCounts[propertyName][propertyKey] = count;
        }
    }
}

Then we can sort the column counts in descending order by creating a new dictionary with LINQ:

var sortedRecordCounts = recordCounts
    .ToDictionary(
        kvp => kvp.Key, 
        kvp => new SortedDictionary<string, long>(
            kvp.Value.OrderByDescending(kvp => kvp.Value)
                     .ToDictionary(
                         kvp => kvp.Key, 
                         kvp => kvp.Value)));

Which uses Enumerable.ToDictionary to create dictionaries(inner + outer), and sorts the counts in descending order with Enumerable.OrderByDescending.

We also use OrderedDictionary to guarantee sort order of the dictionary, since Dictionary ordering is not guaranteed.

This is also briefly mentioned in MSDN:

The order in which the items are returned is undefined.

We can then we can iterate this dictionary to display the record counts, which also indicates if no valid(empty or null) values were found:

foreach (var kvp in sortedRecordCounts)
{
    Console.WriteLine($"Column: {kvp.Key}");

    if (kvp.Value.Count == 0)
    {
        Console.WriteLine("No values found");
    }

    foreach (var value in kvp.Value)
    {
        Console.WriteLine($"Value: {value.Key}, Count: {value.Value}");
    }

    Console.WriteLine();
}

Output:

Column: RowNumber
Value: 0, Count: 1
Value: 1, Count: 1
Value: 2, Count: 1

Column: LocationCode
Value: 09122, Count: 3

Column: LocationDesc
Value: Engine, Count: 2
Value: Transmission, Count: 1

Column: TypeCode
Value: 42, Count: 3

Column: FaultType
Value: Poor fit, Count: 3

Column: ProdNumber
Value: 7117215, Count: 1
Value: 7117235, Count: 1
Value: 7117237, Count: 1

Column: Model
Value: W205, Count: 3

Column: Causer
Value: Final 3, Count: 3

Column: Auditor
Value: Plant 1, WSA, Count: 3

Column: Prio
Value: 0, Count: 3

Column: CaptureDate
Value: 5/04/2019 12:00:00 AM, Count: 3

Column: Steer
Value: 1, Count: 3

Column: Engine
Value: 83, Count: 3

Column: Country
Value: 705, Count: 3

Column: CurrentShiftNumber
Value: T1220190404T0092, Count: 1
Value: T1220190404T0122, Count: 1
Value: T1220190404T0126, Count: 1

Column: VIN
Value: 55SWF8DB7KU316971, Count: 1
Value: 55SWF8DB2KU316991, Count: 1
Value: 55SWF8DB6KU316993, Count: 1

Column: Comment
No values found

Column: Shift
Value: A, Count: 3

Column: Year
Value: 2019, Count: 3

Column: FaultLocationCCode
No values found

Column: FaultLocationCDesk
No values found

Column: FaultTypeCCode
No values found

Column: FaultTypeCDesc
No values found

Column: CommentR
No values found

Column: BaumusterSalesDesc
Value: C 300, Count: 3

Column: BaumusterTechnicalDesc
Value: 205 E20 G, Count: 3

Column: T24
No values found

Update

If you want to support multiple CSV files and not storing a class for columns(avoiding reflection as well), you could use a generic solution like this:

var path = "C:\\data.csv";

var recordCounts = new Dictionary<string, Dictionary<string, long>>();

using (var reader = new StreamReader(path))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    csv.Read();
    csv.ReadHeader();
    var headerRow = csv.Context.HeaderRecord;

    if (string.IsNullOrEmpty(headerRow[0]))
    {
        headerRow[0] = "RowNumber";
    }

    foreach (var header in headerRow)
    {
        recordCounts.Add(header, new Dictionary<string, long>());
    }

    while (csv.Read())
    {
        foreach (var header in headerRow)
        {
            var headerKey = header == "RowNumber" ? string.Empty : header;
            var columnValue = csv.GetField(headerKey);
            if (!string.IsNullOrEmpty(columnValue))
            {
                var count = recordCounts[header].GetValueOrDefault(columnValue, 0) + 1;
                recordCounts[header][columnValue] = count;
            }

        }
    }
}

Which uses the header reading method from What is the best way to get the list of column names using CsvHelper?, and uses the Reading by Hand method offered from the CsvHelper documentation. These resources and suggestions were helpfully suggested by @Iliar Turdushev in the comments.

You could then combine the above solution with the LINQ dictionary sorting query and printing code above to produce similar results.

Create 2D Array from CSV and obtain Word Count of Specified Column

1 Answers1

Update