2

I'm writing the following method in C# to parse a CSV file and write values to a SQL Server database.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

    public List<Entity> ParseEntityExclusionFile(List<string> entries, string urlFile)
    {
      entries.RemoveAt(0);
      List<Entity> entities = new List<Entity>();
      foreach (string line in entries)
      {
        Entity exclusionEntity = new Entity();
        string[] lineParts = line.Split(',').Select(p => p.Trim('\"')).ToArray();
        exclusionEntity.Id = 1000;
        exclusionEntity.Names = lineParts[3];
        exclusionEntity.Identifier = $"{lineParts[25]}" + $" | " + $"Classification: " + ${lineParts[0]}";

        entities.Add(exclusionEntity);

      }
      return entities;
    }

The data in some columns of the csv file are comma-separated values inside a set of parentheses, meant to represent one value for that column. So for any values like that, I need to capture it as one value to go into one field in the database. How would I adjust/add-to the line of code string[] lineParts = line.Split(',').Select(p => p.Trim('\"')).ToArray(); to instruct the application that if it encounters a column with open parenthesis, capture all the data after the open parenthesis, including the commas, until the close parenthesis, all as one value?

EDIT: it seems the Select(p => p.Trim('\"')).ToArray(); part of the above line of code is confusing some folks - don't worry about that part - I just need to know how I would go about adding 'exception' code to create a condition where Split(',') is ignored if the commas happen to be in between a set of parentheses. One field in the csv file looks like this (1,2,3,4) - currently the code parses it as four fields, whereas I need that parsed as one field like 1,2,3,4 OR (1,2,3,4) it actually doesn't matter whether the resulting fields contain the parentheses or not.

EDIT 2: I appreciate the suggestions of using a .NET CSV library - however, everything is working perfectly in this project outside of this one field in the csv file containing a set of parentheses with comma-separated values inside - I feel as though it's a bit overkill to install and configure an entire library, including having to set up new models and properties, just for this one column of data.

Stpete111
  • 3,109
  • 4
  • 34
  • 74

2 Answers2

2

Try this code:

public static class Ex
{
    private static string Peek(this string source, int peek) => (source == null || peek < 0) ? null : source.Substring(0, source.Length < peek ? source.Length : peek);
    private static (string, string) Pop(this string source, int pop) => (source == null || pop < 0) ? (null, source) : (source.Substring(0, source.Length < pop ? source.Length : pop), source.Length < pop ? String.Empty : source.Substring(pop));

    public static string[] ParseCsvLine(this string line)
    {
        return ParseCsvLineImpl(line).ToArray();
        IEnumerable<string> ParseCsvLineImpl(string l)
        {
            string remainder = line;
            string field;
            while (remainder.Peek(1) != "")
            {
                (field, remainder) = ParseField(remainder);
                yield return field;
            }
        }
    }

    private const string DQ = "\"";

    private static (string field, string remainder) ParseField(string line)
    {
        if (line.Peek(1) == DQ)
        {
            var (_, split) = line.Pop(1);
            return ParseFieldQuoted(split);
        }
        else
        {
            var field = "";
            var (head, tail) = line.Pop(1);
            while (head != "," && head != "")
            {
                field += head;
                (head, tail) = tail.Pop(1);
            }
            return (field, tail);
        }
    }

    private static (string field, string remainder) ParseFieldQuoted(string line)
    {
        var field = "";
        var head = "";
        var tail = line;
        while (tail.Peek(1) != "" && (tail.Peek(1) != DQ || tail.Peek(2) == DQ + DQ))
        {
            if (tail.Peek(2) == DQ + DQ)
            {
                (head, tail) = tail.Pop(2);
                field += DQ;
            }
            else
            {
                (head, tail) = tail.Pop(1);
                field += head;
            }
        }
        if (tail.Peek(2) == DQ + ",")
        {
            (head, tail) = tail.Pop(2);
        }
        else if (tail.Peek(1) == DQ)
        {
            (head, tail) = tail.Pop(1);
        }
        return (field, tail);
    }
}

It handles double-quotes, and double-double-quotes.

You can then do this:

string line = "45,\"23\"\",34\",66"; // 45,"23"",34",66
string[] fields = line.ParseCsvLine();

That produces:

45 
23",34 
66 

Here's an updated version of my code that deals with ( and ) as delimiters. It deals with nested delimiters and treats them as part of the field string.

You would need to remove the " as you see fit - I'm not entirely sure why you are doing this.

Also, this is no longer CSV. The parenthesis are not a normal part of CSV. I've changed the name of the method to ParseLine as a result.

public static class Ex
{
    private static string Peek(this string source, int peek) => (source == null || peek < 0) ? null : source.Substring(0, source.Length < peek ? source.Length : peek);
    private static (string, string) Pop(this string source, int pop) => (source == null || pop < 0) ? (null, source) : (source.Substring(0, source.Length < pop ? source.Length : pop), source.Length < pop ? String.Empty : source.Substring(pop));

    public static string[] ParseLine(this string line)
    {
        return ParseLineImpl(line).ToArray();
        IEnumerable<string> ParseLineImpl(string l)
        {
            string remainder = line;
            string field;
            while (remainder.Peek(1) != "")
            {
                (field, remainder) = ParseField(remainder);
                yield return field;
            }
        }
    }

    private const string GroupOpen = "(";
    private const string GroupClose = ")";

    private static (string field, string remainder) ParseField(string line)
    {
        if (line.Peek(1) == GroupOpen)
        {
            var (_, split) = line.Pop(1);
            return ParseFieldQuoted(split);
        }
        else
        {
            var field = "";
            var (head, tail) = line.Pop(1);
            while (head != "," && head != "")
            {
                field += head;
                (head, tail) = tail.Pop(1);
            }
            return (field, tail);
        }
    }

    private static (string field, string remainder) ParseFieldQuoted(string line) => ParseFieldQuoted(line, false);

    private static (string field, string remainder) ParseFieldQuoted(string line, bool isNested)
    {
        var field = "";
        var head = "";
        var tail = line;
        while (tail.Peek(1) != "" && tail.Peek(1) != GroupClose)
        {
            if (tail.Peek(1) == GroupOpen)
            {
                (head, tail) = tail.Pop(1);
                (head, tail) = ParseFieldQuoted(tail, true);
                field += GroupOpen + head + GroupClose;
            }
            else
            {
                (head, tail) = tail.Pop(1);
                field += head;
            }
        }
        if (tail.Peek(2) == GroupClose + ",")
        {
            (head, tail) = tail.Pop(isNested ? 1 : 2);
        }
        else if (tail.Peek(1) == GroupClose)
        {
            (head, tail) = tail.Pop(1);
        }
        return (field, tail);
    }
}

It's used like this:

string line = "45,(23(Fo(,,(,)),(\"Bar\")o),34),66"; // 45,(23(Fo(,,(,)),("Bar")o),34),66
string[] fields = line.ParseLine();
Console.WriteLine(fields.All(f => line.Contains(f))); // True == maybe code is right, False == code is WRONG

And it gives me:

45 
23(Fo(,,(,)),("Bar")o),34 
66 
Enigmativity
  • 113,464
  • 11
  • 89
  • 172
  • Based on your comments above and this answer, it doesn't seem you understand what I'm asking for. I know what a parenthesis is, and I know what double quotes are. Thanks for your answer but I'm trying to parse fields with comma-separated values inside a set of parentheses as one value - not sure what you're doing above. – Stpete111 Feb 28 '20 at 13:32
  • Please see the Edit to my post. – Stpete111 Feb 28 '20 at 13:37
  • please accept my sincere apologies: your code actually works almost perfectly to what I need. I was very confused by some of your comments above so I assumed you didn't understand what I was asking. Your code does in fact address what I'm trying to achieve with the values inside the parentheses. The only thing issue with it is when I use `string[] fields = line.ParseCsvLine();` instead of `string[] fields = line.Split(',').Select(p => p.Trim('\"')).ToArray();`, the double quotes are not getting removed from the fields that contain them. Any thoughts on that? – Stpete111 Feb 28 '20 at 18:46
  • Never mind, all I had to do is run it as `string[] fields = line.ParseCsvLine().Select(p => p.Trim('\"')).ToArray();` and now my results are exactly what I need. Do you want to add this to your answer so I can mark it as answer? – Stpete111 Feb 28 '20 at 18:51
  • @Stpete111 - My apologies. The `\"` in your question threw me along with there not being any actual parenthesis in your code. – Enigmativity Feb 28 '20 at 22:51
  • @Stpete111 - So, how does my code work perfectly? It doesn't group by parenthesis. Have you modified it to make it work? – Enigmativity Feb 28 '20 at 22:52
  • @Stpete111 - I've updated the answer with code that should group on the parenthesis nicely. Let me know what you think. – Enigmativity Feb 28 '20 at 23:31
1

First, calling line.Trim('\"') will not strip "any existing double-quotes"; it will only remove all leading and trailing instances of the '\"' char.

var line = "\"\"example \"goes here\"";
var trimmed = line.Trim('\"');
Console.WriteLine(trimmed); //output: example "goes here

Here's how you strip all of the '\"' char:

var line = "\"\"example \"goes here\"";
var trimmed = string.Join(string.Empty, line.Split('"'));
Console.WriteLine(trimmed); //output: example goes here

Notice you can also nix the escape because the " is inside of single quotes.

I'm making an assumption that what your string inputs look like this:

"OneValue,TwoValue,(OneB,TwoB),FiveValue"

or if you have quotes (I'm also assuming you won't actually have quotes inside, but we'll solve for that anyway:

"\"OneValue,TwoValue,(OneB,TwoB),FiveValue\"\""

And I'm expecting your final string[] lineparts variable to have the values in this hard declaration after processing:

var lineparts = new string[] { "OneValue", "TwoValue", "OneB, TwoB", "FiveValue" };

The first solution I can think of is to first split by '(', then iterate over the collection, conditionally splitting by ')' or ',', depending on which side of the opening parenthesis the current element is on. Pretty sure this is linear, so that's neat:

const string l = ",(";
const string r = "),";
const char c = ',';
const char a = '"';

var line = "\"One,Two,(OneB,TwoB),Five\"";
line = string.Join(string.Empty, line.Split(a)); //Strip "

var splitL = line.Split(l); //,(
var partsList = new List<string>();
foreach (var value in splitL)
{
    if (value.Contains(r))//),
    {
        //inside of parentheses, so we keep the values before the ),
        var splitR = value.Split(r);//),
        //I don't like literal indexes, but we know we have at least one element because we have a value.
        partsList.Add(splitR[0]);
        //Everything else is after the closing parenthesis for this group, and before the parenthesis after that
        //so we'll parse it all into different values.
        //The literal index is safe here because split always returns two values if any value is found.
        partsList.AddRange(splitR[1].Split(c));//,
    }
    else
    {
        //before the parentheses, so these are all different values
        partsList.AddRange(value.Split(c));//,
    }
}

var lineparts = partsList.ToArray();//{ "One", "Two", "OneB, TwoB", "Five" };

Here's a better example of a tighter integration with the code in your question, not considering the specific intended values of your Entity properties or the need to trim for quotations:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public List<Entity> ParseEntityExclusionFile(List<string> entries, string urlFile)
{
    entries.RemoveAt(0);

    const char l = '(';
    const char r = ')';
    const char c = ',';
    const char a = '"';

    List<Entity> entities = new List<Entity>();
    foreach (string line in entries)
    {
        var splitL = line.Split(l); //(
        var partsList = new List<string>();
        foreach (var value in splitL)
        {
            if (value.Contains(r))//)
            {
                var splitR = value.Split(r);//)
                partsList.Add(splitR[0]);
                if (!line.EndsWith(r))
                {
                    partsList.AddRange(splitR[1].Remove(0, 1).Split(c));//,
                }
            }
            else
            {
                if (!line.StartsWith(l))
                {
                    partsList.AddRange(value.Remove(value.Length - 1).Split(c));//,
                }
            }
        }

        var lineParts = partsList.ToArray();//{ "One", "Two", "OneB, TwoB", "Five" };
        entities.Add(new Entity
        {
            Id = 1000,
            Names = lineParts[3],
            Identifier = $"{lineParts[25]} | Classification: {lineParts[0]}";
        });
    }
    return entities;
  }

This solution may get hairy if your groups contain other groups, i.e...

"OneValue,TwoValue,(OneB,(TwoB, ThreeB)),SixValue"
Kyllian Mobley
  • 374
  • 3
  • 10
  • The OP says "some columns of the csv file are comma-separated values inside a set of parentheses, meant to represent one value for that column" - that means in your text `"OneValue,TwoValue,(OneB,TwoB),FiveValue"` the output should be `"OneValue"`, `"TwoValue"`, `"(OneB,TwoB)"`, `"FiveValue"`. – Enigmativity Feb 28 '20 at 03:46
  • The OP also doesn't know the difference between a parenthesis and double-quotes. I'm pretty certain the input should be `"OneValue,TwoValue,\"OneB,TwoB\",FiveValue"`. That's then a valid CSV format. – Enigmativity Feb 28 '20 at 03:47
  • @Enigmativity You're probably right on all those points. I didn't know I could comment on my own post for clarification :) I'm gonna leave my answer as-is until OP clarifies. I think my solution is easy enough to amend considering your first point, i.e. "(OneB, TwoB)" rather than "OneB, TwoB". I think I would just interpolate the group and surround it with parentheses. But yeah, I think OP would definitely be better served with a third-party (or first-party?) CSV parser. – Kyllian Mobley Feb 28 '20 at 04:26
  • 1
    @Stpete111 Please clarify for certain whether you _actually_ need parentheses or really just quotes parsed. – Kyllian Mobley Feb 28 '20 at 04:29
  • 1
    @KyllianMobely please do not take Enigmativity comments into account, I honestly have no idea what he's talking about and he doesn't understand my post based on his comments and his answer. To answer your question, I do NOT need the parentheses in the results. Thank you for taking the time to post your solution. I'm going to look at it now. – Stpete111 Feb 28 '20 at 13:28
  • Please see the Edit to my post. – Stpete111 Feb 28 '20 at 13:37
  • 1
    @KyllianMobley I've updated my post - the results can be with or without the parentheses, it doesn't really matter - whatever is the least amount of code I suppose. – Stpete111 Feb 28 '20 at 13:53
  • 1
    @Stpete111 I think my answer should work for you considering you are looking for groups inside parentheses. If you don't care about stripping apostrophes, you can ignore the line in my answer with string.Join(string.Empty, line.Split(a)). – Kyllian Mobley Feb 28 '20 at 13:59
  • 1
    @Stpete111 I can amend my answer to fit more directly inside the code in your question if you need. – Kyllian Mobley Feb 28 '20 at 14:04
  • Thanks Kyllian, that would be great! – Stpete111 Feb 28 '20 at 14:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/208717/discussion-between-kyllian-mobley-and-stpete111). – Kyllian Mobley Feb 28 '20 at 15:29