1

I can't be the first person to have this issue but hours of searching Stack revealed nothing close to an answer. I have an SSIS script that works over a directory of csv files. This script folds, bends and mutilates these files; performs queries, data cleansing, persists some data and finally outputs a small set to csv file that is ingested by another system.

One of the files has a free text field that contains the value: "20,000 BONUS POINTS". This one field, in a file of 10k rows, one of dozens of similar files, is the problem that I can't seem to solve.

Be advised: I'm weak on both C# and Regex.

Sample csv set:

4121,6383,0,,,TRUE
4122,6384,0,"20,000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE

I found plenty of information on how to parse this using a variety of Regex patterns but what I've noticed is the StreamReader.ReadLine() method wraps the complete line with double quotes:

"4121,6383,0,,,TRUE"

such that the output of the regex Replace method:

s = Regex.Replace(line, @"[^\""]([^\""])*[^\""]",
m => m.Value.Replace(",", ""));

looks like this:

412163830TRUE

and the target line that actually contains a double quote delimited string ends up looking like:

"412263840\"20000 BONUS POINTS\"TRUE"

My entire method (for your reading pleasure) is this:

        string fileDirectory = "C:\\tmp\\Unzip\\";
        string fullPath = "C:\\tmp\\Unzip\\test.csv";
        string line = "";
        //int count=0;
        List<string> list = new List<string>();

            try
        {
            //MessageBox.Show("inside Try Block");
            string s = null;
            StreamReader infile = new StreamReader(fullPath);
            StreamWriter outfile = new StreamWriter(Path.Combine(fileDirectory, "output.csv"));
            while ((line = infile.ReadLine()) != null)
            {
                //line.Substring(0,1).Substring(line.Length-1, 1);
                System.Console.WriteLine(line);
                Console.WriteLine(line);
                line = 
                s = Regex.Replace(line, @"[^\""]([^\""])*[^\""]",
                                        m => m.Value.Replace(",", ""));
                System.Console.WriteLine(s);
                list.Add(s);

            }
            foreach (string item in list)
            {
                outfile.WriteLine(item);
            };
            infile.Close();
            outfile.Close();
            //System.Console.WriteLine("There were {0} lines.", count);
        }

        catch (Exception e)
        {

            Console.WriteLine(e.Message);
        }

        //another addition for TFS consumption

    }

Thanks for reading and if you have a useful answer, bless you and your prodigy for generations to come!

mfc

EDIT: The requirement is a valid csv file output. In the case of the test data, it would look like this: 4121,6383,0,,,TRUE
4122,6384,0,"20000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE

mf.cummings
  • 303
  • 6
  • 20
  • 1
    Does it have to use regex? – Han Sep 24 '18 at 22:43
  • 1
    `but what I've noticed is the StreamReader.ReadLine() method wraps the complete line with double quotes`... No it doesn't. it might look that way in the debugger but a string is a string is a string and no quotes are "added" to the string using `ReadLine` – Sani Huttunen Sep 24 '18 at 22:44
  • What makes you think that ReadLine wraps the line in double quotes? Are you referring to the quotes you might see within the debugger or watch windows? Also... Why do you need to remove it? SSIS can be configured with text qualifierers, so you could specify a double quote as the specifier and it should be able to read the field properly... The qualifiers don't even need to be around every field – pinkfloydx33 Sep 24 '18 at 22:45
  • I don't have to use regex, I'm open to other solutions. I'm using Visual Studio 2017 Pro and when I walk through the code line by line, observing both the Console.Writeline output as well as the Locals parameter output, I observed the lines wrapped with the double quotes. I should have added that the code lives as a Script Task, I'm just using VS as both a learning experience and for the Breakpoint, Step-Into features. – mf.cummings Sep 24 '18 at 22:47
  • 3
    There are a bunch of CSV parsing libraries out there, **use them**. They have already solved these issues – maccettura Sep 24 '18 at 22:58
  • This `@"[^\""]([^\""])*[^\""]"` is not what you want. Try this Regex.Replace( line, @"""([^""\\\]*(?:\\\[\S\s][^""\\\]*)*)""", m => m.Value.Replace(",", "") ); –  Sep 24 '18 at 23:39
  • 2
    [Stop writing your own CSV parser](http://www.secretgeek.net/csv_trouble); the format is more complicated than you realize. There are plenty of working [CSV parsers](https://stackoverflow.com/questions/2081418/) ready for you to drop in to your application. – Dour High Arch Sep 24 '18 at 23:43

2 Answers2

2

I recommend using a CSV reader lib like others have suggested.

Install-Package LumenWorksCsvReader

https://github.com/phatcher/CsvReader#getting-started

However, if you just want to try something fast and dirty. Give this a try.

If I understand correctly. You need to remove commas between double quotes within each line of a CSV file. This should do that.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string pattern = @"([""'])(?:(?=(\\?))\2.)*?\1";
        List<string> lines = new List<string>(); 
        lines.Add("4121,6383,0,,,TRUE");
        lines.Add("4122,6384,0,\"20,000 BONUS POINTS\",,TRUE");
        lines.Add("4123,6385,,,,");
        lines.Add("4124,6386,0,,,TRUE");
        lines.Add("4125,6387,0,,,TRUE");
        lines.Add("4126,6388,0,,,TRUE");
        lines.Add("4127,6389,0,,,TRUE");
        lines.Add("4128,6390,0,,,TRUE");

        StringBuilder sb = new StringBuilder();
        foreach (var line in lines)
        {
            sb.Append(Regex.Replace(line, pattern, m => m.Value.Replace(",", ""))+"\n");
        }           
        Console.WriteLine(sb.ToString());
    }
}

OUTPUT

4121,6383,0,,,TRUE
4122,6384,0,"20000 BONUS POINTS",,TRUE
4123,6385,,,,
4124,6386,0,,,TRUE
4125,6387,0,,,TRUE
4126,6388,0,,,TRUE
4127,6389,0,,,TRUE
4128,6390,0,,,TRUE

https://dotnetfiddle.net/flmWG3

fourwhey
  • 490
  • 4
  • 19
  • Sorry for the delay. This works. Thanks! I'm in a secure environment, can't use 3rd party libs without a lot of approvals from InfoSec on down so I'm often looking for creative ways to solve problems when limited to existing tools. – mf.cummings Oct 01 '18 at 17:10
  • I understand. I'm glad to have helped. – fourwhey Oct 01 '18 at 17:15
  • @mf, `TextFieldParser` comes preinstalled with the .NET Framework, you're not allowed to use that? – Dour High Arch Oct 04 '18 at 19:47
  • @DourHighArch Maybe there's some reason that kept him from being able to reference Microsoft.VisualBasic namespace? – fourwhey Oct 04 '18 at 21:28
1

I haven't tried with numerous lines, but this would be my first approach:

namespace ConsoleTestApplication
{
    class Program
    {
        static void Main(string[] args)
        {
            var before = "4122,6384,0,\"20,000 BONUS POINTS\",,TRUE";
            var pattern = @"""[^""]*""";
            var after = Regex.Replace(before, pattern, match => match.Value.Replace(",", ""));
            Console.WriteLine(after);
        }
    }
}
Michael Staples
  • 537
  • 7
  • 13