0

I have a csv file which has a header row and 2 data rows but from source sometimes we get data rows split into multiple rows which in below case makes it 3 rows.

enter image description here

So my SSIS package gets failed when processing this csv file through C# code as it treats "ETF" line as a new row. I am currently using below code to add all of the data to an arraylist.

            string sLine;
            ArrayList arrText = new ArrayList();
            StreamReader objReader = new StreamReader(filepath);

            do
            {
                sLine = objReader.ReadLine();
                if (sLine != null)
                    arrText.Add(sLine);
            }
            while (sLine != null);

I am new to C#, so can anyone please suggest how we can handle this sort of scenario so that my code can treat this file 2 have only 2 rows instead of 3 and load it into database.

Himsy
  • 57
  • 6
  • Have you tried a simple text replace? – gunr2171 Jan 05 '21 at 14:34
  • 2
    It looks like you want to replace any CRLF which is inside double-quote characters `"`. That's easy enough (a regex is probably the way to go), but what happens if a `"` appears within a field, and how is that escaped? That might mess up your logic. – canton7 Jan 05 '21 at 14:37
  • SSIS if you use built in import there is an option for Number of lines to skip. Or mark your file as text qualified (not sure how C# to do this) but text qualified files should ignore anything between the qualifiers, at least in SSIS data flow tasks, though not 100% if it ignores line breaks. – Brad Jan 05 '21 at 14:40
  • @canton7 I will say that a correctly formed Regex to parse a CSV is quite complex, especially if you consider multiline text (text split in at least 3 lines) and escaped doublequotes (""). I would parse the text with a correctly-done csv library, replace the newline with space, and rewrite the csv – xanatos Jan 05 '21 at 14:51
  • @canton7 - Yes I need to replace any CRLF which is inside double-quote characters.. but even if i use replace to do that, the next task would be to have "ETF" which is in 3rd line to be a part of 2nd line so that C# knows there are only 2 rows and imports that into DB and doesn't fail – Himsy Jan 05 '21 at 14:51
  • @xanatos Agreed, which is why I explicitly mentioned that escaped double quotes will add significant complexity – canton7 Jan 05 '21 at 14:52

3 Answers3

0

Use Microsoft.VisualBasic.FileIO.TextFieldParser to read the csv file or refer this or this

Sarang M K
  • 261
  • 3
  • 9
0

Now... I don't feel this is a very good idea. A good idea would be to use one of the CSV libraries for .net, load the file, replace the newlines with space, re-write the file as csv.

This isn't a good idea. This is a regex-idea.

string csv = File.ReadAllText("sample.csv");

string separator = "#";
string escapedDoubleQuotes = "\"\"";
var rx = new Regex($@"(?<=^ *|{Regex.Escape(separator)} *)""({Regex.Escape(escapedDoubleQuotes)}|[^""])*""", RegexOptions.Multiline);
var replacer = new Regex(" *[\r\n]+ *");

string csv2 = rx.Replace(csv, x => replacer.Replace(x.Value, " "));

File.WriteAllText("output.csv", csv2);

It will replace sequences of \r and/or \n (plus optional spaces around) with a single space. The regex will try to find double-quoted strings in the csv, "extract" them and then do some replacing (using a second regex, called replacer) with a single space. There is even support for escaped double quotes. By default to insert a double quote " in a value you'll have to "" double it.

xanatos
  • 109,618
  • 12
  • 197
  • 280
  • Thanks for the code. I tried to your code in C# 2015 and placed a debugger over it to check values but SSIS is not letting me debug values, instead just completing the script task. Could it be a string interpolation issue i.e. using $ in above regex and if yes, can you please update your script – Himsy Jan 06 '21 at 14:52
0

Assuming each line starts with a date and no other field is a date, you could simply read all the text as one string and remove each \r\n that isn't immediately followed by a date:

string csv = File.ReadAllText(filename);
csv = Regex.Replace(csv, "\r\n(?!(\"[0-9]+/))", "$1");
arrtext.AddRange(csv.Split("\r\n"));

This is not a complete pattern for the given date format, but works for the example.

YallowRvn
  • 31
  • 4