6

I am trying to read CSV file that has cell with multiple rows inside it.

This is how the CSV looks like:
enter image description here

row 1, column 'Detail' has multiple lines.

When I am trying to read it using ReadLine() method:

private void buttonBrowse_Click(object sender, EventArgs e)
        {
            openFileDialog.Filter = "Excel Worksheets|*.csv";
            if (openFileDialog.ShowDialog() == DialogResult.OK)
            {
                String filePathAndName = openFileDialog.FileName;
                StreamReader reader = new StreamReader(filePathAndName);
                String line = reader.ReadLine();
                Console.WriteLine(line);
                do
                {
                     line = reader.ReadLine();
                     Console.WriteLine(line);
                } while (line != null);
            }
        }

it splits the cell with the multiple rows to number of rows:

[1]"Time of Day","Process Name","PID","Operation","Path","Result","Detail","Image Path"
[2]"22:52:24.2905182","notepad.exe","4828","Process Start","","SUCCESS","Parent PID: 2484, Command line: ""C:\Windows\system32\notepad.exe"" , Current directory: C:\Users\User\, Environment: 
[3];    =::=::\
[4];    ALLUSERSPROFILE=C:\ProgramData
[5];    APPDATA=C:\Users\User\AppData\Roaming
[6];    asl.log=Destination=file
[7];    CommonProgramFiles=C:\Program Files\Common Files
...

"22:52:24.2905201","notepad.exe","4828","Thread Create","","SUCCESS","Thread ID: 8008","C:\Windows\system32\notepad.exe"
"22:52:24.2915842","notepad.exe","4828","Load Image","C:\Windows\System32\notepad.exe","SUCCESS","Image Base: 0x6f0000, Image Size: 0x30000","C:\Windows\system32\notepad.exe"

in the above logs rows 2-7 should be one row.

I want to read it like powershell did it nicely here using import-csv function:
enter image description here

And you can easily pull data from specific cell by its row and column using the command (example):

$csvContent[0] |select -expand Detail

Example:
enter image description here

Ryan Kohn
  • 13,079
  • 14
  • 56
  • 81
E235
  • 11,560
  • 24
  • 91
  • 141

3 Answers3

8

Instead of manually reading in the lines, you can use a library like CsvHelper, which will remove a lot of the headache from parsing a csv.

Ryan Kohn
  • 13,079
  • 14
  • 56
  • 81
  • I came in to say this. There are so many unforeseen issues with csv parsing that simply trying to split on a comma (or whatever delimiter) isn't reliable. That's why people have created libraries like CsvHelper. – valverij Nov 07 '14 at 19:56
0

I know this is not a pretty way to do it, but it Works in my case:

lineCounter = 0;
while (!reader.EndOfStream)
{
     var line = reader.ReadLine();
     var values = line.Split(',');

     if(values.Length == 1)
     {
        list4[lineCounter-1] += values[0];
     }
     else
     {
          list1.Add(values[0]);
          list2.Add(values[1]);
          list3.Add(values[2]);
          list4.Add(values[3]);
          lineCounter++;
     }

}
0

.Net does not yet offer a standard library to read CSV.

Although the CSV specification is relatively simple, parsing a csv with multi-line data is not exactly trivial.

Some people "cheat" with Regular Expression, but then you need to read the whole file into string, since the regex cannot pull in more lines on demand, and you still need to detect and handle row breaks. And that is before we measure its performance, conformant, or your new problem.

The standard recommendation is to use a well-tested parser package. CsvHelper is pretty comprehensive, and I would suggest NReco.Csv if you just want to read raw data.


With that said, sometimes you may not prefer a package or have restricted options. Whatever the reason, I have written a csv parser in a few static methods that you can copy and paste into your project and get going.

Usage:

using ( var r = new StreamReader( filePath, Encoding.UTF8, true ) ) {
   while ( r.TryReadCsvRow( out var row ) ) {
      foreach ( string cell in row ) {
         // Your code here.
      }
   }
}

using ( var r = new StringReader( csvString ) ) {
   while ( r.TryReadCsvRow( out var row ) ) {
      string[] cells = row.ToArray();
      // `cells` is reusable and random-accessible
   }
}

Parser Code:

/**
 * <summary>Try read a csv row from a Reader.  May consume multiple lines.  Linebreaks in cells will become \n</summary>
 * <param name="source">Reader to get line data from.</param>
 * <param name="row">Cell data enumeration (forward-only), or null if no more rows.</param>
 * <param name="quoteBuffer">Thread-local buffer for quote parsing. If null, one will be created on demand.</param>
 * <returns>True on success, false on no more rows.</returns>
 * <see cref="StreamReader.ReadLine"/>
 */
public static bool TryReadCsvRow ( this TextReader source, out IEnumerable<string> row, StringBuilder quoteBuffer = null ) {
   row = ReadCsvRow( source, quoteBuffer );
   return row != null;
}

/**
 * <summary>Read a csv row from a Reader.  May consume multiple lines.  Linebreaks in cells will become \n</summary>
 * <param name="source">Reader to get line data from.</param>
 * <param name="quoteBuffer">Thread-local buffer for quote parsing. If null, one will be created on demand.</param>
 * <returns>Cell data enumeration (forward-only), or null if no more rows.</returns>
 * <see cref="StreamReader.ReadLine"/>
 */
public static IEnumerable<string> ReadCsvRow ( this TextReader source, StringBuilder quoteBuffer = null ) {
   var line = source.ReadLine();
   if ( line == null ) return null;
   return ReadCsvCells( source, line, quoteBuffer );
}

private static IEnumerable<string> ReadCsvCells ( TextReader source, string line, StringBuilder buf ) {
   for ( var pos = 0 ; line?.Length >= pos ; )
      yield return ReadCsvCell( source, ref line, ref pos, ref buf );
}

private static string ReadCsvCell ( TextReader source, ref string line, ref int pos, ref StringBuilder buf ) {
   var len = line.Length;
   if ( pos >= len ) { // EOL
      pos = len + 1;
      return "";
   }

   // Unquoted cell.
   if ( line[ pos ] != '"' ) {
      var end = line.IndexOf( ',', pos );
      var head = pos;
      // Last cell in this row.
      if ( end < 0 ) {
         pos = len + 1;
         return line.Substring( head );
      }
      // Empty cell.
      if ( end == pos ) {
         pos++;
         return "";
      }
      pos = end + 1;
      return line.Substring( head, end - head );
   }

   // Quoted cell.
   if ( buf == null )
      buf = new StringBuilder();
   else
      buf.Clear();
   var start = ++pos; // Drop opening quote.
   while ( true ) {
      var end = pos < len
         ? line.IndexOf( '"', pos )
         : -1;
      var next = end + 1;

      // End of line.  Append and read next line.
      if ( end < 0 ) {
         buf.Append( line, start, len - start );
         if ( ( line = source.ReadLine() ) == null )
            return buf.ToString();
         buf.Append( '\n' );
         start = pos = 0; len = line.Length;

       // End of cell.
      } else if ( next == len || line[ next ] == ',' ) {
         pos = end + 2;
         return buf.Append( line, start, end - start ).ToString();

      // Two double quotes.
      } else if ( line[ next ] == '"' ) {
         buf.Append( line, start, end - start + 1 );
         pos = start = end + 2;

      // One double quote not followed by EOL or comma.
      } else
         pos++;
   }
}

Pros

  • Low overhead, works with big files. (e.g. 800mb census data)
  • Works with all line breaks, parse quoted cells.
  • Optional buffer to improve quote parsing speed.
  • Thread safe, if buffer is not shared between threads. No locking.
  • No dependency. No Nuget. Works in all modern .Net.

Cons

  • All line breaks will be converted to \n.
  • Output is forward-only / use-once. Solve with ToArray or ToList.
  • Buffer, if provided, is not cleared after read.
Sheepy
  • 17,324
  • 4
  • 45
  • 69
  • 1
    Probably high performance but the code seems unreadable (and thus unmaintainable) to me. Reformatting might do the trick ;-) Also - why not put it in a class? – Markus Safar Dec 30 '21 at 11:06
  • @MarkusSafar I admit I don't understand. It is trivial for VS(C) to format my code to your preference, with you having total control over it. I know I eagerly format copied code to read more in less scroll. :D As for class, you know best where you want to place them in your code - you won't like how I abuse them with compiler flags. :) The aim is to minimise footprint, to have a feather-light code that you can paste and use anywhere, without any bells and whistles, class included. – Sheepy Jan 01 '22 at 16:52
  • I disagree because in my opinition creating a bunch of static methods reuires refactoring before using them. Creating a class reuqires just copying the class and using (instantiating) it. Also the compiler needts to compile the code anyway so why not paste it in a human readable and maintainable form. – Markus Safar Jan 01 '22 at 18:43
  • @MarkusSafar If I count as a human then it is human readable and maintainable. There is no universal style and VS have built-in code formatters if you prefer to scroll. The names should be meaningful. I love design patterns, but sometimes I want KISS. If you need extensible classes you would be better off using a package, one of which I have linked to in my post. I don't think NReco's code is simpler than mine, but it is more configurable and extensible. Just trying to provide an alternative to packages and string.Split one-liners in every other csv answers. – Sheepy Jan 04 '22 at 05:34
  • Just because you and I can read your code does not mean it is easily readable in general. In my opinion the goal should not be to just "throw a bunch of methods somewhere" but to create easily readable and understandable code. If code can easily be read and thus understood it will most likely be also maintainable code. – Markus Safar Jan 04 '22 at 06:51
  • 1
    Of course everyone can reformat what you've pasted here but the question is why you are forcing the work to the reader? Why not provide a good answer that is "ready to be taken" instead of providing something that is not? Or would you paste this code into your OOP project as it is now? – Markus Safar Jan 04 '22 at 06:52
  • 1
    There's a lot here — notes about dependencies and Nuget, numberless claims of high performance and low overhead, CSV-writing capabilities — that wasn't a concern in the question, and yet there's no real mention of how this goes about actually answering the question other than an implied "Just copy-and-paste this code." If you trim all that extra stuff away this becomes, in my opinion, a code-only answer. I also agree with @MarkusSafar about the formatting, that what seems to be an aversion to newlines makes this harder to read and follow. – Lance U. Matthews Jan 05 '22 at 22:39
  • @LanceU.Matthews Noted and amended, with performance and writer removed, code reformatted. For what it's worth, when tested on 817mb census (no quotes) and 513mb l10n (multiline), about -60% time vs CSVHelper, +10% time vs NReco.csv. File loaded to memory, 1 warmup + 5 runs, loop count cell count and sum data length. – Sheepy Jan 06 '22 at 10:11
  • @MarkusSafar it is the reverse. The code is copied verbatim from a [project](https://github.com/Sheep-y/Automodchef/blob/main/ZyMod.cs), oop in the sense that most classes are either abstract or super. Yes, I eat my own sheep food. But I consider this code mostly done, so I yield. No, while I use this spacious format at work, it is not more maintainable to me. Anyway, hope this helps resolve the style disagreement for happier humans. – Sheepy Jan 06 '22 at 10:53