I'm parsing CSV data using Microsoft.VisualBasic.FileIO.TextFieldParser. It's very good compared to the freeware CSV libraries I've found, and it does everything I think it should with respect to CSV, with one exception: it does not preserve the leading/trailing spaces of a field that is enclosed in quotes. It does preserve them if I set TrimWhiteSpace to false, but then it no longer trims the spaces from fields that are not enclosed in quotes. For CSV I want it to trim non-quoted fields and leave quoted fields untrimmed.

This is how I'm using the class:

  var parser = new TextFieldParser(textReader) {Delimiters = new[] {","}};
  //TrimWhiteSpace is true by default
  var row1 = parser.ReadFields();
  var row2 = parser.ReadFields();

Consider this data:

 1 , 2 
" 1 ", " 2 "

For TrimWhiteSpace==true, both row1 and row2 are ["1", "2"]. For TrimWhiteSpace==false, both row1 and row2 are [" 1 ", " 2 "].

What I want is row1==["1", "2"] and row2==[" 1 ", " 2 "].

steve
  • I read the docs and searched the web (which I consider to go without saying for this site). I tried various combinations of code using the library as I described. What are you getting at? That you don't think it's a good question? – steve Dec 15 '15 at 20:21
  • stumbled across this whilst searching for the same answer. Have flagged @EngineerDollery's last comment for removal as it violates the code of conduct here at SO, something I'd have hoped they'd have realised based on them providing "advice" to steve... – Sk93 Aug 15 '18 at 14:17

1 Answer

Although I'm quite late to answer, I found the question interesting and upvoted it, because IMO it's surprising that there's no built-in way to keep whitespace under the described conditions.

So, assuming the same input as the question, plus an added line to show that an escaped double quote (a double quote immediately followed by another) is also kept:

1 , 2 
" 1 ", " 2 "
" a ""quoted"" word ", " hello world "

Set HasFieldsEnclosedInQuotes to false, and handle any field that is enclosed in quotes by stripping the quotes with a simple Regex:

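using System;
using System.IO;
using System.Text.RegularExpressions;
using Microsoft.VisualBasic.FileIO;

var separator = new string('=', 40);
Console.WriteLine(separator);
// demo only - show the input lines read from a text file
// (inputPath is assumed to hold the path to the CSV file)
var text = File.ReadAllText(inputPath);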
var lines = text.Split(
    new string[] { Environment.NewLine }, 
    StringSplitOptions.None
);

using (var textReader = new StringReader(text))
{
    using (var parser = new TextFieldParser(textReader))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.TrimWhiteSpace = true;
        parser.HasFieldsEnclosedInQuotes = false;
        // remove double quotes, since HasFieldsEnclosedInQuotes is false
        var regex = new Regex(@"
        # match double quote 
        \""    
        # if not immediately followed by a double quote
        (?!\"")
        ",
            RegexOptions.IgnorePatternWhitespace
        );

        var rowStart = 0;
        while (!parser.EndOfData)
        {
            Console.WriteLine(
                "row {0}: {1}", parser.LineNumber, lines[rowStart]
            );
            var fields = parser.ReadFields();
            for (int i = 0; i < fields.Length; ++i)
            {
                Console.WriteLine(
                    "parsed field[{0}] = [{1}]", i,
                    regex.Replace(fields[i], "")
                );
            }
            ++rowStart;
            Console.WriteLine(separator);
        }
    }
}

OUTPUT:

========================================
row 1: 1 , 2
parsed field[0] = [1]
parsed field[1] = [2]
========================================
row 2: " 1 ", " 2 "
parsed field[0] = [ 1 ]
parsed field[1] = [ 2 ]
========================================
row 3: " a ""quoted"" word ", " hello world "
parsed field[0] = [ a "quoted" word ]
parsed field[1] = [ hello world ]
========================================
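
Note that with HasFieldsEnclosedInQuotes set to false, a comma inside a quoted field is still treated as a delimiter. If that matters, one option is to skip TextFieldParser for the splitting step and split each line by hand. The following is only a minimal sketch under some assumptions: CsvLine and Split are hypothetical names (not part of TextFieldParser), the delimiter is a comma, and each record fits on a single line (no line breaks inside quoted fields). Unquoted fields are trimmed, quoted fields keep their inner whitespace, and "" inside quotes is read as a literal double quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of TextFieldParser.
static class CsvLine
{
    // Splits one single-line CSV record on commas.
    // Unquoted fields are trimmed; quoted fields are returned as-is.
    public static string[] Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;   // currently between an opening and closing quote
        bool wasQuoted = false;  // the current field started with a quote

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);          // commas and spaces are kept
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');        // "" is an escaped quote
                    i++;
                }
                else
                {
                    inQuotes = false;           // closing quote
                }
            }
            else if (c == '"' && !wasQuoted
                     && string.IsNullOrWhiteSpace(current.ToString()))
            {
                current.Clear();                // drop whitespace before the opening quote
                inQuotes = true;
                wasQuoted = true;
            }
            else if (c == ',')
            {
                fields.Add(wasQuoted ? current.ToString() : current.ToString().Trim());
                current.Clear();
                wasQuoted = false;
            }
            else if (!wasQuoted || !char.IsWhiteSpace(c))
            {
                current.Append(c);              // whitespace after a closing quote is skipped
            }
        }

        // Add the last field of the record.
        fields.Add(wasQuoted ? current.ToString() : current.ToString().Trim());
        return fields.ToArray();
    }
}
```

With this sketch, CsvLine.Split("\"a,b\",c") returns two fields, a,b and c, and the three input lines above produce the same field values as the output shown. Malformed input (for example, a quote in the middle of an unquoted field) is handled leniently rather than rejected, which may or may not be what you want.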
kuujinbo
  • Thanks for the effort. But, with HasFieldsEnclosedInQuotes=false the parser does not ignore a comma in a field. For example, "a,b",c results in [a] [b] [c] but should be [a,b] [c]. – steve Feb 19 '16 at 16:41
  • I'm pretty sure that there is no way to do what I want with the parser ... that the parser has a fatal flaw (bug). But, I thought I'd ask around anyway. Thanks for trying. – steve Feb 19 '16 at 16:59