2

I have a CSV file that has rows resembling this:

1,  4,     2, "PUBLIC, JOHN Q" ,ACTIVE , 1332

I am looking for a regular expression replacement that will match against these rows and spit out something resembling this:

1,4,2,"PUBLIC, JOHN Q",ACTIVE,1332

I thought this would be rather easy: I made the expression ([ \t]+,) and replaced it with ,. I made a complement expression (,[ \t]+) with a replacement of , and I thought I had achieved a good means of right-trimming and left-trimming strings.

...but then I noticed that my "PUBLIC, JOHN Q" was now "PUBLIC,JOHN Q" which isn't what I wanted. (Note the space following the comma is now gone).

What would be the appropriate expression to trim the white space before and after a comma, but leave quoted text untouched?

UPDATE

To clarify, I am using an application to handle the file. This application allows me to define multiple regular expression replacements; it does not provide a parsing capability. While this may not be the ideal mechanism for this, it would sure beat making another application for this one file.

Jeremy Holovacs
  • 22,480
  • 33
  • 117
  • 254
  • 6
    Why don't you parse? – Bart Friederichs Oct 21 '13 at 15:27
  • 6
    Use a CSV parser - don't try to reinvent the wheel! Check out [Microsoft.VisualBasic.FileIO.TextFieldParser](http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx). – Tim Oct 21 '13 at 15:28
  • I would parse the file and build it back together. No need for RegEx which I don't funny understand. – Black Frog Oct 21 '13 at 15:28
  • This [answer](http://stackoverflow.com/a/4829787/2145211) looks good. You could split it using a CSV parser then join it back up – Harrison Oct 21 '13 at 15:29
  • I am using a previously built application that allows for regular expression transforms. Parsing would require a custom built transform that I would rather avoid building at this point. – Jeremy Holovacs Oct 21 '13 at 15:29
  • 8
    You will have a monstrosity of a RegEx, as it'll have to account for detecting whether it is between quotes, and whether those quotes have been escaped or not. As the other commenters say: parse, then rebuild without whitespace. – CodeCaster Oct 21 '13 at 15:29
  • 1
    Probably wrong tool for the job: http://programmers.stackexchange.com/questions/166454/can-the-csv-format-be-defined-by-a-regex – Bart Friederichs Oct 21 '13 at 15:31
  • Unfortunately, that is not really an option for me. A monstrosity regex is better than creating a new application for this one file. – Jeremy Holovacs Oct 21 '13 at 15:34
  • This is a .NET / C# question, right? So you are writing C# code... how would a "new application" be necessary for a parser-based solution, while using regex would work? – Tomalak Oct 21 '13 at 15:37
  • @JeremyHolovacs If you don't want a programming answer you shouldn't have tagged it with C# and .Net. – Ralf Oct 21 '13 at 15:37
  • 1
    @Ralf I tagged it with C# and .Net because it has a different flavor of regex than perl/ php/ java/ javascript; I just wanted to clarify. The application I am using is an internal app written in C#. – Jeremy Holovacs Oct 21 '13 at 15:41

5 Answers5

2

If the engine used by your tool is the C# regular expression engine, then you can try the following expression:

(?<!,\s*"(?:[^\\"]|\\")*)\s+(?!(?:[^\\"]|\\")*"\s*,)

replace with empty string.

The guys answers assumed the quotes are balanced and used counting to determine if the space is part of a quoted value or not.

My expression looks for all spaces that are not part of a quoted value.

RegexHero Demo

Ibrahim Najjar
  • 19,178
  • 4
  • 69
  • 95
  • Hi @Sniffer what does (?<!, mean? (I am relatively new to regular expression and don't know the meaning of < and ! in a group (the ? stands for non capturing group right?) – lucacerone Feb 25 '15 at 09:53
  • @lucacerone it's been a while since I worked with regular expressions so you need to check C# documentation to be sure, but I remember this is a negative lookbehind. – Ibrahim Najjar Feb 25 '15 at 11:58
1

Going with some CSV library or parsing the file yourself would be much more easier, and IMO should be preferable option here.

But if you really insist on a regex, you can use this one:

"\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)"

And replace it with empty string - ""

This regex matches one or more whitespaces, followed by an even number of quotes. This will of course work only if you have balanced quote.

(?x)       # Ignore Whitespace
\s+        # One or more whitespace characters       
(?=        # Followed by 
   (          # A group - This group captures even number of quotes
     [^\"]*     # Zero or more non-quote characters
     \"         # A quote
     [^\"]*     # Zero or more non-quote characters
     \"         # A quote 
   )*         # Zero or more repetition of previous group
   [^\"]*     # Zero or more non-quote characters
   $          # Till the end
)          # Look-ahead end
Rohit Jain
  • 209,639
  • 45
  • 409
  • 525
1

Something like this might do the job:

(?<!(^[^"]*"[^"]*(("[^"]*){2})*))[\t ]*,[ \t]*

Which matches [\t ]*,[ \t]*, only when not preceded by an odd number of quotes.

femtoRgon
  • 32,893
  • 7
  • 60
  • 87
0
        string format(string val)
        {
            if (val.StartsWith("\"")) val = " " + val;
            string[] vals = val.Split('\"');
            for (int i = 0; i < vals.Length; i += 2) vals[i] = vals[i].Replace(" ", "").Replace("\t", "");
            return string.Join("\t", vals);
        }

This will work if you have properly closed quoted strings in between

Abdul Saleem
  • 10,098
  • 5
  • 45
  • 45
0

Forget the regex (See Bart's comment on the question, regular expressions aren't suitable for CSV).

public static string ReduceSpaces( string input )
{
    char[] a = input.ToCharArray();
    int placeComma = 0, placeOther = 0;
    bool inQuotes = false;
    bool followedComma = true;
    foreach( char c in a ) {
        inQuotes ^= (c == '\"');
        if (c == ' ') {
            if (!followedComma)
                a[placeOther++] = c;
        }
        else if (c == ',') {
            a[placeComma++] = c;
            placeOther = placeComma;
            followedComma = true;
        }
        else {
            a[placeOther++] = c;
            placeComma = placeOther;
            followedComma = false;
        }
    }
    return new String(a, 0, placeComma);
}

Demo: http://ideone.com/NEKm09

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720