3

How can I remove a comma (,) that sits between two double quotation marks (")? For example, given "a","b","c","d,d","e","f", the comma inside "d,d" should be removed so that the result is "a","b","c","dd","e","f". How can I do this with a regex in C#?

EDIT: I forgot to specify that there may be more than one comma between the quotes, as in "a","b","c","d,d,d","e","f"; that regex does not work for this case. There can be any number of commas between the quotes.

The string may also look like a,b,c,"d,d",e,f, in which case the result should be a,b,c,dd,e,f. For a,b,c,"d,d,d",e,f the result should be a,b,c,ddd,e,f.

Alan Moore
Harikrishna

6 Answers

11

Assuming the input is as simple as your examples (i.e., not full-fledged CSV data), this should do it:

string input = @"a,b,c,""d,d,d"",e,f,""g,g"",h";
Console.WriteLine(input);

string result = Regex.Replace(input,
    @",(?=[^""]*""(?:[^""]*""[^""]*"")*[^""]*$)",
    String.Empty);
Console.WriteLine(result);

output:

a,b,c,"d,d,d",e,f,"g,g",h
a,b,c,"ddd",e,f,"gg",h

The regex matches any comma that is followed by an odd number of quotation marks.
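To make the lookahead easier to read, here is the same pattern laid out with comments. This is only a sketch: the class and method names are made up for illustration, and it relies on RegexOptions.IgnorePatternWhitespace so the pattern can span several lines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedCommaDemo
{
    // Same pattern as above, spread out with comments (names are hypothetical).
    private static readonly Regex CommaInsideQuotes = new Regex(@"
        ,                                # a comma...
        (?=                              # ...that is followed by
            [^""]* ""                    #   one quotation mark,
            (?: [^""]* "" [^""]* "" )*   #   then zero or more further pairs,
            [^""]* $                     #   then no more quotation marks up to the end
        )",
        RegexOptions.IgnorePatternWhitespace);

    public static string RemoveQuotedCommas(string input)
    {
        return CommaInsideQuotes.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Same input as the example above.
        Console.WriteLine(RemoveQuotedCommas("a,b,c,\"d,d,d\",e,f,\"g,g\",h"));
    }
}
```

One quotation mark plus any number of pairs is an odd count, which can only happen when the comma sits inside a quoted field (assuming the quotes in the string are balanced).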


EDIT: If fields are quoted with apostrophes (') instead of quotation marks ("), the technique is exactly the same--except you don't have to escape the quotes:

string input = @"a,b,c,'d,d,d',e,f,'g,g',h";
Console.WriteLine(input);

string result = Regex.Replace(input,
    @",(?=[^']*'(?:[^']*'[^']*')*[^']*$)",
    String.Empty);
Console.WriteLine(result);

If some fields were quoted with apostrophes while others were quoted with quotation marks, a different approach would be needed.


EDIT: Probably should have mentioned this in the previous edit, but you can combine those two regexes into one regex that will handle either apostrophes or quotation marks (but not both):

@",(?=[^']*'(?:[^']*'[^']*')*[^']*$|[^""]*""(?:[^""]*""[^""]*"")*[^""]*$)"

Actually, it will handle simple strings like 'a,a',"b,b". The problem is that there would be nothing to stop you from using one of the quote characters in a quoted field of the other type, like '9" Nails' (sic) or "Kelly's Heroes". That's taking us into full-fledged CSV territory (if not beyond), and we've already established that we're not going there. :D
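If you really did have to cope with both quote characters in one string, one possible "different approach" is a small hand-written scanner rather than a regex. The following is a rough sketch, not a CSV parser: the names are hypothetical, and it assumes that a quote character seen outside a quoted field always opens a field, and that the field is closed by the next occurrence of that same character.

```csharp
using System;
using System.Text;

public static class MixedQuoteScanner
{
    public static string RemoveQuotedCommas(string input)
    {
        var sb = new StringBuilder(input.Length);
        char open = '\0'; // '\0' means we are not inside a quoted field

        foreach (char c in input)
        {
            if (open == '\0')
            {
                // Outside a field: either quote character may open one.
                if (c == '"' || c == '\'') open = c;
                sb.Append(c);
            }
            else if (c == open)
            {
                // The same character that opened the field closes it.
                open = '\0';
                sb.Append(c);
            }
            else if (c != ',')
            {
                // Inside a field: keep everything except commas.
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(RemoveQuotedCommas("'9\" Nails, live',\"Kelly's Heroes, 1970\",x,y"));
    }
}
```

An apostrophe in an unquoted field (a,b's,c) would still confuse this, which is exactly the kind of ambiguity described above.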

Alan Moore
  • @Alan Moore, Great sir.. It works for all the options..Thank you..It's really great. – Harikrishna Mar 05 '11 at 07:25
  • @Alan Moore, if the fields are quoted with ' instead of ", can we do anything? For example 'a','b','c','d,d','e','f' and a,b,c,'d,d',e,f – Harikrishna Mar 05 '11 at 07:41
  • @Alan Moore, OK. But can there be one common regex for both, like 'a','b','c','d,d','e','f' and "a","b","c","d,d","e","f"? – Harikrishna Mar 05 '11 at 09:02
  • 1
    @Harikrishna - That probably isn't a good idea, and may not be possible, as it makes the problem ambiguous. What if you have the string `'1,2'` as input? If `'` is your quote character, your expected output is `'12'`. If `"` is your quote character, your expected output is `'1` and `2'`. The regex cannot *know* which one is correct - it's up to you to choose the right one. – Kobi Mar 05 '11 at 09:20
  • @Kobi, No I have need of same output for both of string like if there is "1,2" then "12" and if there is '1,2' then also '12'. – Harikrishna Mar 05 '11 at 09:23
  • @Harikrishna - so your string cannot contain `"` or `'`? What about `1,'2',"3"` - is that the same as `"1","'2'","3"`, the same as `'1','2','"3"'`, or invalid? Can you please edit the question to include all of these requirements? – Kobi Mar 05 '11 at 09:24
  • @Kobi, Like string can be any of them : It can be `"a","b","c","d,d","e","f"` then result should be `"a","b","c","dd","e","f"` and if `'a','b','c','d,d','e','f'` is a string then result should be `'a','b','c','dd','e','f'`. – Harikrishna Mar 05 '11 at 09:32
  • @Alan Moore, it is now solved and works perfectly using both of the regexes from your answer. Thank you very much for your great support. – Harikrishna Mar 05 '11 at 09:47
  • @Alan Moore, if the input is @"a,b,c,'d,e,f,g,h,i,j,k", then the regex @",(?=[^']*'(?:[^']*'[^']*')*[^']*$)" does not work correctly: it deletes the commas between a and d even though they are not enclosed in quotes. What should the regex be? – Harikrishna Jun 30 '11 at 07:42
  • @Hari: You mean, if there's only one quote in the whole string? Or three quotes, five quotes, etc.? You would need to check that condition before you start the replacements. For example, `Regex.IsMatch(input, @"^(?:[^']*'[^']*')+[^']*$")` will tell you if the number of quotes in the string is even and there are at least two of them. – Alan Moore Jun 30 '11 at 09:31
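Putting the check from the last comment together with the replacement might look like the sketch below. The wrapper class and method names are hypothetical, and returning the input unchanged when the quotes are unbalanced is an assumption; you may prefer to throw or log instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GuardedCommaRemoval
{
    public static string RemoveQuotedCommas(string input)
    {
        // Only proceed if the string contains an even, non-zero number of apostrophes.
        if (!Regex.IsMatch(input, @"^(?:[^']*'[^']*')+[^']*$"))
        {
            return input; // assumption: leave unbalanced or unquoted input untouched
        }

        // Remove every comma that is followed by an odd number of apostrophes.
        return Regex.Replace(input, @",(?=[^']*'(?:[^']*'[^']*')*[^']*$)", string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveQuotedCommas("a,b,c,'d,e,f,g,h,i,j,k")); // unbalanced: unchanged
        Console.WriteLine(RemoveQuotedCommas("a,b,c,'d,d,d',e,f"));      // balanced: commas in quotes removed
    }
}
```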
5

They're called regular expressions for a reason — they are used to process strings that meet a very specific and academic definition for what is "regular". It looks like you have some fairly typical csv data here, and it happens that csv strings are outside of that specific definition: csv data is not formally "regular".

In spite of this, it can be possible to use regular expressions to handle csv data. However, to do so you must either use certain extensions to normal regular expressions to make them Turing complete, know certain constraints about your specific csv data that are not promised in the general case, or both. Either way, the expressions required to do this are unwieldy and difficult to manage. It's often just not a good idea, even when it's possible.

A much better (and usually faster) solution is to use a dedicated CSV parser. There are two good ones hosted at Code Project (FastCSV and Linq-to-CSV), there is one (actually several) built into the .NET Framework (Microsoft.VisualBasic.FileIO.TextFieldParser), and I have one here on Stack Overflow. Any of these will perform better and just plain work better than a solution based on regular expressions.

Note here that I'm not arguing it can't be done. Most regular expression engines today have the necessary extensions to make this possible, and most people parsing csv data know enough about the data they're handling to constrain it appropriately. I am arguing that it's slower to execute, harder to implement, harder to maintain, and more error-prone compared to a dedicated parser alternative, which is likely built into whichever platform you're using, and is therefore not in your best interests.
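As a rough illustration of the parser route, here is a minimal sketch using TextFieldParser on a single line. The class and method names are hypothetical, and the project needs a reference to the Microsoft.VisualBasic assembly. Note that the parser strips the surrounding quotes from each field, so the output follows the unquoted form asked for in the question (a,b,c,ddd,e,f) rather than keeping the quotes.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class ParserBasedCommaRemoval
{
    public static string RemoveQuotedCommas(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is nothing left to read.
            string[] fields = parser.ReadFields() ?? new string[0];

            // Commas that survive parsing were inside quoted fields; drop them.
            for (int i = 0; i < fields.Length; i++)
            {
                fields[i] = fields[i].Replace(",", string.Empty);
            }
            return string.Join(",", fields);
        }
    }

    public static void Main()
    {
        Console.WriteLine(RemoveQuotedCommas("a,b,c,\"d,d,d\",e,f"));
    }
}
```

For a whole file you would construct the parser over the file path or stream and loop while EndOfData is false.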

Community
Joel Coehoorn
  • 1
    Is CSV really not regular? Assuming we don't count the entries in each line, can't you easily validate it? – Kobi Mar 05 '11 at 09:16
  • @Kobi - you can't using regex, because of the possibility of nesting quotes and commas within a field with no defined limit for depth. You either need an artificial limit on depth, or you need a regex extension that supports recursion. Either way, you end up with complicated, slow expressions. A dedicated csv parser is just better. – Joel Coehoorn Mar 05 '11 at 19:45
  • What nesting is there in CSV? You need to match one level of quotes - inside them there could be anything, including newlines, commas and escaped quotes, but I don't think there's a second level or more (unless we're talking about an n-dimensional CSVs, but I'm being silly). Of course, I completely agree a CSV parser is usually better, if that's what you need (e.g., if the OP is trying to hack the input into being suitable for `split(',')`, but there's no indication that's the case). – Kobi Mar 05 '11 at 20:40
  • @Kobi - think of a field like this: `"a,"b,c,"d,e,f",g""` where you have csv data inside csv fields. This is a simple example and your expression likely handles it (I didn't check), but this could be nested indefinitely. – Joel Coehoorn Mar 05 '11 at 20:51
  • I get it. We're talking about different CSVs. As far as I know, CSV is used for a flat table, and literal quotes in values are escaped using `""`: http://en.wikipedia.org/wiki/Comma-separated_values . You're talking about a JSON-like CSV, which is interesting, and indeed has unlimited nesting. – Kobi Mar 05 '11 at 20:55
1

You can use this:

var result = Regex.Replace(yourString, "([a-z]),", "$1");

Sorry, after seeing your edits, regular expressions are not appropriate for this.

Jon
  • @Jon Freeland, it works for the string described in the question, but sorry, I forgot to specify that there may be more than one comma between the quotes, as in "a","b","c","d,d,d","e","f". That regex does not work for this case, and there can be any number of commas between the quotes. – Harikrishna Mar 05 '11 at 06:06
  • @Jon Freeland, it does not work for "a","b","c","d,d,d","e","f". – Harikrishna Mar 05 '11 at 06:16
  • @Jon, but if the string is like a,b,c,"d,d",e,f then it removes every comma. – Harikrishna Mar 05 '11 at 06:21
  • @Rebecca is correct, turns out regex are not suitable in this situation. – Jon Mar 05 '11 at 06:41
  • @Jon Freeland, then is there any other solution? Could there be two regexes, one for "a","b","c","d,d","e","f" and one for a,b,c,"d,d",e,f? The regex you specified works for the first string; is there another regex to remove the comma from the second kind of string? Or could we first edit the second string to look like the first by adding quotes to each value, so that a,b,c,"d,d",e,f becomes "a","b","c","d,d","e","f", and then use the same regex? – Harikrishna Mar 05 '11 at 06:52
  • @Harikrishna, if it's guaranteed that each letter will be surrounded by quotes, then the regex I have provided should work fine. However, please consider @Joel's answer. – Jon Mar 05 '11 at 06:56
  • @Jon Freeland, but if the string is like `a,b,c,"d,d",e,f` and I want to turn it into "a","b","c","d,d","e","f" by adding quotes to each field, can that be done with a regex? – Harikrishna Mar 05 '11 at 06:58
1
var input = "\"a\",\"b\",\"c\",\"d,d\",\"e\",\"f\"";
var regex = new Regex("(\"\\w+),(\\w+\")");
var output = regex.Replace(input,"$1$2");
Console.WriteLine(output);

You'd need to evaluate whether or not \w is what you want to use.

Rebecca Chernoff
1

This should be very simple using Regex.Replace and a callback:

string pattern = @"
""      # open quotes
[^""]*  # some not quotes
""      # closing quotes
";
data = Regex.Replace(data, pattern, m => m.Value.Replace(",", ""),
    RegexOptions.IgnorePatternWhitespace);

You can even make a slight modification to allow escaped quotes (here I use \"; the comments explain how to use "" instead):

string pattern = @"
\\.     # escaped character (the alternative is """")
|
(?<Quotes>
    ""              # open quotes
    (?:\\.|[^""])*  # some not quotes or escaped characters
                      # the alternative is (?:""""|[^""])*
    ""              # closing quotes
)
";
data = Regex.Replace(data, pattern,
            m => m.Groups["Quotes"].Success ? m.Value.Replace(",", "") : m.Value,
            RegexOptions.IgnorePatternWhitespace);

If you need single quotes, replace every "" in the pattern with a single '.
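For the "" style of escaping mentioned in the pattern comments, a variant might look like the sketch below. The class and method names are made up for illustration, and it assumes a literal quote inside a field is always written as two quotes. In that case no separate branch for escaped characters outside a field is needed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DoubledQuoteCommaRemoval
{
    // A quoted field in which a literal quote is written as two quotes ("").
    private const string QuotedField = @"""(?:""""|[^""])*""";

    public static string RemoveQuotedCommas(string data)
    {
        // The callback receives each quoted field and strips its commas.
        return Regex.Replace(data, QuotedField, m => m.Value.Replace(",", ""));
    }

    public static void Main()
    {
        Console.WriteLine(RemoveQuotedCommas("\"a \"\"x,y\"\" b\",c,\"d,d\""));
    }
}
```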

Kobi
  • But there should be a solution that handles both kinds of string, 'a','b','c','d,d','e','f' and "a","b","c","d,d","e","f", or a common regex that works for both. – Harikrishna Mar 05 '11 at 09:20
  • @Harikrishna - I've just added a comment on Alan's answer, explaining why I don't think that's very likely. – Kobi Mar 05 '11 at 09:23
  • It is now solved and works perfectly using both of the regexes in Alan Moore's answer and also the regex from your answer. Thank you very much for your great support. – Harikrishna Mar 05 '11 at 09:46
-1

Something like the following, perhaps?

"(,)"

Brian