0

I have description field which is:

16" Alloy Upgrade 

In CSV format it appears like this:

"16"" Alloy Upgrade "

What would be the best use of regex to maintain the original format? As I'm learning I would appreciate it being broke down for my understanding.

I'm already using Regex to split some text separating 2 fields which are: code, description. I'm using this:

 ,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))

My thoughts are to remove the quotes, then remove the delimiter excluding use in sentences.

Thanks in advance.

DarkShadowAY
  • 175
  • 4
  • 15
  • 1
    How are you forming the CSV file ? I would think that the double quote would get escaped `"16\" Alloy Upgrade"`. – HamZa Mar 03 '14 at 15:50
  • 1
    @HamZa: From memory (and its been a while) what he has showed is escaped with the standard doubling up of quotes. – Chris Mar 03 '14 at 15:53
  • @HamZa Just simply exporting through excel and saving as CSV file. After opening it with notepad++, I see the additional characters. – DarkShadowAY Mar 03 '14 at 15:53
  • What do you mean by "maintaining the original format"? What you seem to be having issue with is the fundamentals of how CSVs escape things. They quote delimit stuff to take into account commas (when commas are in quotes they are not separators) and then they need to escape the quotes which they do as above. Are you trying to then manually read these CSV files without a proper CSV parser? – Chris Mar 03 '14 at 15:54
  • The key thing is that CSV files are mostly human readable but you should not expect the text in them to be the same as the plain text. – Chris Mar 03 '14 at 15:55
  • @Chris I mean just having the text display as it shows without the doubling up of quotes. Maybe it would help if I explain what I'm trying to do. I'm working through this tutorial: https://arranmaclean.wordpress.com/2010/07/20/net-mvc-upload-a-csv-file-to-database-with-bulk-upload/ in importing a csv file to sql using MVC. The problem I'm having is that within SQL, I'm getting "16"" Alloy Upgrade " as opposed to 16" Alloy Upgrade – DarkShadowAY Mar 03 '14 at 15:57
  • @DarkShadowAY: OK. The problem is that that tutorial helpful remembers the need to do clever things with commas but not with quotes. My personal approach would be to just do it with string functions (eg once you've split take the first and last characters off if they are quotes and then just replace all `""` with `"`). Much more readable usually than regular expressions. :) – Chris Mar 03 '14 at 16:03
  • Trim (start and end) and use a replace for double? That's a potential workaround, good call Chris – DarkShadowAY Mar 03 '14 at 16:10
  • Use some CSV parser that should return values to you unescaped according to CSV rules, rather than parsing yourself: See [SO post here](http://stackoverflow.com/questions/906841/csv-parser-reader-for-c), or [here](http://stackoverflow.com/questions/2081418/parsing-csv-files-in-c-sharp) – LB2 Mar 03 '14 at 15:56

2 Answers2

1

If you don't want to/can't use a standard CSV parser (which I'd recommend), you can strip all non-doubled quotes using a regex like this:

Regex.Replace(text, @"(?!="")""(?!"")",string.Empty)

That regex will match every " character not preceded or followed by another ".

Steve Ruble
  • 3,875
  • 21
  • 27
  • This doesn't work perfectly if the original has doubled quotes (ie the escaped version has `""""`). It will then only strip one quote it seems. – Chris Mar 03 '14 at 16:05
  • This is almost perfect but it's stripping the escaped \ it needs for the split function. – DarkShadowAY Mar 03 '14 at 16:09
1

I wouldn't use regex since they are usually confusing and totally unclear what they do (like the one in your question for example). Instead this method should do the trick:

public string CleanField(string input)
{
    if (input.StartsWith("\"") && input.EndsWith("\""))
    {
        string output = input.Substring(1,input.Length-2);
        output = output.Replace("\"\"","\"");
        return output;
    }
    else
    {
        //If it doesn't start and end with quotes then it doesn't look like its been escaped so just hand it back
        return input;
    }
}

It may need tweaking but in essence it checks if the string starts and ends with a quote (which it should if it is an escaped field) and then if so takes the inside part (with the substring) and then replaces double quotes with single quotes. The code is a bit ugly due to all the escaping but there is no avoiding that.

The nice thing is this can be used easily with a bit of Linq to take an existing array and convert it.

processedFieldArray = inputfieldArray.Select(CleanField).ToArray();

I'm using arrays here purely because your linked page seems to use them where you are wanting this solution.

Chris
  • 27,210
  • 6
  • 71
  • 92
  • Thanks Chris, I was just near creating something similar however, mine didn't trim the first part as it was in a long string. I can't seem to access the scope of this function in ProcessCSV, any ideas? – DarkShadowAY Mar 03 '14 at 16:41
  • Edit: Made it a static, forgot about that. Getting an error, I'm using the data row Item Array, is that wrong use of its attributes? – DarkShadowAY Mar 03 '14 at 16:56
  • Edit 2: Got it working however the string I'm getting is "LM3,\"16\" Alloy Upgrade " which is the same problem as I was having earlier the test will always fail unless I split it and send the second part to clean it then return and rejoin. What do you think? – DarkShadowAY Mar 03 '14 at 17:02
  • As I understood it you had `"16"" Alloy Upgrade "` as one of your fields after splitting. The above function should clean this ok. What is the string you've just quoted there? And where did you get it from? Bear in mind that if you are looking at it in the debugger it will escape quotes when you are viewing it in the watch window or similar... – Chris Mar 03 '14 at 17:22
  • It seems like a restart got the filter working again. However I'm unable to change the ItemArray[pos]. It seems like I'm doing it wrong. What do you think? http://i.imgur.com/r0pSeFl.png?1 – DarkShadowAY Mar 04 '14 at 10:07
  • row[1] instead of row.Itemarray[1] seems to work. it's just not trimming the start. Everything else is okay. I get "\"16\" Alloy Upgrade \" which nets in "16" Alloy Upgrade – DarkShadowAY Mar 04 '14 at 10:37
  • Got it working using @Steve Ruble solutions paired with yours. Thanks! – DarkShadowAY Mar 04 '14 at 10:55
  • My thinking was along the lines of `row.ItemArray = r.Split(line).Select(CleanField).ToArray();` but if you've got it working then all is good. – Chris Mar 04 '14 at 11:44