I have a small problem creating a regular expression. Expected input:

blahblahblah, blahblahblah, 'blahblahblah', "blahblahblah, asdfd"

I need to get the comma-separated words into an array. However, I cannot use the split function, because a comma can also occur inside strings. So the expected output is:

arr[0] = blahblahblah
arr[1] = blahblahblah
arr[2] = 'blahblahblah'
arr[3] = "blahblahblah, asdfd"

Does anybody know a regular expression, or some other solution, that would give me similar output? Please help.

Ωmega
user35443
  • I just need to get words from input separated by comma. – user35443 Apr 04 '12 at 16:49
  • Looks suspiciously like CSV format. – Jodrell Apr 04 '12 at 16:51
  • Yes, I need values separated by comma. – user35443 Apr 04 '12 at 16:52
  • except when the comma is contained in double quotes, but what about double quotes within double quotes, is that allowed? – Jodrell Apr 04 '12 at 16:56
  • So, is this actually any line of CSV or is this problem limited exactly to your example and just pseudo CSV? – Jodrell Apr 04 '12 at 17:00
  • CSV does not support `'blahblah'`, just `blahblah` or `"blahblah"` – Ωmega Apr 04 '12 at 17:14
  • How do you want to handle strings like `"First "" item"`, as by CSV it is one string, because `""` is converted to `"` inside of the string item... – Ωmega Apr 04 '12 at 17:16
  • This is a twist on the classic [XY Problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Your actual problem is how to split input by commas, except ones in quotes. The title of your question makes no mention of your actual problem! This makes it less likely that you'll get the help you need. You're limiting the pool of answerers to people who are both interested enough in problem Y to read further, and know enough about problem X to give a good solution. – Kevin Apr 04 '12 at 17:27
  • Not sure how you want to handle spaces between items and newlines... – Ωmega Apr 04 '12 at 17:28
  • I suggest you to convert input to **CSV** standard and then use some technique for such standard... – Ωmega Apr 04 '12 at 17:29

4 Answers

0

I'm not sure this is optimal, but it produced the correct output from your test case on http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx:

(?>"[^"]*")|(?>'[^']*')|(?>[^,\s]+)

C# string version:

@"(?>""[^""]*"")|(?>'[^']*')|(?>[^,\s]+)"
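As a rough illustration of how the pattern above might be applied, here is a minimal sketch that collects every match into a list. The class name `RegexFieldSplitter` and the method name `Split` are hypothetical, chosen only for this example; the pattern itself is copied unchanged from the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class RegexFieldSplitter
{
    // Pattern from the answer: a double-quoted run, a single-quoted run,
    // or a run of characters that are neither commas nor whitespace.
    private static readonly Regex FieldPattern =
        new Regex(@"(?>""[^""]*"")|(?>'[^']*')|(?>[^,\s]+)");

    public static List<string> Split(string line)
    {
        var fields = new List<string>();

        // Each match is one field; commas and spaces between fields
        // are simply never matched, so they are skipped.
        foreach (Match m in FieldPattern.Matches(line))
        {
            fields.Add(m.Value);
        }

        return fields;
    }
}
```

Calling `RegexFieldSplitter.Split(line)` on the sample input from the question should give the four items listed there, quotes included. Because the last alternative excludes whitespace, an unquoted item containing a space would come back as two separate matches.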
FishBasketGordo
  • Will **not** work for `"first "" item", "Second Item", Third` – Ωmega Apr 04 '12 at 17:13
  • @stackoverflow - Yes, and I didn't expect it to. This requires the quoted strings to not contain similar quotation marks. As I said, it produces the correct output for the (limited) test case given. – FishBasketGordo Apr 04 '12 at 17:36
  • @FishBasketGordo - your code works for limited specification, which is what user35443 asked for... – Ωmega Apr 04 '12 at 17:44
0

One possible approach is to split by commas (using string.Split, not RegEx) and then iterate over the results. For each result that contains 0 or 2 ' or " characters, add it to a new list. When a result contains 1 ' or ", re-join subsequent items (adding a comma) until the result has 2 ' or ", then add that to the new list.
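The split-and-rejoin idea described above could be sketched roughly as follows. The names `SplitAndRejoin` and `Split` are hypothetical. Two details are assumptions on my part rather than part of the answer: the "0 or 2" rule is generalized to "an even number of quote characters", and each item is trimmed so that the space after a comma is dropped, as in the question's expected output.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the original answer.
public static class SplitAndRejoin
{
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        string pending = null;   // partial item whose quote is still open

        foreach (string piece in line.Split(','))
        {
            // Start a new candidate, or glue the piece back on and
            // restore the comma that string.Split removed.
            pending = pending == null ? piece : pending + "," + piece;

            // Count both quote styles together, as the answer describes.
            int quotes = pending.Count(ch => ch == '"' || ch == '\'');

            // Assumption: an even count means every quote has been closed.
            if (quotes % 2 == 0)
            {
                fields.Add(pending.Trim());   // Trim is an assumption, see above
                pending = null;
            }
        }

        // An unterminated quote leaves text behind; keep it rather than drop it.
        if (pending != null)
        {
            fields.Add(pending.Trim());
        }

        return fields;
    }
}
```

Since both quote characters are counted together, an apostrophe inside a double-quoted item would throw the count off, so this sketch only suits input shaped like the example in the question.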

Jay
  • Oh, well that's a simple solution. – Mooing Duck Apr 04 '12 at 17:26
  • @MooingDuck - are you serious? – Ωmega Apr 04 '12 at 17:30
  • @stackoverflow: This is not the fastest or most elegant answer, but it's very simple to understand, and gets the right results. I can't validate the rest of the answers, because those Regexes are beyond me. This and Jodrell's are the only suggestions that _I_ could do. – Mooing Duck Apr 04 '12 at 17:43
0

You could do something like this, given the limited problem. The Regex is shorter and possibly simpler.

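using System.Collections.Generic;
using System.Text;

// yield return is only valid inside an iterator method, so the loop is wrapped in one.
static IEnumerable<string> SplitFields(string line)
{
    var result = new StringBuilder();
    var inQuotes = false;

    foreach (char c in line)
    {
        switch (c)
        {
            case '"':
                // Keep the quote character and toggle the quoted state.
                result.Append(c);
                inQuotes = !inQuotes;
                break;

            case ',':
                if (inQuotes)
                {
                    // A comma inside double quotes belongs to the field.
                    result.Append(c);
                }
                else
                {
                    // An unquoted comma ends the field. Trim drops the
                    // space that follows each comma in the sample input.
                    yield return result.ToString().Trim();
                    result.Clear();
                }
                break;

            default:
                result.Append(c);
                break;
        }
    }

    // The last field has no trailing comma, so emit it after the loop.
    yield return result.ToString().Trim();
}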
Mooing Duck
Jodrell
  • **user35443** wants also support of `'`, not just `"`, even it is not standard behavior... – Ωmega Apr 04 '12 at 17:26
  • @user35443 - So then you should edit your question, because you accepted answer that is not what question is asking for... And SO is here for other readers as well, so don't confuse them. – Ωmega Apr 04 '12 at 17:46
  • The use of `yield return` and fallthrough case blocks is not recommended. However, I do like the concept. Fast and easy to understand. Also: @stackoverflow: Simple fix. – Mooing Duck Apr 04 '12 at 17:47
  • @MooingDuck - I meant to edit question, not answer. Your edit makes code useless, as it will now match `"one', 'two"` as two elements! – Ωmega Apr 04 '12 at 17:53
  • @stackoverflow: Ah, didn't think of nesting. I rolled the edit back, that's a much more substantial edit than I thought it was. – Mooing Duck Apr 04 '12 at 17:54
0

Instead of rolling your own CSV parser, consider using the standard, out-of-the-box TextFieldParser class that ships with the .NET Framework.
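For readers who have not used it, a minimal sketch of reading comma-delimited text with `TextFieldParser` might look like the following. The wrapper class `CsvLineReader` and its `ReadAll` method are hypothetical names for this example; the parser members used are the ones from `Microsoft.VisualBasic.FileIO`, which means a C# project on the .NET Framework needs a reference to the Microsoft.VisualBasic assembly.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

// Hypothetical wrapper for illustration; not part of the original answer.
public static class CsvLineReader
{
    public static List<string[]> ReadAll(string text)
    {
        var rows = new List<string[]>();

        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");

            // Lets a comma appear inside a double-quoted field.
            parser.HasFieldsEnclosedInQuotes = true;
            parser.TrimWhiteSpace = true;

            while (!parser.EndOfData)
            {
                // One string[] per input line.
                rows.Add(parser.ReadFields());
            }
        }

        return rows;
    }
}
```

Note that the parser only treats double quotes as field delimiters, so a single-quoted item such as `'blahblahblah'` is handled as ordinary text, and the enclosing double quotes are not kept in the returned field.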

Alternatively, use the Microsoft ACE provider and an OleDbDataReader to read the files directly through ADO.NET. A sample can be found in a number of other posts, like this one. And there's this older post on CodeProject which you can use as a sample. Just make sure you're referencing the latest ACE driver instead of the old Jet.OLEDB.4.0 driver.

These options are a lot easier to maintain in the long run than any custom built file parser. And they already know how to handle the many corner cases that surround the not so well documented CSV format.

Community
jessehouwing