0

Question:

I have a function that searches ODBC escape sequences, and then recursively replaces them with their native equivalent.

For example, given an ODBC escape sequence like: {fn concat(column1, column2)}

I replace it with its native SQL equivalent like this

if (StringComparer.OrdinalIgnoreCase.Equals("concat", strFunctionName)) {
    strArguments = strArguments;

    string[] astrArguments = strArguments.Split(',');
    string strTerm = astrArguments[0] + " || " + astrArguments[1];

    return strTerm;
}

Which yields column1 || column2 (PostGreSql syntax)

Generally speaking, I have to resolve the string "arglist_comma_separated" in {fn whatever(arglist_comma_separated)}

into its components

Needless to say, that with my current version, it doesn't work if argument 1 in arglist is something like

'hello, world'

Even worse, if I concat family and first name for example, it could be {fn whatever('d''Amato', 'Alberto')}

How would I best split the arguments ?

How can I do this with regex or by parsing ?

Edit:
Hmm, thought about it again, and had a splendid idea (replace '' with an escape character).
With that, the problem complexity has degraded from medium-high to trivial for a non-regex solution:
Addendum: Actually, only almost trivial, i forgot to consider function nesting in the splitting - stupid.

    protected static string[] GetArguments(string strAllArguments)
    {
        string EscapeCharacter = System.Convert.ToChar(8).ToString();

        strAllArguments = strAllArguments.Replace("''", EscapeCharacter);

        bool bInString = false;
        int iLastSplitAt = 0;
        int iInFunction = 0;

        System.Collections.Generic.List<string> lsArguments = new System.Collections.Generic.List<string>();

        for (int i = 0; i < strAllArguments.Length; i++)
        {
            char strCurrentChar = strAllArguments[i];

            if (strCurrentChar == '\'')
                bInString = !bInString;

            if (bInString)
                continue;


            if (strCurrentChar == '(')
                 iInFunction++;

            if (strCurrentChar == ')')
                 iInFunction--;

            if (strCurrentChar == ',')
            {
                if(iInFunction == 0)
                {
                    string strExtract = strAllArguments.Substring(iLastSplitAt, i - iLastSplitAt);
                    strExtract = strExtract.Replace(EscapeCharacter, "''");
                    lsArguments.Add(strExtract);
                    iLastSplitAt = i;
                }
            }

        }


        string strExtractLast = strAllArguments.Substring(iLastSplitAt + 1);
        strExtractLast = strExtractLast.Replace(EscapeCharacter, "''");
        lsArguments.Add(strExtractLast);

        string[] astrResult = lsArguments.ToArray();
        lsArguments.Clear();
        lsArguments = null;

        return astrResult;
    }
Stefan Steiger
  • 78,642
  • 66
  • 377
  • 442
  • 1
    A similar question, where the conclusion could be "Don't do it with a regex": http://stackoverflow.com/questions/4150697/regular-expression-to-split-on-comma-except-if-quoted – David Brabant Mar 26 '12 at 09:01
  • @DavidBrabant while I agree regexes are not silver bullets, I couldn't think of a better way to handle this case. The arguments here conforms to a regular language (or the SQL lexer wouldn't be able to parse them), so it makes sense to match **each** of them with a regex. In the other question, the OP wanted to match/split **all** of them with **a single** regex, and I totally agree with you that *that* wouldn't be advisable. – mgibsonbr Mar 26 '12 at 09:49
  • @Quandary +1 your new solution looks fine to me, as long as you ensure the `EscapeCharacter` can't possibly occur in your original input. I'd only suggest moving it out of the question and posting it [as a new answer](http://meta.stackexchange.com/q/16930/177950). – mgibsonbr Mar 28 '12 at 08:02

1 Answers1

1

(Edit: I cleaned up this answer a lot, since the many feedback iterations; below are my conclusions on the subject)

Parsing a complex language is a hard task, so I'm assuming you narrowed down your problem to treating a comma-separated list of token values (such as strings, numbers, simple identifiers, etc - not complex expressions). If I'm mistaken, you probably have a bigger problem in your hands than you imagine. In that case, I'd suggest this question as a starting point.

The simplest solution - splitting on , - doesn't work mainly because of strings, since a comma can appear inside a string. Parsing a string is a straightforward task, assuming you deal with escape characters correctly: it starts with a quote, has zero or more characters and ends with another quote.

In most languages, if a string is delimited by ', you can escape a quote in it using \'. SQL interprets a '' inside a string as an escaped quote. If you know only one of those forms will be present, you can disregard the other. In my answer below, I decided to include both.

Also, some languages accept both single quotes (') and double quotes (") to delimit a string. The same observations about escaped characters apply. My solution also deals with both forms.

Besides strings, it's also important to specify what are valid characters for an argument. For simplicity, I assumed it would be "anything that is not a comma". For the same reason, my proposed solution will accept any number of strings and non-strings, and will group them together, returning them as a single entity (reiterating that, if complex expressions are expected, a more general parsing technique should be employed instead of this simple solution).

One way of implementing this would be looping through the chars while applying the logic above, as you've done in your recent update. Another would be using a regex. The regex has as pros better performance (usually) and a cleaner code, less error-prone. The main con is the complexity of the regex itself, since a "dense" format can be harder to understand/maintain.

My proposed regex would be then (spaces/newlines added for readability):

(
    (?:  \'   (?: ['\\]\' | [^'] )*   \'  |
         \"   (?: ["\\]\" | [^"] )*   \"  |
         [^,'"]
    )+
)
(?: \, | $)

In short format:

((?:\'(?:['\\]\'|[^'])*\'|\"(?:["\\]\"|[^"])*\"|[^,'"])+)(?:\,|$)

Each string accepts as "characters" either escaped quotes (' or \ followed by ') or anything that is not a quote. The match (the big capturing group) must be followed by either a , or the end of the input.

A live example of the regex above can be seen here (the example uses Ruby, but should work equally in C#). As long as the whole input is matched (i.e. no unmatched substring exists), each match will capture the argument correctly. Warning: malformed inputs will produce incorrect outputs, so the regex above must not be used for validation.

To employ this solution in your C# code, you can use Regex.Matches:

MatchCollection matches = Regex.Matches(strArguments, "((?:\'(?:['\\]\'|[^'])*\'|\"(?:["\\]\"|[^"])*\"|[^,'"])+)(?:\,|$)");
string[] arguments = from m in matches select m.Captures[1].Value;

As noted above, you must also ensure matches cover the whole input. I leave that as an exercise for the reader... ;)

Notes:

  1. I'm assuming the results of Matches are non-overlapping; if I'm mistaken, the code above must be adapted to do each match starting at the index the previous one ends;
  2. I'm also assuming, as usual, that the capturing group #0 will be the whole match, and the #1 will be first capturing group.
Community
  • 1
  • 1
mgibsonbr
  • 21,755
  • 7
  • 70
  • 112
  • That doesn't work on strArguments = "'hello d''someapostroph, world', 'test', foo"; So it will fail on {fn concat('d''Amato', 'Alberto')}. Arguments can, but don't have to be enclosed in quotes. – Stefan Steiger Mar 26 '12 at 07:55
  • Sorry, was editing my answer to fix some mistake, and didn't see the comment. The new regex should work, see [this example](http://rubular.com/r/LFTPv0fQem) – mgibsonbr Mar 26 '12 at 08:10
  • I correct myself: It doesn't work. This regex is much worse then my split function. It already fails at: "AND {fn LCASE(BE_User)} LIKE {fn LCASE({fn CONCAT(@strTerm, '%')} )} ". It takes away the at. Then it fails on many more places as well. – Stefan Steiger Mar 27 '12 at 12:40
  • You must specify what is considered valid in the unescaped part for the regex to work. Replace `\w+` for `[\w@]+` and you fix the "@", for instance. Another option would be just matching "whatever's not a comma", be it valid input or not... See the update above. – mgibsonbr Mar 27 '12 at 22:45
  • mgibsonbr: Now failing on "{fn concat({fn concat('(', DG_Code)}, ') ')}, DG_Lang_DE" with strArguments = "'(' || DG_Code, ') '"; – Stefan Steiger Mar 28 '12 at 05:00
  • From what I see, the regex output for that input is fine: test string = `'(' || DG_Code, ') '`, match 1 = `'(' || DG_Code`, match 2 = `') '`. What result were you expeting? – mgibsonbr Mar 28 '12 at 05:26
  • BTW are you using the output of one iteration as the input of another one? (from what I understood: `{fn concat({fn concat('(', DG_Code)}, ') ')}` => `{fn concat('(' || DG_Code, ') ')}`) This is not a good idea IMHO, regexes are good to tokenize inputs, not to do a complete parse. If one of your inputs can be a complex expression, then regex is the wrong tool for that. That said, if you just join them with `||` you'll get `'(' || DG_Code || ') '`, isn't that correct? – mgibsonbr Mar 28 '12 at 06:41