Here is my best attempt (so far) to solve this issue. I'm new to regular expressions and this problem is pretty substantial, but I'll give it a try. Regular expressions clearly take some time to master.

This seems to satisfy the delimiter/comma requirements. To me it seems redundant, though, because of the repeated \s*. There is likely a better way.

/\s*[,|\s*]\s*/
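One thing worth checking in this pattern: inside a character class, `|` and `*` are literal characters, so `[,|\s*]` also matches a pipe or an asterisk. Below is a minimal sketch of one alternative, assuming the delimiter rule stated in the requirements (whitespace, optional comma, whitespace, or just whitespace). The variable names and the sample input are made up, and it does not handle quoted substrings.

```php
<?php
// Hypothetical sample input; not taken from the question.
$input = 'one two , three,four  ,  five';

// Either (optional whitespace)(comma)(optional whitespace),
// or a plain run of one or more whitespace characters.
// PREG_SPLIT_NO_EMPTY drops any empty pieces from the result.
$terms = preg_split('/\s*,\s*|\s+/', $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($terms); // one, two, three, four, five
```

Writing the two cases as an alternation keeps the comma optional without needing a character class at all.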

I found this on Stack Overflow and am trying to take it apart and apply it to my problem (not easy). This seems to satisfy most of the "quoting" requirements, but I'm still working on how to solve the delimiter issues in the requirements below.

/"(?:\\\\.|[^\\\\"])*"|\S+/
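To see what this pattern does on its own, here is a small sketch with a made-up input. It keeps each quoted phrase (quotes included) as one element, but because `\S+` matches any run of non-whitespace, a free-standing comma comes back as its own element, which is the delimiter problem still to be solved.

```php
<?php
// In a single-quoted PHP string, each "\\\\" reaches the regex engine
// as "\\", which matches one literal backslash.
$pattern = '/"(?:\\\\.|[^\\\\"])*"|\S+/';

// Hypothetical sample input; not taken from the question.
$input = 'one "two  three" four , five';

// The matches are written into $matches; index 0 holds the full matches.
preg_match_all($pattern, $input, $matches);

print_r($matches[0]); // one, "two  three", four, ",", five
```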

The requirements I'm trying to meet:

  • Will be used by the PHP preg_match_all() (or similar) function to break a string into an array of strings. Source language is PHP.
  • Words in the input string are delimited by (0 or more whitespace)(optional comma)(0 or more whitespace) or just (1 or more whitespace).
  • The input string can also have quoted substrings which become a single element in the output array.
  • Quoted substrings in the input string must retain their double quotes when placed in the output array (because we must be able to identify them later as being originally quoted in the input string).
  • Leading and trailing whitespace (that is, whitespace between the double-quote character and the string itself) in quoted substrings must be removed when placed into the output array. Example: "<space>hello<space>world<space><tab>" becomes "hello<space>world"
  • Whitespace within quoted phrases in the input string must be reduced to a single space when placed into its output array element. Example: "hello<space><tab><space><space>world" becomes "hello<space>world"
  • Quoted substrings in the input string that are zero-length or contain only whitespace are not placed into the output array (The output array must not contain any zero-length elements).
  • Each element of the output array must be trimmed (left and right) for whitespace.

This example demonstrates all requirements above:

Input String:

"" one " two     three " four  ,  five "   six seven " " "

Returns this array (double quotes actually exist in the strings shown below):

{one,"two three",four,five,"six seven"}

EDIT 9/13/2013

I have been studying regular expressions hard for a couple of days and have finally settled on this proposed solution. It may not be the best, but it's what I have at this time.

I will use this regex to split the search string into an array using PHP's preg_match_all() function:

/(?:"([^"]*)"|([^\s",]+))/

The leading and trailing "/" are the pattern delimiters that PHP's preg functions, including preg_match_all(), require.

Once the function has run, the matches are read from the array passed as its third argument:

preg_match_all(REGEX, $SearchString, $Matches);
$Array = $Matches[0];

We have to do this because the return value of preg_match_all() is only the number of matches. The matches themselves are written into the third argument as a compound array, and element 0 contains the full text of each match. The other elements contain the values captured by the groups in the regex, which we don't need.
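As a quick illustration of how preg_match_all() hands back its results, here is a sketch that runs the regex above on the example input from this question. The variable names are my own choice; the return value is the match count, and the matches land in the third argument.

```php
<?php
// Example input from the question.
$input = '"" one " two     three " four  ,  five "   six seven " " "';

// $count receives the number of matches; $matches is filled by reference.
$count = preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $input, $matches);

echo $count, "\n";    // 7
print_r($matches[0]); // full text of each match, quotes included
```

Note that `$matches[0]` still contains the raw `""` and `" "` elements and the unnormalized whitespace inside the quoted phrases, so a clean-up pass is still needed afterwards.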

Now, I will iterate the resulting array and process each element to meet the requirements (above), which will be much easier than meeting all the requirements in a single step using a single regex.

Russ
  • That's a really detailed requirement list, but you are just asking us to make the pattern for you. We don't like doing that too much. We would much rather have you *attempt* to solve the problem, and come to us when you need help. Because otherwise, this just sounds like homework, and we don't like doing others' homework. – gunr2171 Sep 08 '13 at 00:58
  • With a question like that, you should be asking what rates we charge. – Andy Lester Sep 08 '13 at 03:55
  • I appreciate your comments. I am new to regular expressions and have been studying them for hours now trying to figure this out, but this problem is pretty big and well over my head at this time. Regular expressions clearly take some time to master. Nevertheless, I'll do my best. – Russ Sep 08 '13 at 08:44
  • @Russ Hello there and +1 for the detailed question. I want to mention that this is impossible to do with a single regular expression. I'll help you with a good start. Take a look at [this regex](http://regex101.com/r/xZ8yZ3), loop through group 1 and try to filter things out with PHP (you may use another regex :P ?). This is inspired by [this](http://stackoverflow.com/questions/17848618/parsing-command-arguments-in-php/18217486#18217486) answer (for further reading and explanation). – HamZa Sep 08 '13 at 09:35
  • OK Thanks. I didn't realize it would take more than 1 expression/pass. I'll check out your links and I'll work on breaking the problem down into steps. If it takes multiple steps, that's no problem. I'm going to have to work on this until it's solved. Thanks. – Russ Sep 08 '13 at 09:57

1 Answer

I have finally developed a solution for this problem. It uses a few PHP statements built on regular expressions. Below is the final function.

This function is part of a class, which is why it begins with "public".

public function SearchString_ToArr($SearchString) {
    /*
    Purpose
        Used to parse the specified search string into an array of search terms.
        Search terms are delimited by <0 or more whitespace><optional comma><0 or more whitespace>
    Parameters
        SearchString (string) = The search string we're working with.
    Return (array)
        Returns an array using the following rules to parse the specified search string:
            - Each search term from the search string is converted to a single element in the returned array.
            - Search terms are delimited by whitespace and/or commas, or they may be double quoted.
            - Double-quoted search terms may contain multiple words.
        Unquoted Search Terms:
            - These are delimited by any number of whitespace characters or commas in the search string.
            - These have all leading and trailing whitespace trimmed.
        Quoted Search Terms:
            - These are surrounded by double-quotes in the search string.
            - These retain leading and trailing double-quotes in the returned array.
            - These have all leading and trailing whitespace trimmed.
            - These may contain whitespace.
            - These have all containing whitespace converted into a single space.
            - If these are zero-length or contain only whitespace, they are not included in the returned array.
        Example 1:
            SearchString =  ' "" one " two   three " four "five six" " " '
            Returns {"one", ""two three"", "four", ""five six""}
            Notes   The leading whitespace before the first "" is not returned.
                    The first quoted phrase ("") is empty so it is not returned.
                    The term "one" is returned with leading and trailing whitespace removed.
                    The phrase "two three" is returned with leading and trailing whitespace removed.
                    The phrase "two three" has containing whitespace converted to a single space.
                    The phrase "two three" has leading and trailing double-quotes retained.
                    ...
    Version History
        1.0 2013.09.18 Tested by Russ Tanner on PHP 5.3.10.
    */

    $r = array();
    $Matches = array();

    // Split the search string into an array based on whitespace, commas, and double-quoted phrases.
    preg_match_all('/(?:"([^"]*)"|([^\s",]+))/', $SearchString, $Matches);
    // At this point:
    //  1. all quoted strings have their own element and begin/end with the quote character.
    //  2. all non-quoted strings have their own element and are trimmed.
    //  3. empty non-quoted strings are omitted; empty quoted strings ("" or " ") are still present.

    // Normalize quoted elements...
    // Convert every run of whitespace (even a single tab) to a single space.
    $r = preg_replace('/\s+/', ' ', $Matches[0]);
    // Remove all whitespace between the double-quotes and the string.
    $r = preg_replace('/^"\s+/', '"', $r);
    $r = preg_replace('/\s+"$/', '"', $r);
    // Drop quoted terms that are now empty ("") and re-index the array.
    $r = array_values(array_diff($r, array('""')));

    return $r;
}
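
The following standalone sketch runs the same match-then-normalize steps outside a class and checks them against the example input from the question. The function name `parse_search_terms` is hypothetical. The final filter step is there so that `""` elements left behind by empty quoted phrases do not reach the output.

```php
<?php
// Hypothetical standalone helper; not part of the original class.
function parse_search_terms($searchString)
{
    // Step 1: collect quoted phrases and bare words.
    // Commas and whitespace between terms are simply not matched.
    preg_match_all('/"[^"]*"|[^\s",]+/', $searchString, $matches);

    // Step 2: collapse each run of whitespace to a single space.
    $terms = preg_replace('/\s+/', ' ', $matches[0]);

    // Step 3: strip the (now single) space just inside each quote.
    $terms = preg_replace('/^" /', '"', $terms);
    $terms = preg_replace('/ "$/', '"', $terms);

    // Step 4: drop quoted phrases that ended up empty, then re-index.
    return array_values(array_diff($terms, array('""')));
}

// Example input from the question.
$input = '"" one " two     three " four  ,  five "   six seven " " "';
$result = parse_search_terms($input);

print_r($result); // one, "two three", four, five, "six seven"
```

Doing the filtering last keeps the first regex simple: it does not need to know whether a quoted phrase is empty, only where it starts and ends.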
Russ