
Given a string like `"\"turkey AND ham\" NOT \"roast beef\""`, I need to get an array of the inner strings, like so: `["turkey AND ham", "roast beef"]`, and eliminate any ORs, ANDs and NOTs that may or may not be there.

With the help of Rubular I came up with this regex: `/\\["']([^"']*)\\["']/`

which returns the following 2 groups:

Match 1
1. turkey AND ham

Match 2
1. roast beef

However, when I use it with `.scan` I keep getting an empty array.

I looked at this and this other SO post, and a few others, but cannot figure out where I am going wrong.

Here is the result from my rails console:

q = "\"turkey and ham\" OR \"roast beef\""
q.scan(/\\["']([^"']*)\\["']/)
=> []

Expectation: ["turkey AND ham", "roast beef"]

I should also mention that I suck at regex.

Jax
  • 2
    You seem to over-escape the pattern. Use `q.scan(/["']([^"']*)["']/)`. With the double backslashes you defined a literal backslash, and since there is no backslash in the string, nothing matches. – Wiktor Stribiżew Oct 13 '16 at 17:30
  • 1
    To expand on what @WiktorStribiżew stated: your actual string is `'"turkey AND ham" NOT "roast beef"'`. The `\` characters are there only to escape the double quotes in the output, and the regex he posted will work correctly. [Example](http://rubular.com/r/kW2pP3zjum) – engineersmnky Oct 13 '16 at 17:37

2 Answers


When the regex used with scan contains a capture group (@davidhu2000's approach), one can generally use lookarounds¹ instead. It's just a matter of personal preference. To handle a string whose quoted substrings are delimited by either single quotes or (escaped) double quotes, you could use the following regex.

r = /
    (?<=") # match a double quote in a positive lookbehind
    [^"]+  # match one or more characters that are not double-quotes
    (?=")  # match a double quote in a positive lookahead
    |      # or
    (?<=') # match a single quote in a positive lookbehind
    [^']+  # match one or more characters that are not single-quotes
    (?=')  # match a single quote in a positive lookahead
    /x    # free-spacing regex definition mode

"\"turkey AND ham\" NOT 'roast beef'".scan(r)
  #=> ["turkey AND ham", "roast beef"]

Since the single-quoted literal '"turkey AND ham" NOT "roast beef"' evaluates to "\"turkey AND ham\" NOT \"roast beef\"" (that is how Ruby displays the same string), the two ways of writing the literal are not separate cases to deal with.
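One caveat seems worth checking before relying on the lookaround pattern. Because lookarounds do not consume the quotes, the closing quote of one phrase can also act as the opening quote of the text that follows it. The sketch below assumes the question's original input, where both phrases use double quotes; the `paired` pattern and the variable names are illustrative and not part of the answer above.

```ruby
# Assumption: the input is the question's original string, where both phrases
# are delimited by double quotes.
q = '"turkey AND ham" NOT "roast beef"'

# Same pattern as above, written on one line.
lookaround = /(?<=")[^"]+(?=")|(?<=')[^']+(?=')/
lookaround_result = q.scan(lookaround)
#=> ["turkey AND ham", " NOT ", "roast beef"]

# Consuming the quotes, and requiring the closing quote to equal the opening
# one through the backreference \1, keeps a closing quote from being reused
# as the next opening quote.
paired = /(["'])(.*?)\1/
paired_result = q.scan(paired).map(&:last)
#=> ["turkey AND ham", "roast beef"]
```

With mixed delimiters, as in the example above, both patterns return the same two phrases; the difference only shows when two neighbouring phrases share the same kind of quote.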

¹ For anyone in the audience who still considers regular expressions to be black magic: there are four kinds of lookarounds (positive and negative lookbehinds and lookaheads), as described in the documentation for Regexp. They are often called "zero-width" assertions because they are not part of the matched text.
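As a small illustration of the four kinds of lookaround mentioned in the footnote, here is a sketch on a made-up sample string (the string and variable names are only for demonstration). In each case the lookaround constrains where a match may occur without becoming part of the matched text.

```ruby
s = "a1 b2 c3"

# Positive lookahead: a letter that IS followed by "2".
pos_ahead  = s.scan(/[a-z](?=2)/)   #=> ["b"]

# Negative lookahead: a letter that is NOT followed by "2".
neg_ahead  = s.scan(/[a-z](?!2)/)   #=> ["a", "c"]

# Positive lookbehind: a digit that IS preceded by "b".
pos_behind = s.scan(/(?<=b)\d/)     #=> ["2"]

# Negative lookbehind: a digit that is NOT preceded by "b".
neg_behind = s.scan(/(?<!b)\d/)     #=> ["1", "3"]
```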

Cary Swoveland
  • Elegant solution without the need to flatten any array. Though I still consider regex _a sort of black magic_ :) – Jax Oct 14 '16 at 08:31

Your regex is trying to match a literal \, which won't match anything in the string: the \ in the string literal exists only to escape the double quote and is not part of the string itself.
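A quick way to confirm this is to inspect the string directly. The sketch below also offers a possible explanation for why the pattern appeared to work on Rubular; this is an assumption, since the question does not show what was pasted there. If the test string was entered with its backslashes, they would have been literal characters in that input.

```ruby
q = "\"turkey and ham\" OR \"roast beef\""

# The backslashes are escape syntax in the literal; they are not stored.
has_backslash = q.include?("\\")                           #=> false
same_string   = (q == '"turkey and ham" OR "roast beef"')  #=> true

# The pattern from the question requires a literal backslash before each quote.
over_escaped = /\\["']([^"']*)\\["']/
in_ruby = q.scan(over_escaped)                             #=> []

# Hypothetical reconstruction of text pasted into a regex tester: inside a
# single-quoted Ruby literal, \" stays as two characters (backslash + quote).
pasted = '\"turkey and ham\" OR \"roast beef\"'
in_tester = pasted.scan(over_escaped)
#=> [["turkey and ham"], ["roast beef"]]
```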

So if you remove the \\ from your regex:

res = q.scan(/["']([^"']*)["']/)

This returns a two-dimensional (nested) array:

res = [["turkey and ham"], ["roast beef"]]

Each inner array holds the capture groups of one match, so if your regex has two capture groups, each inner array will contain two items.

If you want a flat array, you can call the flatten method on the result.
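To make the shape of scan's return value concrete, here is a minimal sketch. The key=value string is a made-up example used only to show a pattern with two capture groups.

```ruby
pairs = "k1=v1 k2=v2"

# No capture groups: scan returns the whole matches as a flat array.
whole  = pairs.scan(/\w+=\w+/)       #=> ["k1=v1", "k2=v2"]

# Two capture groups: one inner array per match, one item per group.
groups = pairs.scan(/(\w+)=(\w+)/)   #=> [["k1", "v1"], ["k2", "v2"]]

# One capture group, as in the regex above: single-item inner arrays.
q = '"turkey and ham" OR "roast beef"'
nested = q.scan(/["']([^"']*)["']/)  #=> [["turkey and ham"], ["roast beef"]]

flat_a = nested.flatten              #=> ["turkey and ham", "roast beef"]
flat_b = nested.map(&:first)         # same result when there is one group
```

With a single group, flatten and map(&:first) give the same result; with several groups, flatten merges all of them into one list, so picking the wanted group explicitly is usually clearer.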

davidhu