Regex match between two arrays of strings

Question

I have two arrays

sentences_ary = ['This is foo', 'bob is cool'] 

words_ary = ['foo', 'lol', 'something']

I want to check if any element from sentences_ary matched any word from words_ary.

I'm able to check for one word, but could not do it with words_ary.

#This is working
['This is foo', 'bob is cool'].any? { |s| s.match(/foo/)}

But I'm not able to make it work with an array of words. I'm always getting true from this:

# This is not working    
['This is foo', 'bob is cool'].any? { |s| ['foo', 'lol', 'something'].any? { |w| w.match(/s/) } }

I'm using this in an if condition.

What about `s.match(w)`? – Explosion Pills Nov 19 '14 at 17:20
That worked!! Write it as an answer and I will accept it. – Ashwin Yaprala Nov 19 '14 at 17:23
Any particular reason you want to use a regex? – Cary Swoveland Nov 19 '14 at 19:41
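
For reference, here is a minimal sketch of the fix suggested in the comment above. The original block always returned true because `/s/` is a literal regex for the letter "s" (not the block variable `s`), and 'something' contains an "s". The helper name `any_sentence_matches?` is made up for illustration. Note that `String#match` compiles `w` as a regex, so if the words may contain metacharacters, `Regexp.escape(w)` or `s.include?(w)` is probably safer.

```ruby
# Hypothetical helper name, not from the original post.
def any_sentence_matches?(sentences, words)
  # Test each sentence (s) against each word (w), not the word against /s/.
  sentences.any? { |s| words.any? { |w| s.match(w) } }
end

sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

puts any_sentence_matches?(sentences_ary, words_ary)        # prints "true"
puts any_sentence_matches?(sentences_ary, ['lol', 'nope'])  # prints "false"
```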

score 2 · Answer 1 · answered Nov 19 '14 at 17:23

You could use Regexp.union and Enumerable#grep:

sentences_ary.grep(Regexp.union(words_ary))
#=> ["This is foo"]

Stefan


This is cleaner than my code. This is awesome. – Ashwin Yaprala Nov 19 '14 at 17:29
Be careful using `Regexp.union`. The resulting pattern will match words, and sub-strings embedded in words. – the Tin Man Nov 19 '14 at 18:05
Yeah, I verified it. I really wanted this answer. I'm padding the words in words_ary with spaces when substring matches are not wanted. Thanks for the info @theTinMan – Ashwin Yaprala Nov 19 '14 at 18:18
Using spaces isn't the way to go if a word could occur at the start or end of a string, or have punctuation before or after it. – the Tin Man Nov 19 '14 at 18:32
I love your answer (+1!), because I can see many uses for it in other applications, but, alas, I don't think it's quite right here, for it doesn't deal with punctuation (as @theTinMan mentioned) or case. You could modify it to make it more robust, but then it would be prosaic. – Cary Swoveland Nov 19 '14 at 19:48
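
To make the caveats from these comments concrete, here is a small sketch comparing a plain union, a space-padded word, and a word-boundary pattern. The helper `matching_sentences` and its `whole_words:` keyword are invented for this illustration and are not part of the answer.

```ruby
# Hypothetical helper: returns the sentences matching any of the words.
def matching_sentences(sentences, words, whole_words: false)
  union = Regexp.union(words)
  # With whole_words, wrap the alternation in \b word boundaries.
  pattern = whole_words ? /\b(?:#{union.source})\b/ : union
  sentences.grep(pattern)
end

sentences = ['This is foo', 'This is foolish', 'foo bar']

# Plain union also matches "foolish", because it contains "foo".
p matching_sentences(sentences, ['foo'])
#=> ["This is foo", "This is foolish", "foo bar"]

# Padding with spaces misses "foo" at the start or end of a string.
p matching_sentences(sentences, [' foo '])
#=> []

# Word boundaries handle both situations.
p matching_sentences(sentences, ['foo'], whole_words: true)
#=> ["This is foo", "foo bar"]
```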

score 1 · Accepted Answer · edited May 23 '17 at 12:34

The RegexpTrie gem can build the alternation from the word list:

require 'regexp_trie'

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']

words_regex = /\b(?:#{RegexpTrie.union(words_ary, option: Regexp::IGNORECASE).source})\b/i
# => /\b(?:(?:foo|lol|something))\b/i

sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

You have to be careful how you construct your regex pattern, otherwise you can get false-positive hits. That can be a difficult bug to track down.

sentences_ary = ['This is foo', 'This is foolish', 'bob is cool', 'foo bar', 'bar foo']
words_ary = ['foo', 'lol', 'something']
words_regex = /\b(?:#{ Regexp.union(words_ary).source })\b/ # => /\b(?:foo|lol|something)\b/
sentences_ary.any?{ |s| s[words_regex] } # => true
sentences_ary.find{ |s| s[words_regex] } # => "This is foo"
sentences_ary.select{ |s| s[words_regex] } # => ["This is foo", "foo bar", "bar foo"]

The /\b(?:foo|lol|something)\b/ pattern that is generated is smart enough to look for word-boundaries, which will find words, not just sub-strings.

Also, notice the use of source. This is very important because its absence can lead to a very hard-to-locate bug. Compare these two regexps:

/#{ Regexp.union(words_ary).source }/ # => /foo|lol|something/
/#{ Regexp.union(words_ary) }/        # => /(?-mix:foo|lol|something)/

Notice how the second one has the embedded flags (?-mix:...). They change the flags for the enclosed pattern, inside the surrounding pattern. It's possible to have that inner pattern behave differently than the surrounding one resulting in a black hole sucking in results you don't expect.
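
A short sketch of how that can bite, assuming the outer pattern is meant to be case-insensitive. The helper `build_word_regex` and its `use_source:` keyword are hypothetical names used only for this comparison; `match?` needs Ruby 2.4 or later.

```ruby
# Hypothetical helper: builds the word pattern with or without .source.
def build_word_regex(words, use_source:)
  union = Regexp.union(words)
  if use_source
    /\b(?:#{union.source})\b/i   # => /\b(?:foo|lol|something)\b/i
  else
    /\b#{union}\b/i              # => /\b(?-mix:foo|lol|something)\b/i
  end
end

words_ary = ['foo', 'lol', 'something']

# The outer /i applies to the whole pattern.
p build_word_regex(words_ary, use_source: true).match?('This is FOO')   #=> true

# The embedded (?-mix:...) switches /i off again for the words themselves.
p build_word_regex(words_ary, use_source: false).match?('This is FOO')  #=> false
```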

Even the Regexp.union documentation shows the situation but doesn't mention why it can be bad:

Regexp.union(/dogs/, /cats/i)        #=> /(?-mix:dogs)|(?i-mx:cats)/

Notice that in this case, the two patterns have different flags. On our team we use union often, but I'm always careful to look at how it's being used during peer reviews. I got bitten by this once, and it was tough figuring out what was wrong, so I am very sensitive to it. Though union takes patterns, as in the example, I recommend not passing patterns and instead passing an array of words or the pattern as a string, to avoid those pesky flags sneaking in there. There's a time and place for them, but knowing about this allows us to control when they get used.
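
As a rough illustration of that recommendation, the sketch below builds the union from plain strings and applies one set of flags to the whole pattern. The helper name `case_insensitive_union` is made up here; `Regexp.new(source, Regexp::IGNORECASE)` is one of several ways to attach the flag.

```ruby
# Hypothetical helper: union of plain words with a single, explicit flag.
def case_insensitive_union(words)
  Regexp.new(Regexp.union(words).source, Regexp::IGNORECASE)
end

from_patterns = Regexp.union(/dogs/, /cats/i)  # => /(?-mix:dogs)|(?i-mx:cats)/
from_strings  = case_insensitive_union(%w[dogs cats])  # => /dogs|cats/i

# Mixed flags: only the "cats" branch ignores case.
p from_patterns.match?('CATS')  #=> true
p from_patterns.match?('DOGS')  #=> false

# One flag for the whole pattern: both branches ignore case.
p from_strings.match?('CATS')   #=> true
p from_strings.match?('DOGS')   #=> true
```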

Read through the Regexp documentation multiple times, as there's a lot to learn and it will be overwhelming the first several passes through it.

And, for extra-credit, read:

Cary Swoveland · Answer 3 · 2014-11-19T19:34:23.577

Another way:

def good_sentences(sentences_ary, words_ary)
  sentences_ary.select do |s|
    (s.downcase.gsub(/[^a-z\s]/,'').split & words_ary).any?
  end
end

For the example:

sentences_ary = ['This is foo', 'bob is cool']
words_ary = ['foo', 'lol', 'something']

good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For a mixed-case example:

words_ary = ['this', 'lol', 'something']
good_sentences(sentences_ary, words_ary)
  #=> ["This is foo"]

For a punctuation case:

sentences_ary = ['This is Foo!', 'bob is very "cool" indeed!']
words_ary = ['foo', 'lol', 'cool']
good_sentences(sentences_ary, words_ary)
  #=> ["This is Foo!", "bob is very \"cool\" indeed!"]
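
To see why this works, the expression inside good_sentences can be traced one step at a time. The helper `normalized_words` is a hypothetical name for the first half of that expression. Two limitations follow from the trace: the entries in words_ary must already be lowercase, and digits and apostrophes are stripped along with the punctuation.

```ruby
# Hypothetical helper exposing the intermediate step of good_sentences.
def normalized_words(sentence)
  # Lowercase, drop everything except a-z and whitespace, then split into words.
  sentence.downcase.gsub(/[^a-z\s]/, '').split
end

s = 'bob is very "cool" indeed!'

p normalized_words(s)
#=> ["bob", "is", "very", "cool", "indeed"]

# Array#& keeps only the words present in both arrays.
p normalized_words(s) & ['foo', 'lol', 'cool']
#=> ["cool"]

# Apostrophes are removed too, so "don't" becomes "dont".
p normalized_words("Don't stop")
#=> ["dont", "stop"]
```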
