I'm trying to write a regular expression in Ruby to check whether a string contains a certain word (e.g. "string"), followed by a URL and link name in parentheses.

Right now I'm doing:

string.include?("string") && string.scan(/\(([^\)]+)\)/).present?

My input in both conditions is a string. In the first one, I'm checking whether it contains the word "link"; the link and link_name then follow in parentheses, like this:

"Please go to link( url link_name)"

After validating that, I extract the HTML link.

Is there a way I can combine them using regular expressions?

Michael Gaskill
learningruby347
  • Could you provide an example of input and how you would imagine the output based on those? – Rion Williams May 11 '16 at 18:42
  • @RionWilliams My input in both conditionals is basically a string. In the first one, I'm checking if it contains the word "link" and then I will have stuff between a pair of parenthesis. A link and link_name will be stored in the parenthesis. E.g "Please go to link( url link_name)" And then I'm doing some string manipulation to change that to an html link. – learningruby347 May 11 '16 at 18:49

4 Answers

The most important improvement you can make is to also test that the word and the parentheses have the correct relationship. If I understand correctly, "link(url link_name)" should be a match but "(url link_name)link" or "link stuff (url link_name)" should not. So match "link", the parentheses, and their contents, and capture the contents, all at once:

"stuff link(url link_name) more stuff".match(/link\((\S+?) (\S+?)\)/)&.captures
=> ["url", "link_name"]

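(&. is the safe navigation operator, added in Ruby 2.3: it returns nil instead of raising when match finds nothing. In older versions, use Rails' .try(:captures).)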
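A short sketch of what &. does in the match and no-match cases, reusing the pattern above. The variable names (pattern, hit, miss, legacy) are only for illustration.

```ruby
# Pattern from the answer: "link" immediately followed by "(url link_name)".
pattern = /link\((\S+?) (\S+?)\)/

# On a match, &. calls captures on the MatchData as usual.
hit = "stuff link(url link_name) more stuff".match(pattern)&.captures

# On no match, match returns nil and &. short-circuits to nil
# instead of raising NoMethodError.
miss = "stuff (url link_name) link more stuff".match(pattern)&.captures

# Equivalent without &. (works before Ruby 2.3 and without Rails).
m = "stuff link(url link_name) more stuff".match(pattern)
legacy = m ? m.captures : nil

p hit     # ["url", "link_name"]
p miss    # nil
p legacy  # ["url", "link_name"]
```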

Side note: string.scan(regex).present? is more concisely written as string =~ regex.
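To make the side note concrete, here is a minimal sketch comparing the two forms of the test. It also shows String#match? (Ruby 2.4+), which is not mentioned above but returns a plain true/false; the sample strings are made up for illustration.

```ruby
regex = /\(([^\)]+)\)/
with_parens = "Please go to link(url link_name)"
without_parens = "Please go to link"

# =~ returns the index of the first match (truthy in Ruby, even when 0)
# or nil when there is no match.
found = with_parens =~ regex
missing = without_parens =~ regex

# match? returns true or false and does not set $~.
strict = with_parens.match?(regex)

puts "found: #{found.inspect}, missing: #{missing.inspect}, strict: #{strict}"
```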

Community
Dave Schweisguth
  • If there is no match, you obtain `nil.captures`, which raises an exception. Also, your regex requires the "certain word" to precede the parenthetical expression, but the order is not part of the specification: `"stuff (url link_name) link more stuff".match(/link\((\S+?) (\S+?)\)/) #=> nil`. – Cary Swoveland May 12 '16 at 17:22
  • Exception avoided in one character. I deduced the specification from learningruby347's comment on their question and they seem to agree, but, learningruby347, if I misunderstood just say the word. – Dave Schweisguth May 12 '16 at 17:42
  • Nice fix (!), but you should explain `&.` (at least a link) for newbies and those not on top of recent changes to Ruby. (I was going to suggest using `scan` instead of `match`, but then you'd have to use `flatten`, resulting in the same problem when there's no match. ¯\_(ツ)_/¯ ) – Cary Swoveland May 12 '16 at 17:54

Checking If a Word Is Contained

If you want to find matches that contain a specific word somewhere in the string, you can accomplish this through a lookahead:

# This will match any string that contains your string "{your-string-here}"
(?=.*({your-string-here}).*).*

You could build the expression by interpolating the word you are looking for from a variable:

wordToFind = "link"
if stringToTest =~ /(?=.*(#{wordToFind}).*).*/
    # stringToTest contains "link"
else
    # stringToTest does not contain "link"
end
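One caveat when interpolating a variable into a regex: any regex metacharacters in the word are interpreted as such. A small sketch, assuming the word might contain characters like "." (the sample word "v1.0" is made up), using Regexp.escape to match it literally:

```ruby
word = "v1.0"

# Interpolated as-is, the "." matches any character.
loose = /#{word}/
# Regexp.escape turns the word into a literal pattern.
exact = /#{Regexp.escape(word)}/

loose_hit = "v100" =~ loose           # matches: "." stands in for the first "0"
exact_miss = "v100" =~ exact          # nil: a literal "." is required
exact_hit = "release v1.0" =~ exact   # index where "v1.0" starts

p loose_hit   # 0
p exact_miss  # nil
p exact_hit   # 8
```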

Checking for a Word AND Parentheses

If you also want to ensure that the string contains a set of parentheses with some content in them, in addition to the lookahead for your word, you could use:

# This will match any strings that contain your word and contain a set of parentheses 
(?=.*({your-string-here}).*).*\([^\)]+\).*

which might be used as:

wordToFind = "link"
if stringToTest =~ /(?=.*(#{wordToFind}).*).*\([^\)]+\).*/
    # stringToTest contains "link" and some non-empty parentheses
else
    # stringToTest does not contain "link" or non-empty parentheses
end
Rion Williams
def has_both?(str, word)
  str.scan(/\b#{word}\b|(?<=\()[^\(\)]+(?=\))/).size == 2
end

has_both?("Wait for me, Wild Bill.", "Bill")
  #=> false 
has_both?("Wait (for me), Wild William.", "Bill")
  #=> false 
has_both?("Wait (for me), Wild Billy.", "Bill")
  #=> false 
has_both?("Wait (for me), Wild bill.", "Bill")
  #=> false 
has_both?("Wait (for me, Wild Bill.", "Bill")
  #=> false 
has_both?("Wait (for me), Wild Bill.", "Bill")
  #=> true 
has_both?("Wait ((for me), Wild Bill.", "Bill")
  #=> true 
has_both?("Wait ((for me)), Wild Bill.", "Bill")
  #=> true 

Here are the steps for:

word = "Bill"
str = "Wait (for me), Wild Bill."

r = /
    \b#{word}\b  # match the value of the variable 'word' with word breaks fore and aft
    |         # or
    (?<=\()   # match a left paren in a positive lookbehind
    [^\(\)]+  # match one or more characters other than parens
    (?=\))    # match a right paren in a positive lookahead
    /x        # free-spacing regex definition mode
  #=> /
      \bBill\b  # match the value of the variable 'word' with word breaks fore and aft
      |         # or
      (?<=\()   # match a left paren in a positive lookbehind
      [^\(\)]+  # match one or more characters other than parens
      (?=\))    # match a right paren in a positive lookahead
      /x 

arr = str.scan(r)
  #=> ["for me", "Bill"]
arr.size == 2
  #=> true
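Because has_both? counts matches, a string where the word or a parenthesized group occurs more than once yields a size other than 2. A hedged variation for the "at least once each" reading of the question (has_each? is a hypothetical name, not part of this answer, and it gives up the single-regex form):

```ruby
# Original method, repeated here so the sketch is self-contained.
def has_both?(str, word)
  str.scan(/\b#{word}\b|(?<=\()[^\(\)]+(?=\))/).size == 2
end

# Variation: require at least one occurrence of each, using two tests.
def has_each?(str, word)
  str.match?(/\b#{Regexp.escape(word)}\b/) && str.match?(/\([^\(\)]+\)/)
end

repeated = "Bill, wait (for me), Bill."
p has_both?(repeated, "Bill")                    # false (three matches)
p has_each?(repeated, "Bill")                    # true
p has_each?("Wait (for me, Wild Bill.", "Bill")  # false (no closing paren)
```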
Cary Swoveland

I would go with something like this regex:

/link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i

This will find any match starting with the word link, followed by any amount of whitespace, then a URL and a link name together inside one pair of parentheses. In this regex, the link name is optional, but the URL is not. The matching is case-insensitive, so it will match link and LINK exactly the same.

You can use the Regexp#match method to compare the regex to a string, and check the result for matches and captures, like so:

m = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i.match("link (stackoverflow.com StackOverflow)")
if m  # m is a MatchData object, so a match was found
  puts "Matched: #{m[0]}"
  puts " -- url: #{m[1]}"
  puts " -- link-name: #{m[2] || 'none'}"
else  # m is nil, so no match was found
  puts "No match found"
end
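Since the question goes on to turn the match into an HTML link, here is one possible sketch of that step using the same regex with gsub. The linkify name, the fallback of using the URL as the link text, and the use of CGI.escapeHTML are assumptions for illustration, not something the question specifies.

```ruby
require "cgi"

# Replaces every "link(url name)" occurrence with an HTML anchor.
def linkify(text)
  pattern = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i
  text.gsub(pattern) do
    url = Regexp.last_match(1)
    name = Regexp.last_match(2) || url  # assumed fallback when no name is given
    %(<a href="#{CGI.escapeHTML(url)}">#{CGI.escapeHTML(name)}</a>)
  end
end

puts linkify("Please go to link(http://example.com Example)")
# Please go to <a href="http://example.com">Example</a>
puts linkify("see link(ftp://website.org)")
# see <a href="ftp://website.org">ftp://website.org</a>
```

Strings without a match pass through gsub unchanged, so no separate validation step is needed.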

If you'd like to use different strings to identify the match, you can use a non-capturing group, where you change link to something like:

(?:link|site|website|url)

In this case, the (?: syntax says not to capture this part of the match. If you want to capture which term matched, simply change that from (?: to (, and adjust the capture indexes by 1 to account for the new capture value.
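A brief sketch of the capturing variant described above, showing how the indexes shift by one. The sample input is made up for illustration.

```ruby
# Capturing group for the keyword, so the URL moves to index 2
# and the link name to index 3.
pattern = /(link|site|website|url)\s*\(([^\)\s]+)\s*([^\)]+)?\)/i

m = pattern.match("Visit site (http://example.com Example)")
keyword = m[1]
url = m[2]
link_name = m[3]

puts "keyword: #{keyword}, url: #{url}, link_name: #{link_name}"
# keyword: site, url: http://example.com, link_name: Example
```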

Here's a short Ruby test program:

data = [
  [ true, "link (http://google.com Google)", "http://google.com", "Google" ],
  [ true, "LiNk(ftp://website.org)", "ftp://website.org", nil ],
  [ true, "link   (https://facebook.com/realstanlee/ Stan Lee) linkety link", "https://facebook.com/realstanlee/", "Stan Lee" ],
  [ true, "x  link (https://mail.yahoo.com Yahoo! Mail)", "https://mail.yahoo.com", "Yahoo! Mail" ],
  [ false, "link lunk (http://www.com)", nil, nil ]
]

data.each do |test_case|
  link = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i.match(test_case[1])
  url = link ? link[1] : nil
  link_name = link ? link[2] : nil
  success = test_case[0] == !link.nil?  && test_case[2] == url && test_case[3] == link_name
  puts "#{success ? 'Pass' : 'Fail'}: '#{test_case[1]}' #{link ? 'found' : 'not found'}"
  if success && link
    puts " -- url: '#{url}' link_name: '#{link_name || '(no link name)'}'"
  end
end

This produces the following output:

Pass: 'link (http://google.com Google)' found
 -- url: 'http://google.com' link_name: 'Google'
Pass: 'LiNk(ftp://website.org)' found
 -- url: 'ftp://website.org' link_name: '(no link name)'
Pass: 'link   (https://facebook.com/realstanlee/ Stan Lee) linkety link' found
 -- url: 'https://facebook.com/realstanlee/' link_name: 'Stan Lee'
Pass: 'x  link (https://mail.yahoo.com Yahoo! Mail)' found
 -- url: 'https://mail.yahoo.com' link_name: 'Yahoo! Mail'
Pass: 'link lunk (http://www.com)' not found

If you want to allow any characters, not just whitespace, between the word 'link' and the opening paren, simply change the \s* to [^\(]* and you should be good to go.
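A quick sketch of the effect of that change on the failing test case from the program above, plus one caveat: the looser pattern will also pair "link" with a parenthetical that appears much later in the string. The second input is made up to show this.

```ruby
strict = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i
loose  = /link[^\(]*\(([^\)\s]+)\s*([^\)]+)?\)/i

input = "link lunk (http://www.com)"
strict_match = strict.match(input)  # nil: "lunk" sits between link and "("
loose_match = loose.match(input)    # matches, URL is captured

# Caveat: anything up to the next "(" is skipped, even unrelated text.
far = loose.match("No link here. Later (an aside)")

p strict_match    # nil
p loose_match[1]  # "http://www.com"
p far.captures    # ["an", "aside"]
```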

Michael Gaskill