-5
>> "<img src=\"https://filin.mail.ru/pic?width=90&amp;height=90&amp;email=multicc%40multicc.mail.ru&amp;version=4&amp;build=7\" style="">".match(Regexp.new("<a href=\"http(s?):\/\/(?:\w+\.)+\w{1,5}.+?\">|<img src=\"http(s?):\/\/(?:\w+\.)+\w{1,5}.+?\"(?: style=\".+\")?>"))
=> nil

But testing in Rubular says it should match:

link

I can't understand why Rubular reports that this string matches when the same call in Ruby returns nil.

Joe Half Face

3 Answers

2

Regex is the wrong tool for handling HTML (or XML) 99.9% of the time. Instead, use a parser, like Nokogiri:

require 'nokogiri'

html = '<img src="https://filin.mail.ru/pic?width=90&amp;height=90&amp;email=multicc%40multicc.mail.ru&amp;version=4&amp;build=7" style="">'
doc = Nokogiri::HTML(html)

url = doc.at('img')['src'] # => "https://filin.mail.ru/pic?width=90&height=90&email=multicc%40multicc.mail.ru&version=4&build=7"
doc.at('img')['style'] # => ""

Once you've retrieved the data you want, such as the src, use another "right" tool, such as URI:

require 'uri'

scheme, userinfo, host, port, registry, path, opaque, query, fragment = URI.split(url)
scheme    # => "https"
userinfo  # => nil
host      # => "filin.mail.ru"
port      # => nil
registry  # => nil
path      # => "/pic"
opaque    # => nil
query     # => "width=90&height=90&email=multicc%40multicc.mail.ru&version=4&build=7"
fragment  # => nil

query_parts = Hash[URI.decode_www_form(query)]
query_parts # => {"width"=>"90", "height"=>"90", "email"=>"multicc@multicc.mail.ru", "version"=>"4", "build"=>"7"}
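
If the point of the original pattern was only to confirm that the src value is an http or https URL with a host, URI can make that check too. Here is a minimal sketch, assuming a hypothetical helper named http_url? (it is not part of URI or Nokogiri):

```ruby
require 'uri'

# Hypothetical helper: true only for parseable http/https URLs that include a host.
def http_url?(candidate)
  uri = URI.parse(candidate)
  # URI::HTTPS inherits from URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  # Strings that URI cannot parse at all are treated as "not a URL".
  false
end

p http_url?('https://filin.mail.ru/pic?width=90&height=90') # true
p http_url?('ftp://filin.mail.ru/pic')                      # false
p http_url?('not a url')                                    # false
```

This keeps the HTML handling in the parser and the URL handling in URI, so no hand-written pattern has to describe what a host name looks like.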
the Tin Man
0

It works fine when you call match on a Regexp literal, although I strongly recommend that you don't use a regex to parse HTML.

str = '<img src="https://filin.mail.ru/pic?width=90&amp;height=90&amp;email=multicc%40multicc.mail.ru&amp;version=4&amp;build=7" style="">'

matchData = /<img src="http(?:s?):\/\/(?:\w+\.)+\w{1,5}.+?"(?: style=".+")?>/.match(str)

p matchData[0] # => "<img src=\"https://filin.mail.ru/pic?width=90&amp;height=90&amp;email=multicc%40multicc.mail.ru&amp;version=4&amp;build=7\" style=\"\">"
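
A likely reason the literal works where Regexp.new with a double-quoted string does not: Ruby processes backslash escapes in a double-quoted string before Regexp.new ever sees it, so \w and \. reach the regex engine as a plain w and an unescaped dot. The sketch below illustrates this with a shortened version of the URL from the question; the variable names are only for illustration.

```ruby
# Double-quoted strings drop the backslash from unknown escapes such as \w and \.
double_quoted = "(?:\w+\.)+\w{1,5}"
single_quoted = '(?:\w+\.)+\w{1,5}'

puts double_quoted # (?:w+.)+w{1,5}
puts single_quoted # (?:\w+\.)+\w{1,5}

html = '<img src="https://filin.mail.ru/pic?width=90" style="">'

# Built from a double-quoted string: the host part now requires literal "w" characters.
broken = Regexp.new("<img src=\"https?:\/\/(?:\w+\.)+\w{1,5}.+?\"(?: style=\".+\")?>")
# Built from a single-quoted string: the backslashes survive.
working = Regexp.new('<img src="https?://(?:\w+\.)+\w{1,5}.+?"(?: style=".+")?>')

p broken.match(html)     # nil
p working.match(html)[0] == html # true
```

Rubular takes the pattern as a regex literal, so it never goes through this string-escaping step, which would explain the different result there.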
Rob Wagner
0

This works for me:

'<img src="https://filin.mail.ru/pic?width=90&amp;height=90&amp;email=multicc%40multicc.mail.ru&amp;version=4&amp;build=7" style="">'.match(/<img src="https?:\/\/(?:\w+\.)+\w{1,5}.+?"(?: style=".+")?>/)

I'm not sure exactly why yours doesn't work, although I do notice you forgot to escape the last two double quotes in the string being matched. I used single quotes to avoid that issue.
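
To show what the unescaped quotes do, here is a small sketch using a shortened version of the URL from the question. Ruby concatenates adjacent string literals, so the unescaped "" near the end is read as a separate empty string and the style value silently disappears from the subject.

```ruby
# Parsed as three adjacent literals: "...style=" + "" + ">"
unescaped = "<img src=\"https://filin.mail.ru/pic?width=90\" style="">"
# Single quotes keep the inner double quotes without any escaping.
single = '<img src="https://filin.mail.ru/pic?width=90" style="">'

puts unescaped # <img src="https://filin.mail.ru/pic?width=90" style=>
puts single    # <img src="https://filin.mail.ru/pic?width=90" style="">

pattern = /<img src="https?:\/\/(?:\w+\.)+\w{1,5}.+?"(?: style=".+")?>/

p pattern.match?(unescaped) # false
p pattern.match?(single)    # true
```

With the quotes lost, even a correctly written pattern has no closing quote followed by > to match, so quoting the subject with single quotes (or %q) is worth doing alongside any fix to the pattern.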

Doydle