Ruby - Matching Twitter URL from any html page using Regex

Question

I am trying to fetch the Twitter URL from this page, for instance; however, my result is nil. I am fairly sure my regex is not too bad, but my code fails. Here it is:

doc = `(curl --url "http://www.rabbitreel.com/")`
twitter_url = ("/^(?i)[http|https]+:\/\/(?i)[twitter]+\.(?i)(com)\/?\S+").match(doc)
puts twitter_url
# => nil

Maybe I misused the regex syntax. My initial idea was simple: I wanted to match a regular Twitter URL structure. I even tried http://rubular.com to test my regex, and it seemed to be fine when I entered a Twitter URL.

Eric Duminil · Accepted Answer · 2016-06-30T17:08:33.793

The documentation for String#match (http://ruby-doc.org/core-2.2.0/String.html#method-i-match) tells you that the object you call match on should be the string you are parsing, and the parameter should be the regex pattern. So, if anything, you should call:

doc.match("/^(?i)[http|https]+:\/\/(?i)[twitter]+\.(?i)(com)\/?\S+")
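To illustrate the receiver/argument order, here is a minimal sketch. The `html` string is a hypothetical stand-in for the downloaded page, not the real content of the site. It also shows that, as far as I know, a String argument is converted to a Regexp, so slashes written inside the quotes are treated as literal characters rather than delimiters.

```ruby
# Hypothetical sample input standing in for the downloaded page.
html = '<a href="https://twitter.com/rabbitreel">Twitter</a>'

# The receiver is the text to search, the argument is the pattern.
match = html.match(/https?:\/\/twitter\.com\/\w+/)
puts match[0] if match
# prints "https://twitter.com/rabbitreel"

# A String pattern is converted to a Regexp, so the surrounding
# slashes must literally appear in the text for a match to occur.
with_slashes = html.match("/twitter/")
without_slashes = html.match("twitter")
```

With this sample, `with_slashes` is nil because the text never contains `/twitter/`, while `without_slashes` is a MatchData.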

I prefer

doc[/your_regex/]

syntax, because it directly delivers a String, and not a MatchData, which needs another step to get the information out of.
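A small sketch of that difference, again using a hypothetical `html` string rather than the real page:

```ruby
# Hypothetical sample input standing in for the downloaded page.
html = '<a href="https://twitter.com/rabbitreel">Twitter</a>'
pattern = /twitter\.com\/\w+/

# String#[] with a Regexp returns the matched String (or nil).
direct = html[pattern]

# String#match returns a MatchData (or nil), which needs one more step.
data = html.match(pattern)
via_match = data && data[0]

# A second argument to String#[] selects a capture group.
handle = html[/twitter\.com\/(\w+)/, 1]

puts direct   # prints "twitter.com/rabbitreel"
puts handle   # prints "rabbitreel"
```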

For Regexen, I always try to begin as simply as possible:

[3] pry(main)> doc[/twitter/]
=> "twitter"
[4] pry(main)> doc[/twitter\.com/]
=> "twitter.com"
[5] pry(main)> doc[/twitter\.com\//]
=> "twitter.com/"
[6] pry(main)> doc[/twitter\.com\/\//] #OOPS. One \/ too many
=> nil
[7] pry(main)> doc[/twitter\.com\//]
=> "twitter.com/"
[8] pry(main)> doc[/twitter\.com\/\S+/]
=> "twitter.com/rabbitreel\""
[9] pry(main)> doc[/twitter\.com\/[^"]+/]
=> "twitter.com/rabbitreel"
[10] pry(main)> doc[/http:\/\/twitter\.com\/[^"]+/]
=> nil
[11] pry(main)> doc[/https?:\/\/twitter\.com\/[^"]+/]
=> "https://twitter.com/rabbitreel"
[12] pry(main)> doc[/https?:\/\/twitter\.com\/[^" ]+/]
=> "https://twitter.com/rabbitreel"
[13] pry(main)> doc[/https?:\/\/twitter\.com\/\w+/] #DONE
=> "https://twitter.com/rabbitreel"

EDIT: Sure, Regexen cannot parse an entire HTML document. Here, we only want to find the first occurrence of a Twitter URL. So, depending on the requirements, the possible input and the chosen platform, it could make sense to use a Regexp.

Nokogiri is a huge gem, and it might not be possible to install it.

Independently from this fact, it would be a very good idea to check that the returned String really is a correct Twitter URL.

I think this Regexp:

/https?:\/\/twitter\.com\/\w+/

is safe.

[31] pry(main)> malicious_doc = "https://twitter.com/userid@maliciouswebsite.com"
=> "https://twitter.com/userid@maliciouswebsite.com"
[32] pry(main)> malicious_doc[/https?:\/\/twitter\.com\/\w+/]
=> "https://twitter.com/userid"
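One possible way to do the check mentioned above is to validate a candidate string as a whole with an anchored pattern. This is only a sketch: the constant `TWITTER_PROFILE` and the helper `twitter_profile_url?` are hypothetical names, and the pattern assumes a profile URL consists of nothing but the host and a `\w+` handle.

```ruby
# Hypothetical helper: accepts only strings that are entirely
# "http(s)://twitter.com/<word characters>", nothing before or after.
TWITTER_PROFILE = /\Ahttps?:\/\/twitter\.com\/\w+\z/

def twitter_profile_url?(candidate)
  # to_s lets nil be handled as an empty string, which does not match.
  TWITTER_PROFILE.match?(candidate.to_s)
end

puts twitter_profile_url?("https://twitter.com/rabbitreel")
# prints "true"
puts twitter_profile_url?("https://twitter.com/userid@maliciouswebsite.com")
# prints "false"
```

The `\A` and `\z` anchors make this stricter than the extraction regex: a string with trailing characters is rejected outright instead of being truncated.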

Using Nokogiri doesn't prevent you from checking for malicious input. The proposed solution from @mudasobwa is interesting, but isn't safe yet:

[33] pry(main)> Nokogiri::HTML('<html><body><a href="http://maliciouswebsitethatisnottwitter.com/">Link</a></body></html>').css('a').map { |e| e.attributes.values.first.value }.select {|e| e =~ /twitter.com/ }
=> ["http://maliciouswebsitethatisnottwitter.com/"]
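Whichever way the href values are collected, they can be filtered by their parsed host instead of by a substring match. The sketch below uses only the standard `uri` library. `ALLOWED_HOSTS` and `twitter_link?` are hypothetical names, and treating `www.twitter.com` as acceptable is an assumption.

```ruby
require 'uri'

# Assumption: these are the only hosts we want to accept.
ALLOWED_HOSTS = %w[twitter.com www.twitter.com].freeze

# Hypothetical helper: true only if the link is http(s) and its
# host is exactly one of the allowed hosts.
def twitter_link?(href)
  uri = URI.parse(href)
  %w[http https].include?(uri.scheme) &&
    ALLOWED_HOSTS.include?(uri.host.to_s.downcase)
rescue URI::InvalidURIError
  false
end

# Sample hrefs, including the two problematic ones from this thread.
hrefs = [
  'https://twitter.com/rabbitreel',
  'http://maliciouswebsitethatisnottwitter.com/',
  'http://maliciouswebsite.com/twitter/com'
]

safe = hrefs.select { |href| twitter_link?(href) }
puts safe
# prints "https://twitter.com/rabbitreel"
```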

Thanks for the input @EricDuminil much appreciated. I like the way you go step by step ! — Eric, Jun 28 '16 at 16:08

Aleksei Matiushkin · Answer 2 · 2021-11-02T07:27:47.450


NB: as of Nov 2021, the rabbitreel.com domain is for sale, so please read the comments about the possibility of its serving malicious content.

One should never use regexps to parse HTML and here is why.

Below is a robust solution using Nokogiri HTML parsing library:

require 'nokogiri'
doc = Nokogiri::HTML(`(curl --url "http://www.rabbitreel.com/")`)
doc.css('a').map { |e| e.attributes.values.first.value }
            .select {|e| e =~ /twitter.com/ }
#⇒ [
#   [0] "https://twitter.com/rabbitreel",
#   [1] "https://twitter.com/rabbitreel"
# ]

Or, alternatively, with xpath:

require 'nokogiri'
doc = Nokogiri::HTML(`(curl --url "http://www.rabbitreel.com/")`)
doc.xpath('//a[contains(@href, "twitter.com")]')
   .map { |e| e.attributes['href'].value }

edited Nov 02 '21 at 07:27

answered Jun 28 '16 at 16:05


Thanks for your help @mudasobwa, I am gonna try the Nokogiri way too :) – Eric Jun 28 '16 at 16:09

You are free to shoot yourself in the foot with regexps; the decision is always up to you. It is worth mentioning that no sane professional Ruby developer would use regexps for that purpose, though. – Aleksei Matiushkin Jun 28 '16 at 16:11
With all due respect, you shot yourself in the foot with a Regexp. `/twitter.com/` matches `"http://maliciouswebsitethatisnottwitter.com/"` and even matches `"http://maliciouswebsite.com/twitter/com"` – Eric Duminil Jun 30 '16 at 17:11
@EricDuminil It’s good that you are aware of that, but on the given site http://www.rabbitreel.com/ there are no malicious links. – Aleksei Matiushkin Jun 30 '16 at 17:55

@AlekseiMatiushkin Hi! 5 years later : rabbitreel.com is for sale, so it could definitely serve malicious links. – Eric Duminil Nov 02 '21 at 07:22

@EricDuminil lol, I’ll update the answer. – Aleksei Matiushkin Nov 02 '21 at 07:26
