
Here is a Ruby question. I need to parse an HTML file and catch URLs and emails, but I can't come up with a proper regex. I have tried 100+ regexes, and every time I catch something else along with the URL.

File.open("/Desktop/file.html").each_line do |line|
  if line.split("href=\"") =~ /???/
    puts line
  end
end

# I can use line.split("href=\"") so that each piece starts with the URL, e.g. https://www.facebook.com/students">

The question is: what regex can I use to catch everything from https to the end of the URL, which ends with a double quote (")? The same URL can appear one or more times, so I think {1,2} is needed.

Alan Moore
Alania
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Addison Oct 25 '16 at 02:53
    Use Nokogiri instead of regex for this. – lulalala Oct 25 '16 at 05:42

1 Answer


Try this

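# File.read opens the file, reads it, and closes it in one call
html = File.read('filename_path')
# scan returns one sub-array per match; flatten turns that into a flat list of URLs
links = html.scan(/href="(?<url>.*?)"/).flatten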

This gives you the links in an array.
It also works if you remove ?<url> from the pattern; that is just a named capture group.
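
For what it's worth, here is a minimal sketch of the same scan approach run on an inline sample string, so you can see what comes back. The sample HTML and the variable names (hrefs, urls, emails) are made up for illustration, and it assumes the emails you want appear as mailto: links, which the question does not actually say.

```ruby
# Hypothetical sample input standing in for the contents of the HTML file.
html = <<~HTML
  <a href="https://www.facebook.com/students">Students</a>
  <a href="https://www.facebook.com/students">Students again</a>
  <a href="mailto:someone@example.com">Mail us</a>
  <a href="/relative/path">Local</a>
HTML

# Collect every href value. The non-greedy .*? stops at the first closing quote,
# and flatten turns [["a"], ["b"]] into ["a", "b"].
hrefs = html.scan(/href="(.*?)"/).flatten

# Keep only absolute http/https URLs and drop repeated ones.
urls = hrefs.grep(%r{\Ahttps?://}).uniq

# Assumption: emails show up as mailto: links; strip the scheme and drop repeats.
emails = hrefs.grep(/\Amailto:/).map { |h| h.sub(/\Amailto:/, '') }.uniq

puts urls
puts emails
```

Because uniq removes duplicates after the scan, there is no need for a {1,2} quantifier in the pattern itself. For anything beyond simple, well-formed markup, an HTML parser such as Nokogiri (as suggested in the comments) is likely the safer choice.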

user2301346