I need a good regex for HTML file parsing in ruby

Question

Here is a Ruby question. I need to parse an HTML file and extract URLs and email addresses, but I can't come up with a proper regular expression. I have tried 100+ regexes, and every time I catch something else along with the URL.

File.open("/Desktop/file.html").each_line do |line|
  if line.split("href=\"") =~ /???/
    puts line
  end
end

# I can use line.split("href=\"") so that each piece after the first starts with the URL, e.g. https://www.facebook.com/students">

The question is: what regex can I use to catch everything from https to the end of the URL, which ends with a double quote (")? The same URL may appear more than once, so something like {1,2} might be needed.

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Addison, Oct 25 '16 at 02:53

score 0 · Answer 1 · answered Oct 25 '16 at 09:38


Try this

html = File.read('filename_path')
# scan returns one sub-array per match when the pattern has a group, hence flatten
links = html.scan(/href="(?<url>.*?)"/).flatten

You get the links in an array.
It also works if you remove ?<url> from the pattern above (it only names the capture group).
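
To narrow this down to what the question asks for, here is a rough sketch that keeps only http(s) links and pulls email addresses out of mailto: links. The sample HTML and the variable names are made up for illustration, and it assumes every attribute uses double quotes and that emails only appear as mailto: hrefs. Repeated URLs are collapsed with uniq rather than with a {1,2} quantifier.

```ruby
# Hypothetical sample input; in practice this would come from File.read(path).
html = <<~HTML
  <a href="https://www.facebook.com/students">Students</a>
  <a href="https://www.facebook.com/students">Students again</a>
  <a href="mailto:info@example.com">Mail us</a>
  <a href="/relative/path">Local</a>
HTML

# [^"]+ stops at the closing quote, so nothing after the URL is captured.
urls = html.scan(/href="(https?:\/\/[^"]+)"/).flatten.uniq

# Assumes emails appear as mailto: links; [^"?]+ also drops any ?subject=... part.
emails = html.scan(/href="mailto:([^"?]+)"/).flatten.uniq

puts urls
puts emails
```

For anything beyond simple, consistently quoted markup, an HTML parser is likely to be more reliable than a regex, as the linked duplicate points out.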


user2301346
