26

How do I extract all URLs from a plain text file in Ruby?

I tried some libraries but they fail in some cases. What's the best way?

the Tin Man
  • 158,662
  • 42
  • 215
  • 303
tapioco123
  • 3,235
  • 10
  • 36
  • 42
  • 4
    Which libraries have you tried, and in what way are they failing? – Zaz Sep 08 '10 at 06:36
  • When asking a question like this, we expect to see your attempt at solving the problem. We're happy to help fix your code, but asking us to write code for you is off-topic. Please read "[ask]" and "[mcve]". – the Tin Man Mar 01 '16 at 18:55

6 Answers

108

If you like using what's already provided for you in Ruby:

require "uri"
URI.extract("text here http://foo.example.org/bla and here mailto:test@example.com and here also.")
# => ["http://foo.example.org/bla", "mailto:test@example.com"]

Read more: http://railsapi.com/doc/ruby-v1.8/classes/URI.html#M004495
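
If you only want certain schemes (e.g. to skip the mailto: links), URI.extract also accepts a list of scheme names as its second argument. A small sketch using the same sample text (note that URI.extract has been documented as obsolete for a while, though it still works):

require "uri"
text = "text here http://foo.example.org/bla and here mailto:test@example.com and here also."
URI.extract(text, ["http", "https"])
# => ["http://foo.example.org/bla"]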

behe
  • 1,368
  • 1
  • 9
  • 5
  • 4
    It fails on text with ":" http://blog.apptamers.com/post/48613650042/uri-extract-incorrect-in-ruby-1-9-3 – Łukasz Śliwa Apr 22 '13 at 15:37
  • 16
    `URI.extract(yourString, /http(s)?|mailto/)` – titibouboul Nov 15 '13 at 17:18
  • 6
    Is there any way to extract URLs without a scheme? Like www.example.com – Samuel G. P. Jun 02 '16 at 15:27
  • Appreciate the standard lib functionality, great for most cases. Worth noting the [postrank-uri](https://github.com/postrank-labs/postrank-uri) gem also has a similar extract method `PostRank::URI.extract(text)` which seems to handle more edge cases. – odlp Oct 20 '20 at 16:38
14

I've used the twitter-text gem:

require "twitter-text"

# Mix in the gem's extractor so extract_urls is available on instances
class UrlParser
  include Twitter::Extractor
end

urls = UrlParser.new.extract_urls("http://stackoverflow.com")
puts urls.inspect
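
A quick sketch of running it over free text (this assumes an older twitter-text release where the module is Twitter::Extractor; newer versions use Twitter::TwitterText::Extractor, as the comment below notes):

text = "check http://stackoverflow.com and https://www.ruby-lang.org"
UrlParser.new.extract_urls(text)
# => ["http://stackoverflow.com", "https://www.ruby-lang.org"]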
dgo.a
  • 2,634
  • 23
  • 35
santervo
  • 534
  • 5
  • 8
  • 1
    with newer versions you need to `include Twitter::TwitterText::Extractor` instead of `include Twitter::Extractor` – Yamit May 19 '20 at 20:00
9

You can use a regex with `String#scan`:

string.scan(/(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/)

You can get started with that regex and adjust it according to your needs.
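
Note that because the whole pattern is wrapped in a capture group, scan returns an array of groups per match, so take the first element of each to get the full URL. A quick sketch with made-up sample text:

text = "see http://example.com/page?x=1 and http://foo.example.org:8080/bla"
urls = text.scan(/(https?:\/\/([-\w\.]+)+(:\d+)?(\/([\w\/_\.]*(\?\S+)?)?)?)/).map(&:first)
# => ["http://example.com/page?x=1", "http://foo.example.org:8080/bla"]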

NullUserException
  • 83,810
  • 28
  • 209
  • 234
5

What cases are failing?

According to the library regexpert, you can use

regexp = /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix

and then perform a scan on the text.

EDIT: It seems the regexp also matches the empty string. Just remove the initial (^$) and you're done.
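
A rough sketch with the (^$) removed (keep in mind that the ^ and $ anchors mean it only matches URLs that sit on a line of their own):

text = <<~TEXT
  http://example.com/foo
  not a url
  https://www.ruby-lang.org
TEXT

urls = text.scan(/(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix).map(&:first)
# => ["http://example.com/foo", "https://www.ruby-lang.org"]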

Chubas
  • 17,823
  • 4
  • 48
  • 48
  • 1
    Interesting how this regex fails when the URL is an IP address – NullUserException Sep 08 '10 at 06:52
  • 1
    Yep. I actually voted up on your answer because of the "and adjust it according to your needs". It also fails when present a user@password, or using other than https?, or any other weird situation. You probably wouldn't want to read http://tools.ietf.org/html/rfc3986 to get started -_- – Chubas Sep 08 '10 at 07:09
  • It fails as above. I am asking here precisely because I am unable to "adjust it according to your needs". – tapioco123 Sep 08 '10 at 08:09
  • Using built in ruby methods shown in other answers seems to be a much cleaner solution. This probably shouldn't be selected as the best answer. – JohnSalzarulo Oct 09 '17 at 16:45
0

If your input looks similar to this:

"http://i.imgur.com/c31IkbM.gifv;http://i.imgur.com/c31IkbM.gifvhttp://i.imgur.com/c31IkbM.gifv"

i.e. the URLs do not necessarily have whitespace around them, may be separated by arbitrary delimiters, or may have no delimiter between them at all, then you can use the following approach:

def process_images(raw_input)
  return [] if raw_input.nil?
  # Split on "http" so each element starts right after a URL boundary,
  # then drop whatever precedes the first URL.
  urls = raw_input.split('http')
  urls.shift
  # Re-attach the "http" prefix and cut each piece at the first whitespace, comma, or semicolon.
  urls.map { |url| "http#{url}".strip.split(/[\s\,\;]/)[0] }
end
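
For example (a quick check of the helper above, with made-up sample URLs):

process_images("http://i.imgur.com/a.gifv;http://i.imgur.com/b.gifvhttp://i.imgur.com/c.gifv")
# => ["http://i.imgur.com/a.gifv", "http://i.imgur.com/b.gifv", "http://i.imgur.com/c.gifv"]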

Hope it helps!

Jan Klimo
  • 4,643
  • 2
  • 36
  • 42
-2

require 'uri'

# foo is a URI::HTTP object (e.g. returned by Mechanize/Nokogiri or built from a JSON response):
# #<URI::HTTP:0x007f91c76ebad0 URL:http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg>
foo = URI.parse("http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg")
foo.to_s
# => "http://foobar/00u0u_gKHnmtWe0Jk_600x450.jpg"

edit: explanation

For those who are having problems parsing URIs out of JSON responses or from a scraping tool like Nokogiri or Mechanize, this solution worked for me.

Keon
  • 1,741
  • 3
  • 17
  • 27