Common-sense URL extraction in Ruby

Question

Consider a URL embedded in plain text, such as http://example.com/. Stack Overflow is smart enough to know that I didn't mean to include the period at the end as part of the URL, even though . is an unreserved character according to RFC 3986.

Likewise, if I type http://example.org/, Stack Overflow is smart enough to know that I didn't mean to include the comma, even though the comma, as a member of the sub-delims class, is a valid path character.

Ruby's URI.extract(), as suggested in two highly voted answers, is not as smart as Stack Overflow.

2.2.5 :002 > URI.extract('...such as http://example.com/.')
 => ["http://example.com/."] 
2.2.5 :003 > URI.extract('Likewise, if I type http://example.org/, Stack Overflow...')
 => ["http://example.org/,"]

Is there a smarter alternative?

A URI can legitimately contain both `,` and `.`, which is probably why `extract` includes them. To customize the behavior, you could pass `extract` a block and check each extracted string for any characters that you would consider invalid. See http://ruby-doc.org/stdlib-2.0.0/libdoc/uri/rdoc/URI.html#method-c-extract-label-Synopsis and https://stackoverflow.com/questions/7109143/what-characters-are-valid-in-a-url for details. It is annoying that you have to consider doing anything unusual here, though. — Charlie L, Aug 01 '17 at 22:42
Yeah, I read the RFC. I'm more looking for somebody who's already thought through the common-sense use cases, so I'm not playing whack-a-mole. — David Moles, Aug 01 '17 at 22:44
I don't know but +1 on the question. I'm sure there's a well thought out regex that will solve the problem in combination with `URI.extract(..., &block)`. I haven't thought it out very well though — m. simon borg, Aug 02 '17 at 00:27
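
The idea floated in the comments above (post-process what `URI.extract` returns and drop characters that are almost certainly sentence punctuation) might look roughly like the sketch below. It is only an illustration, not a vetted solution: the helper names `trim_trailing_punctuation` and `extract_urls` and the punctuation list are assumptions made up for this example, and the rule for closing parentheses is a simple heuristic that will not cover every case.

```ruby
require 'uri'

# Hypothetical list: characters that URI.extract accepts as part of a URL
# but that, at the very end of a URL embedded in prose, are far more likely
# to be sentence punctuation. Adjust to taste.
TRAILING_PUNCTUATION = /[.,;:!?']+\z/

# Hypothetical helper: repeatedly strip trailing punctuation from one URL.
def trim_trailing_punctuation(url)
  trimmed = url.dup
  loop do
    before = trimmed.dup
    trimmed.sub!(TRAILING_PUNCTUATION, '')
    # Drop a closing parenthesis only when it has no matching opener, so a
    # URL such as .../wiki/Ruby_(programming_language) keeps its last ")".
    if trimmed.end_with?(')') && trimmed.count(')') > trimmed.count('(')
      trimmed.chop!
    end
    break if trimmed == before
  end
  trimmed
end

# Hypothetical wrapper: extract as usual, then clean up each candidate.
def extract_urls(text, schemes = nil)
  URI.extract(text, schemes).map { |url| trim_trailing_punctuation(url) }
end

p extract_urls('...such as http://example.com/.')
p extract_urls('Likewise, if I type http://example.org/, Stack Overflow...')
```

For the two inputs from the question, this prints `["http://example.com/"]` and `["http://example.org/"]`. The obvious trade-off is that a URL that really does end in one of the listed characters gets truncated, so this is still a form of whack-a-mole, just with the moles collected in one place. Note also that newer Ruby versions mark `URI.extract` as obsolete and warn about it in verbose mode.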

0 Answers