1

I am trying something that I found in another answer, but I am having some problems.

I know that there are better regexes for URLs, but consider this one as an example:

@links=($content =~ m/(https?)?.*[.]com/g);
*$content holds text or HTML

The part (https?)? is there so that links like www.google.com still match, but because of the parentheses the match returns "http" as $1, and that is what gets put into @links! That is a problem, since I want the whole link.

What would globally extract simple links (or whatever regex is specified) from text and put them into a list?
By simple, I mean:

  • http://www.google.com
  • www.google.com
  • google.com
  • https://www.google.com
fersarr
  • adding `?:` right after the opening parenthesis will make it non-capturing. does that help? – Martin Ender Oct 29 '12 at 00:34
  • Perhaps the following related topic will help: [How can I extract URL and link text from HTML in Perl?](http://stackoverflow.com/questions/254345/how-can-i-extract-url-and-link-text-from-html-in-perl) – Kenosis Oct 29 '12 at 00:44
  • Perfectly! Thanks! :) I'm still open to hearing better alternatives @m.buettner – fersarr Oct 29 '12 at 00:44
  • @m.buettner Make your comment an answer - I think it's correct – Bohemian Oct 29 '12 at 00:44

2 Answers

5

Your approach is too naive; it will miss many other URLs. Instead, use Regexp::Common, like this:

use Regexp::Common qw/URI/;

my @links = ($content =~ /$RE{URI}/g);

This works for HTTP, HTTPS, FTP, etc., and properly captures more advanced combinations of URL parameters.

mvp
  • Thank you! I have used these kinds of modules before (when I tried to build some kind of crawler bot), but I just wanted to know how to stop the parentheses from capturing in general, not for URLs in particular. Thanks though – fersarr Oct 29 '12 at 01:26
  • @fersarr if an answer has resolved the issue you were asking about you should accept the answer by clicking the *tick* icon below the up/down vote buttons. This will mark the issue as resolved and award points to the person answering the question thus motivating other people to answer your future questions. – Matti Lyra Nov 17 '12 at 17:05
3

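The non-capturing version looks like this. With no capture groups in the pattern, a /g match in list context returns the whole matched text of each match instead of $1: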

m/(?:https?)?.*[.]com/g
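
To make the difference concrete, here is a minimal sketch that runs both patterns on the same input. The sample text and the variable names (@captured, @whole, @simple) are made up for illustration, and the last pattern is only a rough guess at what "simple links" could mean (optional scheme, optional www., one .com label), not a general URL matcher:

```perl
use strict;
use warnings;

# Hypothetical sample input: one link per line.
my $content = "http://www.google.com\nwww.google.com\n";

# Capturing group: in list context each match contributes $1,
# which is undef when the optional group did not participate.
my @captured = ($content =~ m/(https?)?.*[.]com/g);

# Non-capturing group: each match contributes the whole matched text.
my @whole = ($content =~ m/(?:https?)?.*[.]com/g);

print join(', ', map { defined($_) ? $_ : 'undef' } @captured), "\n";
print join(', ', @whole), "\n";

# Assumption: a tighter pattern for the "simple" forms in the question.
# All groups are non-capturing, so the list holds whole matches.
my $text   = 'Visit http://www.google.com, www.google.com, google.com or https://www.google.com today.';
my @simple = ($text =~ m{(?:https?://)?(?:www[.])?[A-Za-z0-9-]+[.]com}g);

print "$_\n" for @simple;
```

Note that the original .* is greedy and matches anything except a newline, so with several links on one line it would swallow everything up to the last ".com" on that line; the tighter pattern avoids that by restricting which characters may appear before ".com".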

For capturing links, I use this regex, derived from URI::Find:

m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>
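
As a usage sketch, the same pattern can be stored with qr and applied with /g in list context. The sample text and the names $url_re, $text and @urls are assumptions for illustration only. The pattern requires an explicit http:// or https:// scheme, so bare forms like www.google.com are not matched by it:

```perl
use strict;
use warnings;

# The regex from above, compiled once. It has no capture groups,
# so a /g match in list context returns whole URLs.
my $url_re = qr<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>;

# Hypothetical sample text with punctuation right after each URL.
my $text = 'Docs at http://example.com/a?b=1, and https://example.org/path.';

my @urls = ($text =~ /$url_re/g);

print "$_\n" for @urls;
```

The final character class leaves out characters such as "," and ".", so punctuation that directly follows a URL in running text is not included in the match.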
ysth