Perl regex disable parenthesis extraction

Question

I am trying something that I found in another answer, but I am running into a problem:

I know that there are better regex for URLs but consider this for example:

@links=($content =~ m/(https?)?.*[.]com/g);
* $content contains text or HTML

The (https?)? part is there to allow links like www.google.com, but because of the parentheses the match captures "http" into $1, and that is what ends up in @links! That is a problem, since I want the whole link.

What would globally extract simple links (or whatever regex is specified) from text and put them into a list?
By simple, I mean:

http://www.google.com
www.google.com
google.com
https://www.google.com

Adding `?:` right after the opening parenthesis makes the group non-capturing. Does that help? – Martin Ender, Oct 29 '12 at 00:34
Perhaps the following related topic will help: [How can I extract URL and link text from HTML in Perl?](http://stackoverflow.com/questions/254345/how-can-i-extract-url-and-link-text-from-html-in-perl) – Kenosis, Oct 29 '12 at 00:44
Perfectly! Thanks! :) I'm still open to hearing better alternatives @m.buettner – fersarr, Oct 29 '12 at 00:44
@m.buettner Make your comment an answer - I think it's correct – Bohemian, Oct 29 '12 at 00:44

score 5 · Answer 1 · answered Oct 29 '12 at 00:57

Your approach is too naive; it won't catch many other URLs. Instead, use Regexp::Common, like this:

use Regexp::Common qw/URI/;

my @links = ($content =~ /$RE{URI}/g);

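This matches HTTP, FTP, and the other URI schemes that Regexp::Common supports, and it properly handles more advanced combinations of URL parameters. Note that the HTTP pattern matches only the http scheme by default; to match HTTPS as well, use $RE{URI}{HTTP}{-scheme => 'https?'}.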

mvp

Thank you! I have used these kinds of modules before (when I tried to build some kind of crawler bot), but I just wanted to know how to disable capturing by the parentheses in general, not for URLs in particular. Thanks though – fersarr Oct 29 '12 at 01:26
@fersarr if an answer has resolved the issue you were asking about you should accept the answer by clicking the *tick* icon below the up/down vote buttons. This will mark the issue as resolved and award points to the person answering the question thus motivating other people to answer your future questions. – Matti Lyra Nov 17 '12 at 17:05

score 3 · Accepted Answer · answered Oct 29 '12 at 02:45

The non-capturing version looks like this:

m/(?:https?)?.*[.]com/g
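
To see the difference side by side, here is a minimal sketch. The sample text and the tightened pattern (`[\w.-]+` in place of the greedy `.*`, plus an optional `://`) are assumptions for illustration and are not part of the original regex. In list context with /g, a pattern that contains capturing groups returns the captured substrings, while a pattern without any returns each whole match.

```perl
use strict;
use warnings;

# Hypothetical sample text, for illustration only.
my $content = 'Try http://www.google.com, www.example.com and https://example.com today';

# Capturing group: m//g in list context returns the group's contents for
# every match, with undef wherever the optional group did not participate.
my @captured = ($content =~ m{(https?://)?[\w.-]+[.]com}g);
# @captured is ('http://', undef, 'https://')

# (?:...) still groups but captures nothing, so m//g in list context
# returns each whole match instead.
my @links = ($content =~ m{(?:https?://)?[\w.-]+[.]com}g);
# @links is ('http://www.google.com', 'www.example.com', 'https://example.com')

print "$_\n" for @links;
```

The only change between the two patterns is the `?:` after the opening parenthesis, which is what switches the result from the captured prefixes to the full links.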

For capturing links, I use this regex, derived from URI::Find:

m<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>
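
As a rough sketch of how this pattern might be applied, the snippet below precompiles it with qr<> and uses it with /g in list context. The variable names and the sample text are made up for illustration. Because the pattern has no capturing groups, each list element is a whole URL. Note that it requires an explicit http:// or https:// prefix, so a bare www.google.com is not matched.

```perl
use strict;
use warnings;

# The pattern from the answer, precompiled; $url_re is a hypothetical name.
my $url_re = qr<https?://[;/\?:\@&=+\$,\[\]A-Za-z0-9\-_.!~*'()%#]*[/\?:\@&=+\$\[A-Za-z0-9\-_!~*(%#]>;

# Hypothetical sample text, for illustration only.
my $content = 'See http://example.com/docs. Also https://example.org/a?b=1, and www.google.com';

# No capturing groups, so list-context m//g yields the full matches.
my @links = ($content =~ /$url_re/g);
# @links is ('http://example.com/docs', 'https://example.org/a?b=1')

print "$_\n" for @links;
```

The final character class omits punctuation such as `.` and `,`, which is why the sentence-ending period and the comma after the URLs are left out of the matches.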

ysth
