URL detection in a string

Question

Possible Duplicate:
Linkify Regex Function PHP Daring Fireball Method

I'm trying to get URLs from a string, and I have this:

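// $post holds the text to scan; stripos() skips the regex when "http" is absent.
if (stripos($post, 'http') !== false) {
    preg_match_all('#https?://[^.\s]+\.[^\s]+#i', $post, $matches);
    foreach ($matches[0] as $url) {
        // ... handle each matched URL
    }
}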

The problem is that I also want to detect URLs like these: http://www.link.com, www.link.com, or link.com.

PS: I've searched for similar questions on here, but I haven't seen one that addresses all of those types of URLs.

Thank you.

Detecting link.com will be quite difficult (you may get results like "someword.It"); the other two are easy. – Oussama Jilal, Jan 06 '13 at 14:49
If they don't have a protocol prefix, then technically they're not links, but hostnames. You need to decide on constraints, as matching anything like `\w+\.\S+` will lead to false positives. – mario, Jan 06 '13 at 14:51

score 2 · Answer 1 · answered Jan 06 '13 at 15:11

Try

/\b((http(s)?:\/\/)?(www\.[a-zA-Z0-9\/\\\:\?\%\.\&\;=#\-\_\!\+\~\,]*))/is

And as mario said, a link without a protocol prefix is technically not a link.
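
To see what this pattern picks up, here is a minimal sketch. The sample sentence is an assumption, put together from the three URL forms in the question, and the pattern is simply the one above written as a single-quoted PHP string.

```php
<?php
// Pattern from the answer above, as a single-quoted PHP string.
$pattern = '/\b((http(s)?:\/\/)?(www\.[a-zA-Z0-9\/\\\:\?\%\.\&\;=#\-\_\!\+\~\,]*))/is';

// Sample text is an assumption, built from the three forms named in the question.
$text = 'Visit http://www.link.com or www.link.com or link.com today';

preg_match_all($pattern, $text, $matches);
$urls = $matches[0]; // index 0 holds the full matches

print_r($urls);
```

Because the pattern requires a literal `www.`, this should return http://www.link.com and www.link.com but skip the bare link.com.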

Tommy Adey

You don't need to escape all those characters. I took the liberty of cleaning it up for you, have a look here: http://regex101.com/r/yL5hA6 – Firas Dib Jan 06 '13 at 15:25

Oussama Jilal · Accepted Answer · 2013-01-06T15:55:56.417

1

Try this regular expression:

#(https?://)?([a-z0-9-]+\.)+[a-z0-9]+/?#i
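
A rough way to check this pattern is sketched below. Both sample strings are assumptions made up for illustration; the second one reuses the "someword.It" case from the comments on the question.

```php
<?php
// Pattern from the accepted answer above.
$pattern = '#(https?://)?([a-z0-9-]+\.)+[a-z0-9]+/?#i';

// Both sample strings are assumptions made for illustration.
$text  = 'Visit http://www.link.com or www.link.com or link.com today';
$prose = 'I finished someword.It was fine';

preg_match_all($pattern, $text, $found);
preg_match_all($pattern, $prose, $falsePositive);

print_r($found[0]);         // all three URL forms from the question
print_r($falsePositive[0]); // ordinary prose mistaken for a hostname
```

The first call should find all three forms, while the second shows the trade-off: a missing space after a full stop is enough to produce a match. Note also that the pattern stops after the hostname and an optional trailing slash, so paths and query strings are not captured.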

edited Jan 06 '13 at 15:55

answered Jan 06 '13 at 14:52

Oussama Jilal

You don't really need the `(www\.)?`, as it will be covered by `([a-z0-9-]+\.)+`, right? Also, don't set a limit on the length of the TLD: they are ever changing, and a limit of 4 will block .travel and .museum. IANA also has plans to let companies purchase their own TLDs, so there could be a .google, for instance, in the future. Checking for the first occurrence of a word boundary should be fine. – kittycat Jan 06 '13 at 15:01
Yes, you are right about the www., but about the TLD thing, I didn't know that they plan on doing what you said, thanks. – Oussama Jilal Jan 06 '13 at 15:04
https://en.wikipedia.org/wiki/Generic_top-level_domain#June_20.2C_2011_vote_on_expansion_of_gTLDs looks like many will go live this year. Here is more info about ones that have been applied for https://www.pcworld.com/article/257430/the_top_10_proposed_new_top_level_domains_so_far.html I would definitely change your regex as it will fail on many domains once this year is over. – kittycat Jan 06 '13 at 15:10
Oh wow, @cryptic do you have any idea when they'll go live? – Tommy Adey Jan 06 '13 at 15:14
@TommyAdey apparently they will begin this year, but don't know which of the 2,000 applications will be approved, and that is just the first round I expect. I don't like the whole idea of it, I think IANA is using their authority to profit, and in the process the Internet will become even more branded and commercialized. =o\ – kittycat Jan 06 '13 at 15:16
So technically, it won't be available for everyone to buy just yet? – Tommy Adey Jan 06 '13 at 15:18
@TommyAdey, no, you need deep pockets to purchase a gTLD (generic top-level domain); it's tens of thousands of dollars a year. – kittycat Jan 06 '13 at 15:19
Darn it! I was hoping to cash in a bit on domain flipping :( – Tommy Adey Jan 06 '13 at 15:21
Thanks, it worked. Should I remove the {2,4} though, considering what cryptic is saying? – Saff Jan 06 '13 at 15:23
OK, I edited the regex based on what cryptic said. – Oussama Jilal Jan 06 '13 at 15:56
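
The TLD-length point from the comments can be illustrated with a small sketch. Both patterns below are hypothetical, anchored whole-hostname checks written for this comparison; they are not the exact expressions from either answer, and the hostnames are made-up examples.

```php
<?php
// Hypothetical whole-hostname checks, not taken from the answers above.
$limited   = '#^([a-z0-9-]+\.)+[a-z]{2,4}$#i'; // TLD capped at 2 to 4 letters
$unlimited = '#^([a-z0-9-]+\.)+[a-z0-9]+$#i';  // no cap on TLD length

// Example hostnames are assumptions chosen to show short and long TLDs.
$hosts = ['link.com', 'example.museum', 'example.travel'];

$results = [];
foreach ($hosts as $host) {
    $results[$host] = [
        'limited'   => preg_match($limited, $host) === 1,
        'unlimited' => preg_match($unlimited, $host) === 1,
    ];
}

print_r($results);
```

Under these assumptions, the capped pattern accepts link.com but rejects the .museum and .travel hostnames, while the uncapped one accepts all three.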
