I have a string that contains URLs and other text. I want to get all the URLs into the $matches array, but the following code does not capture all of them:

$matches = array();
$text = "soundfly.us schoollife.edu hello.net some random news.yahoo.com text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988 and others will en.wikipedia.org/wiki/Country_music URL";
preg_match_all('$\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]$i', $text, $matches);
print_r($matches);

The above code returns:

http://tinyurl.com/9uxdwc
http://google.com
http://tinyurl.com/787988

but misses the following 4 URLs:

schoollife.edu 
hello.net 
news.yahoo.com
en.wikipedia.org/wiki/Country_music

Can you please show me, with an example, how I can modify the above code to get all the URLs?

Justin k
  • Your regex forces an http/https/ftp/file protocol to be specified. Make it optional. – sevenseacat Apr 27 '13 at 08:11
  • @sevenseacat I was having a similar problem too. Can you please show an example with the modified regex? – Learner_51 Apr 27 '13 at 08:45
  • @RakeshSharma Thanks a lot for the reply. But your answer will output `some` and `random` as valid matches too, whereas the original question is trying to get only the URLs in that string (the $matches array should only hold web addresses). – Learner_51 Apr 27 '13 at 09:03
  • See my updated answer; I hope this is what you are looking for. – Rakesh Sharma Apr 27 '13 at 09:32
  • You will need to match domain names (hostnames) and also IP addresses. Perhaps you can refer to [this question](http://stackoverflow.com/q/106179/1386111) (you need to remove the beginning and ending anchors). The hostname may also be followed by a file path. – Alvin Wong Apr 27 '13 at 09:36

1 Answer

Is this what you need?

$matches = array();
$text = "soundfly.us schoollife.edu hello.net some random news.yahoo.com text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988 and others will en.wikipedia.org/wiki/Country_music URL";
preg_match_all('$\b((https?|ftp|file)://)?[-A-Z0-9+&@#/%?=~_|!:,.;]*\.[-A-Z0-9+&@#/%=~_|]+$i', $text, $matches);
print_r($matches);

I made the protocol part optional, added a literal dot separating the domain from the TLD, and added a "+" to capture the full string after that dot (the TLD plus any extra information, such as a path).

The result (the full matches, found in `$matches[0]`) is:

[0] => soundfly.us 
[1] => schoollife.edu 
[2] => hello.net 
[3] => news.yahoo.com 
[4] => http://tinyurl.com/9uxdwc 
[5] => http://google.com 
[6] => http://tinyurl.com/787988 
[7] => en.wikipedia.org/wiki/Country_music
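
Because the pattern above contains capturing groups, `print_r($matches)` prints a nested array, and the list shown corresponds to `$matches[0]`. The following is a minimal sketch of a variant, not part of the original answer, that uses non-capturing groups `(?:...)` so that only the full matches are returned. The variable names `$pattern` and `$urls` are introduced here purely for illustration.

```php
<?php
// Same sample text as in the question.
$text = "soundfly.us schoollife.edu hello.net some random news.yahoo.com text http://tinyurl.com/9uxdwc some http://google.com random text http://tinyurl.com/787988 and others will en.wikipedia.org/wiki/Country_music URL";

// Same character classes as the answer's pattern; the only change (an
// assumption of this sketch) is that the groups are non-capturing, so
// $matches contains a single sub-array holding the full matches.
// The leading and trailing "$" are the regex delimiters, as in the answer.
$pattern = '$\b(?:(?:https?|ftp|file)://)?[-A-Z0-9+&@#/%?=~_|!:,.;]*\.[-A-Z0-9+&@#/%=~_|]+$i';

$matches = array();
preg_match_all($pattern, $text, $matches);

// $matches[0] is the flat list of matched URLs.
$urls = $matches[0];
print_r($urls);
```

With capturing groups the same information is still available in `$matches[0]`; the non-capturing form simply avoids the extra sub-arrays for the protocol.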

This also works with IP addresses, because the only structural requirement is the presence of a dot. Tested with the strings "192.168.0.1" and "192.168.0.1/test/index.php".
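
The IP address check can be reproduced with a short sketch like the one below. It reuses the pattern from this answer unchanged; the `$samples` and `$found` variables are hypothetical names added for the example.

```php
<?php
// Pattern copied from the answer ("$" is the delimiter, "i" the flag).
$pattern = '$\b((https?|ftp|file)://)?[-A-Z0-9+&@#/%?=~_|!:,.;]*\.[-A-Z0-9+&@#/%=~_|]+$i';

// The two strings mentioned in the answer.
$samples = array('192.168.0.1', '192.168.0.1/test/index.php');

$found = array();
foreach ($samples as $sample) {
    $matches = array();
    preg_match_all($pattern, $sample, $matches);
    // Keep only the full matches for each sample string.
    $found[$sample] = $matches[0];
}

print_r($found);
```

Since a dot is all the pattern requires, it should also accept dotted tokens that are not addresses (a decimal number such as "3.14", for instance), so some extra filtering may be needed if the input text contains those.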

niconoe