Get all urls in a string with php

Question

I'm trying to figure out a way to get an array of URLs from a string of text. The text will be somewhat formatted like this:

Some random text up here

http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/

Obviously, those links can be anything (and there can be many links; those are just the ones I'm testing with now). If I use a simple URL, my regex works fine.

I am using:

preg_match_all('((https?|ftp|gopher|telnet|file|notes|ms-help):'.
    '((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)',
    $bodyMessage, $matches, PREG_PATTERN_ORDER);

When I do a print_r( $matches); the result I get is:

Array ( [0] => Array (
    [0] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=
    [1] => http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= 
    [2] => http://techcrunch.co=
    [3] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-ip= 
    [4] => http://techcrunch.com/2012/07/20/last-day-to-purc=
    [5] => http://tec=
)
...

None of those items in that array are full links from the links above.

Anyone know of a good way to get what I need? I've found a bunch of regex stuff to get links for PHP, but none of it works.

Thanks!

Edit:

OK, so I'm pulling these links from an email. The script parses the email, grabs the body of the message, and then tries to grab the links from that. After investigating the email, it appears that, for some reason, a space is being added in the middle of each URL. Here is the body of the message as seen by my PHP script:

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

Any suggestions on how to make it not break the URLs?

EDIT 2

As per laurent's suggestion, I ran this code:

 $bodyMessage = str_replace("= ", "",$bodyMessage);

However, when I echo that out, it doesn't seem to replace "= ":

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable

Hmm interesting... I just edited my question... the links are coming from an email that I then parse to get the body message... it seems like the email is putting a space right in the middle of the link! suggestions? — Bill, Jul 21 '12 at 00:46
Those instances of `= ` look suspiciously like some sort of chunked encoding that your code is not properly handling. — mellamokb, Jul 21 '12 at 00:48
I'd just replace all "`= `" with nothing before processing the string. — laurent, Jul 21 '12 at 00:49
Good point. See my newest edit... string replace doesn't seem to want to work on it though — Bill, Jul 21 '12 at 01:01
Still works just fine: http://ideone.com/00GOb. It's probably not simply an equals followed by space in your actual string. You may need to do a hex-dump of your string to see what actual characters are there. See for example here how to convert string to hex: http://stackoverflow.com/questions/1057572/how-can-i-get-a-hex-dump-of-a-string-in-php. — mellamokb, Jul 21 '12 at 01:08
Just a thought, but the "= " must be a standard encoding. I bet if you found out what that encoding is called you could find some code to do it for you just right. — John, Nov 04 '16 at 23:40
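
The `Content-Transfer-Encoding: quoted-printable` header in the dumped body hints at the encoding the comments are looking for: in quoted-printable, a trailing `=` marks a soft line break and `=3D` stands for a literal `=`. Below is a minimal sketch, assuming the raw body part still contains its original line breaks and has already been separated from the other MIME parts. It uses PHP's built-in `quoted_printable_decode()` before matching URLs. The function name `extractUrlsFromQuotedPrintable`, the regex, and the sample input are illustrative assumptions, not something from the original post.

```php
<?php
// Hypothetical helper: decode a quoted-printable body, then collect URLs.
function extractUrlsFromQuotedPrintable(string $encodedBody): array
{
    // Removes "=" soft line breaks and turns sequences like "=3D" back into "=".
    $decoded = quoted_printable_decode($encodedBody);

    // Match http/https URLs up to the next whitespace, quote, or angle bracket.
    preg_match_all('~https?://[^\s"<>]+~i', $decoded, $matches);

    return $matches[0];
}

// Assumed sample input, shaped like the email body in the question.
$encoded = "http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick=\r\n"
         . "ets-for-disrupt-sf/\r\n"
         . "http://example.com/?a=3D1&b=\r\n"
         . "=3D2\r\n";

$urls = extractUrlsFromQuotedPrintable($encoded);
print_r($urls);
```

If the decoded URLs still come out broken, the body may not really be quoted-printable at that point, and a hex dump with `bin2hex()` (as suggested above) would show which bytes actually follow each `=`.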

score 9 · Answer 1 · answered Jul 21 '12 at 00:48

    /**
     * Get URLs from a string (the string may itself be a URL).
     *
     * @param string $string
     * @return array
     */
    function getUrls($string) {
        $regex = '/https?\:\/\/[^\" ]+/i';
        preg_match_all($regex, $string, $matches);
        return $matches[0];
    }

Eswar Rajesh Pinapala

You should also add the new line to the negation `$regex = '/https?\:\/\/[^\" \n]+/i';` – unloco Apr 25 '15 at 18:36
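
To illustrate the comment above: excluding only the space character lets a match run across a line break into the following text. A small sketch, assuming it is acceptable to stop at any whitespace (`\s` covers spaces, tabs, and newlines); the function name `getUrlsStopAtWhitespace` and the sample text are made up for illustration.

```php
<?php
// Hypothetical variant of getUrls() that stops at any whitespace or double quote.
function getUrlsStopAtWhitespace(string $text): array
{
    preg_match_all('~https?://[^\s"]+~i', $text, $matches);
    return $matches[0];
}

// Assumed sample: two URLs separated by a newline rather than a space.
$sample = "First: http://example.com/a\nSecond: https://example.org/b?x=1 done";
$urls = getUrlsStopAtWhitespace($sample);
print_r($urls);
```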

Dhananjay Yadav · Answer 2 · 2021-08-27T07:55:35.417

The following code fills the array $urls_in_string; at index zero, $urls_in_string[0], you will find all the full URLs.

    $urls_in_string = [];
    $string_with_urls = "World's most popular social networking website is https://www.facebook.com. We have many other such websites like https://twitter.com/home and https://www.linkedin.com/feed/ etc.";
    $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/\S*)?/im";
    preg_match_all($reg_exUrl, $string_with_urls, $urls_in_string);
    print_r($urls_in_string);

// Output
/*
Array
(
    [0] => Array
        (
            [0] => https://www.facebook.com
            [1] => https://twitter.com/home
            [2] => https://www.linkedin.com/feed/
        )

    [1] => Array
        (
            [0] => https
            [1] => https
            [2] => https
        )

    [2] => Array
        (
            [0] => 
            [1] => /home
            [2] => /feed/
        )

)
*/
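
The extra sub-arrays in the output above come from the capturing groups for the scheme and the path. If only the full URLs are wanted, one option is to make those groups non-capturing with `(?:...)`. This is a sketch of that variation, not the answer's original code; the `~` delimiter and the variable names are choices made here for illustration.

```php
<?php
$text = "Visit https://www.facebook.com. Others: https://twitter.com/home and https://www.linkedin.com/feed/ etc.";

// Same idea as the pattern above, but with non-capturing groups,
// so $matches only contains the full matches at index 0.
$pattern = '~(?:https?|ftps?)://[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,6}(?:/\S*)?~i';

preg_match_all($pattern, $text, $matches);
$urls = $matches[0];
print_r($urls);
```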

score 0 · Answer 3 · answered Jul 04 '13 at 15:42

Use the following regular expression instead (wrapped in `~` delimiters here, since PHP's preg_* functions require them).

$regex = "~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~";

Hope it helps.

score 0 · Answer 4 · answered Aug 27 '21 at 08:59

You can do something like the following:

$url = "http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/";

$dataArray = explode("http",$url);

echo "<pre>";print_r($dataArray);

This will return an array like the following:

Array
(
 [0] => 
 [1] => ://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~


 [2] => ://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/
)

When you extract the URLs from the output above, prepend "http" to each piece. I think this will help you.
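
The prepend step could look roughly like this. It is only a sketch: the function name `urlsViaExplode` and the sample text are assumptions, and the approach would wrongly split any URL that itself contains "http" (for example in a redirect parameter).

```php
<?php
// Hypothetical helper built on the explode() idea above.
function urlsViaExplode(string $text): array
{
    $urls = [];
    foreach (explode('http', $text) as $piece) {
        // Only pieces that continue a scheme ("://" or "s://") can be URLs.
        if (strncmp($piece, '://', 3) !== 0 && strncmp($piece, 's://', 4) !== 0) {
            continue;
        }
        // Keep everything up to the first whitespace, then restore the "http" prefix.
        $urls[] = 'http' . preg_split('/\s+/', $piece, 2)[0];
    }
    return $urls;
}

// Assumed sample text with leading and trailing non-URL words.
$sample = "Intro text\n\nhttp://example.com/a?x=1\n\nhttps://example.org/b/ trailing words";
$urls = urlsViaExplode($sample);
print_r($urls);
```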

Happy Coding
