5

I'm trying to figure out a way to get an array of URLs from a string of text. The text will be somewhat formatted like this:

Some random text up here

http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/

Obviously, those links can be anything (and there can be many links; those are just the ones I'm testing with now). If I use a simple URL, my regex works fine.

I am using:

preg_match_all('((https?|ftp|gopher|telnet|file|notes|ms-help):'.
    '((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)',
    $bodyMessage, $matches, PREG_PATTERN_ORDER);

When I do a print_r( $matches); the result I get is:

Array ( [0] => Array (
    [0] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=
    [1] => http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= 
    [2] => http://techcrunch.co=
    [3] => http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-ip= 
    [4] => http://techcrunch.com/2012/07/20/last-day-to-purc=
    [5] => http://tec=
)
...

None of those items in that array are full links from the links above.

Anyone know of a good way to get what I need? I've found a bunch of regex stuff to get links for PHP, but none of it works.

Thanks!

Edit:

OK, so I'm pulling these links from an e-mail. The script parses the email, grabs the body of the message, and then tries to grab the links from that. After investigating the email, it appears that something is adding a space in the middle of each URL. Here is the body of the message as seen by my PHP script:

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 

Any suggestions on how to keep it from breaking the URLs?

EDIT 2

As per laurent's suggestion, I ran this code:

 $bodyMessage = str_replace("= ", "",$bodyMessage);

However, when I echo that out, the "= " sequences have not been replaced:

 --00248c711bb99ca36d04c54ba5c6 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon= es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2= =3D835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033f= deed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~ http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tick= ets-for-disrupt-sf/ --00248c711bb99ca36d04c54ba5c6 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable 
Bill
  • Looks fine to me: http://ideone.com/ulJ4a. – mellamokb Jul 21 '12 at 00:39
  • Hmm interesting... I just edited my question... the links are coming from an email that I then parse to get the body message... it seems like the email is putting a space right in the middle of the link! suggestions? – Bill Jul 21 '12 at 00:46
  • Those instances of `= ` look suspiciously like some sort of chunked encoding that your code is not properly handling. – mellamokb Jul 21 '12 at 00:48
  • I'd just replace all "`= `" with nothing before processing the string. – laurent Jul 21 '12 at 00:49
  • Good point. See my newest edit... string replace doesn't seem to want to work on it though – Bill Jul 21 '12 at 01:01
  • @mellamokb, is there any way to fix this? See my newest edit – Bill Jul 21 '12 at 01:04
  • Still works just fine: http://ideone.com/00GOb. It's probably not simply an equals followed by space in your actual string. You may need to do a hex-dump of your string to see what actual characters are there. See for example here how to convert string to hex: http://stackoverflow.com/questions/1057572/how-can-i-get-a-hex-dump-of-a-string-in-php. – mellamokb Jul 21 '12 at 01:08
  • Just a thought, but the "= " must be a standard encoding. I bet if you found out what that encoding is called you could find some code to do it for you just right. – John Nov 04 '16 at 23:40
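
The `Content-Transfer-Encoding: quoted-printable` header in the dumped body hints at what the `=` markers are: in quoted-printable, an `=` at the end of a line is a soft line break and `=3D` encodes a literal `=`. The following is a minimal sketch, assuming the raw body really contains `=` followed by CRLF (which would also explain why replacing `"= "` finds nothing). The sample is a shortened, hypothetical excerpt of the message, and `$rawBody` and `$urls` are illustrative names:

```php
<?php
// Hypothetical, shortened sample of one body part. Assumption: the raw text
// has "=" + CRLF (a soft line break), which only looks like "= " when echoed.
$rawBody = "http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphon=\r\n"
         . "es-bezel-a-massive-notification-light/?grcc=3D88888Z0ZwdgtZ0Z0Z0Z0Z0\r\n";

// Hex-dump the bytes starting at the first "=" to see what is really there.
$pos = strpos($rawBody, "=");
echo bin2hex(substr($rawBody, $pos, 3)), "\n"; // "=", CR, LF for this sample

// quoted_printable_decode() drops soft line breaks and decodes =XX sequences.
$decoded = quoted_printable_decode($rawBody);

// Simple pattern for illustration: scheme, then everything up to whitespace,
// a double quote or an angle bracket.
preg_match_all('~https?://[^\s"<>]+~i', $decoded, $matches);
$urls = $matches[0];
print_r($urls);
```

In a real multipart message it is probably safer to decode only the parts whose headers declare quoted-printable, rather than the whole message at once.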

4 Answers

9
    /**
     * Get URLs from a string (the string may itself be a URL).
     *
     * @param string $string
     * @return array
     */
    function getUrls($string) {
        // Match http:// or https://, then everything up to the next
        // whitespace character (including newlines) or double quote.
        $regex = '~https?://[^\s"]+~i';
        preg_match_all($regex, $string, $matches);
        return $matches[0];
    }
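
The output in the question lists each link twice, presumably because the message carries both a text/plain and a text/html part. As a rough sketch of how the same idea could also drop such repeats, here is a variant with a hypothetical helper name, `extractUniqueUrls`, which is not part of the answer above. It additionally stops at single quotes and angle brackets, which is an assumption about what the HTML part looks like:

```php
<?php
// Hypothetical helper (not from the answer): like getUrls(), but it stops at
// any whitespace, quote or angle bracket and removes duplicate matches.
function extractUniqueUrls(string $text): array
{
    preg_match_all('~https?://[^\s"\'<>]+~i', $text, $matches);
    // array_unique() keeps the first occurrence; array_values() re-indexes from 0.
    return array_values(array_unique($matches[0]));
}

// Sample input: the same link once as plain text and once inside an href.
$text = "Some random text up here\n\n"
      . "http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/\n"
      . "<a href=\"http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/\">link</a>\n";

$urls = extractUniqueUrls($text);
print_r($urls);
```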
Eswar Rajesh Pinapala
1

Using the following code, you get an array $urls_in_string; its zero index, $urls_in_string[0], contains all the URLs.

    $urls_in_string = [];
    $string_with_urls = "The world's most popular social networking website is https://www.facebook.com. We have many other such websites, like https://twitter.com/home and https://www.linkedin.com/feed/ etc.";
    $reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,6}(\/\S*)?/im";
    preg_match_all($reg_exUrl, $string_with_urls, $urls_in_string);
    print_r($urls_in_string);

// Output
/*
Array
(
    [0] => Array
        (
            [0] => https://www.facebook.com
            [1] => https://twitter.com/home
            [2] => https://www.linkedin.com/feed/
        )

    [1] => Array
        (
            [0] => https
            [1] => https
            [2] => https
        )

    [2] => Array
        (
            [0] => 
            [1] => /home
            [2] => /feed/
        )

)
*/
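
The extra sub-arrays at indexes 1 and 2 come from the two capturing groups in the pattern. If only the full URLs are wanted, one option is to make those groups non-capturing with `(?:...)`. A small sketch under that assumption, using a shortened sample sentence; `$pattern` and `$urls` are illustrative names:

```php
<?php
// Shortened sample sentence with the same three links as the answer.
$text = "Visit https://www.facebook.com. Also https://twitter.com/home "
      . "and https://www.linkedin.com/feed/ etc.";

// Same structure as the answer's pattern, but with non-capturing groups,
// so preg_match_all() fills only $matches[0].
$pattern = '~(?:https?|ftps?)://[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,6}(?:/\S*)?~i';
preg_match_all($pattern, $text, $matches);

$urls = $matches[0];
print_r($urls);
```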
Dhananjay Yadav
0

Use the following regular expression instead. It is wrapped in ~ delimiters here, because PHP's preg_* functions require delimiters around the pattern.

$regex = "~(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))~";

Hope it helps.

sagunms
0
You can do something like the following:

$url = "http://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~

http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/";

$dataArray = explode("http",$url);

echo "<pre>";print_r($dataArray);

This prints an array like the following:

Array
(
 [0] => 
 [1] => ://techcrunch.com/2012/07/20/kickstarter-flashr-wants-to-make-the-iphones-bezel-a-massive-notification-light/?grcc=88888Z0ZwdgtZ0Z0Z0Z0Z0&grcc2=835637c33f965e6cdd34c87219233711~1342828462249~fca4fa8af1286d8a77f26033fdeed202~510f37324b14c50a5e9121f955fac3fa~1342747216490~0~0~0~0~0~0~0~0~7~3~


 [2] => ://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/
)

When you use the values from the output above, prepend "http" to each one. I think this will help you.
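
The prepend step can be sketched as follows. This is only an illustration of the explode() idea, not a robust parser: it assumes "http" never appears inside a URL or in the surrounding prose, and the second link (example.com) is a hypothetical addition to show an https URL followed by more text:

```php
<?php
// Sample text: the question's second link plus a hypothetical https link.
$text = "Some random text up here\n\n"
      . "http://techcrunch.com/2012/07/20/last-day-to-purchase-extra-early-bird-tickets-for-disrupt-sf/\n\n"
      . "https://example.com/page?x=1 trailing words";

$parts = explode("http", $text);
array_shift($parts); // everything before the first "http" is not a URL

$urls = [];
foreach ($parts as $part) {
    // Put back the "http" that explode() removed, then cut at the first whitespace.
    $candidate = "http" . $part;
    $urls[] = preg_split('/\s+/', $candidate, 2)[0];
}
print_r($urls);
```

A URL that itself contains "http" (for example in a redirect parameter) would be split in two by this approach, so a regular expression is likely the sturdier choice.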

Happy Coding
Vishal_VE