
There are literally hundreds of questions here on SE (and on the web in general) regarding this issue, and I tried a LOT, but I cannot find the ultimate catch-all regex.

Feel free to jump to the TL;DR version below...

I need to parse a string to catch all URLs.

I am using this now (the closest I got to working):

$content = preg_replace_callback( '/((http[s]?:|www[.])[^\s]*)/i', 'my_callback', $content );

The problem is that it is not catching ALL URLs:

    http://designscrazed.com/personal-wordpress-blog-themes/ <-- OK
    https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template <-- OK
    www.tuicool.com/articles/rqAzU3   <-- OK
    html5up.net/overflow/   <-- NOT WORKING
    http://www.tuicool.com/articles/rqAzU3    <-- OK
    http://live.btoa.com.au/spotfinder/docs/#ByVCPlik   <-- OK
    www.designrazzi.com/2013/free-css3-html5-templates/    <-- OK
    themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/   <-- NOT WORKING

I also tried without the WWW

$content = preg_replace_callback( '/(http[s]?:[^\s]*)/i', 'my_callback', $content );

and even

 $content = preg_replace_callback( '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#i', 'my_callback', $content );

None of the three patterns works for URLs wrapped in an HTML link.

For example, in a link like

 <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

it will catch the URL almost correctly, but will leave the HTML part AFTER it:

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

producing

THIS WAS CAUGHT" target="_blank">SE</a>
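
The trailing HTML is left over because `[^\s]*` stops only at whitespace, so it has no notion of where an attribute value ends. A minimal sketch, assuming it is acceptable for a URL to end at the first quote or angle bracket; the sample string and the variable names are made up for illustration:

```php
<?php
// Illustrative input: one free-standing URL and one inside an href attribute.
$content = 'see http://example.com/page and <a href="http://example.com/a#b" target="_blank">SE</a>';

// Same idea as the pattern above, but the URL body also stops at quotes and angle brackets.
$pattern = '~(?:https?:|www\.)[^\s"\'<>]*~i';

preg_match_all($pattern, $content, $matches);
print_r($matches[0]);
```

This only addresses the href case; URLs without a protocol or `www.` prefix are still not matched by this pattern.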

The TL;DR version:

I basically need a regex that cleanly catches ALL URLs in these variants:

http://www.example.com
http://example.com/
http://www.example.com/seconday/somepage#hashes?parameters
http://www.example.com/seconday/
http://www.example.com/seconday
http://example.com/seconday
http://example.com/seconday/

All of the above with http, https, or without a protocol prefix (e.g. example.com/seconday).

On top of that, all of those can be wrapped in HTML like

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank" some_attribute='somevalue' >SE</a>

EDIT I (after comments)

I write "can" because some URLs are also "free standing", where methods like DOM parsing with DOMDocument or SimpleHTMLDOM would fail because the URLs are not inside an <a> tag or do not have an href attribute (as in the comments: think of parsing this very page with this question itself. How can DOM parsing catch the URLs that are inside a <code> tag?)

Obmerk Kronen

1 Answer

Okay, so I took a stab at this and came up with the following REGEX. I'm sure it's not going to catch everything, but it does seem to catch all of the URLs you've listed on this page. Here is an example:

// HERE IT IS LOOPING THROUGH AN ARRAY
$url_array = array('http://www.example.com', 'http://example.com/', 'http://www.example.com/seconday/somepage#hashes?parameters', 'http://www.example.com/seconday/', 'http://www.example.com/seconday', 'http://example.com/seconday', 'http://example.com/seconday/', 'http://designscrazed.com/personal-wordpress-blog-themes/', 'https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template', 'www.tuicool.com/articles/rqAzU3', 'html5up.net/overflow/', 'http://www.tuicool.com/articles/rqAzU3', 'http://live.btoa.com.au/spotfinder/docs/#ByVCPlik', 'www.designrazzi.com/2013/free-css3-html5-templates/', 'themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/', '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>');

$extension_array = array('com', 'net', 'org', 'biz');

foreach ($url_array AS $url) {

    print '<br>'.$url;
    // The extensions sit in their own group so the leading dot applies to every alternative
    if (preg_match('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url, $m)) {
        print "<pre><font color='orange'>"; print_r($m); print "</font></pre>";
    }

}

Or here is the same thing but using a string of text like what you are actually working with:

$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ionsipn  http://www.example.com/seconday/somepage#hashes?parameters opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday wwwe http://example.com/seconday               http://example.com/seconday/ 00000002222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeorop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;sl2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ asdf themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l  www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

$extension_array = array('com', 'net', 'org', 'biz');

if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url_string, $m)) {
    print "<pre><font color='red'>"; print_r($m); print "</font></pre>";
}

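Also, you can add the pattern modifiers 'ms' if you are searching through multiple lines instead of a single line like I have in my example, though since this pattern has no ^, $ or unescaped dot, it should match across line breaks either way.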
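
One detail worth checking when a pattern is assembled from an array: `|` binds very loosely, so the list of extensions needs its own group if the literal dot is meant to apply to every entry. A rough sketch of one way to build that fragment; `build_extension_group` is a hypothetical helper name, not part of the code above:

```php
<?php
// Hypothetical helper: turn a list of extensions into a regex fragment.
function build_extension_group(array $extensions): string {
    // preg_quote() escapes regex metacharacters; '~' is passed because it is the delimiter used below.
    $quoted = array_map(function ($ext) {
        return preg_quote($ext, '~');
    }, $extensions);

    // Wrap the alternatives so the leading dot applies to each of them.
    return '\.(?:' . implode('|', $quoted) . ')';
}

$group   = build_extension_group(array('com', 'net', 'org', 'biz'));
$pattern = '~\b[-A-Z0-9.]+' . $group . '\b~i';

// A host name with a listed extension matches; a plain word ending in "net" does not.
var_dump(preg_match($pattern, 'visit html5up.net today', $m));
var_dump(preg_match($pattern, 'the internet is big'));
```

Without the group, a fragment like `\.com|net` reads as "`.com` or `net`", so a plain word such as "internet" could be picked up as a URL.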

EDIT:

There is an error in the previous code where I am calling $url_string in my matching line, when I had named the variable $urls_as_string when I set the content. If you correct the variable name, it should work as expected.

Anyway, I took the code above and modified it to work with preg_replace_callback like you had requested. This seems to work with all of the URLs that you had listed. Check it out:

// CREATE THE STRING
$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ion
sipn  http://www.example.com/seconday/somepage#hashes?parameters 



opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday ww
we http://example.com/seconday               http://example.com/seconday/ 000000
02222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeo
rop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;s
l2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ as
df themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l 
 www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';


// SET SOME DOMAIN EXTENSIONS
$extension_array = array('com', 'net', 'org', 'biz');



// CHECK TO SEE IF OUR REGEX IS WORKING ... PRINT OUT ALL OF THE MATCHES
if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', $urls_as_string, $m)) {
    print_r($m);
}



// USE PREG_REPLACE_CALLBACK TO FORMAT THE URLS
$content = preg_replace_callback( '~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', 'my_callback', $urls_as_string);



// PRINT OUT THE FINISHED STRING
print "\n\n\n\nFINAL OUTPUT: \n".$content;



// THIS FUNCTION DOES A CRAPTASTIC JOB AT FORMATTING URLS
function my_callback($m) {

    $url = $m[0];
    $url_formatted = $url;

    if (!preg_match('~^http(s)?://~', $url)) {
        $url_formatted = 'http://'.$url;
    }

    // Use the prefixed URL for the href so links without a protocol still resolve
    $url_formatted = '<a href="'.$url_formatted.'">'.$url.'</a>';

    return $url_formatted;

}

Here is a working demo of the code

The callback function I wrote is pretty stupid, but I'm assuming you already have a function that you are going to use. This is just to demonstrate that it is doing what it's supposed to be doing. Hopefully this solution solves your problem. If not, let me know and I can work on it some more.
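
For what it's worth, here is a sketch of what a slightly more careful callback might look like. It assumes that adding `http://` to protocol-less matches is the desired behaviour; the function name `linkify_callback`, the sample text, and the simplified pattern are all made up for this illustration:

```php
<?php
// Hypothetical callback: the link text stays as typed, the href gets a scheme if one is missing.
function linkify_callback(array $m): string {
    $url  = $m[0];
    $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;

    // Escape both values so characters such as & or " cannot break the generated markup.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '">'
         . htmlspecialchars($url, ENT_QUOTES) . '</a>';
}

// Simplified pattern for the demo only: optional scheme, optional www., host, listed extension, optional path.
$pattern = '~(?:https?://)?(?:www\.)?[-A-Z0-9.]+\.(?:com|net|org|biz)(?:/[-A-Z0-9_./#?]*)?~i';
$text    = 'read html5up.net/overflow/ or http://example.com/seconday';

$result = preg_replace_callback($pattern, 'linkify_callback', $text);
echo $result;
```

Keeping the displayed text and the href separate means the visible text is unchanged while the link itself always carries a protocol.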

Quixrick
  • Hi, thanks for trying it! I have indeed tried your code, and it worked in the first version "as is", but somehow I could not get it to work with `preg_replace_callback`: it always missed the first URL slash (`/`). Not sure why. Your later code (after the edit) did not return anything for me (always in `preg_replace_callback`). I also tried appending the `m` modifier for multiline. However, it seems to me that you are closer to cracking this problem than all the regexes I found loose on the net. – Obmerk Kronen Mar 12 '14 at 01:22
  • Okay, I made an edit to my previous code to try and get you what you are wanting. Hopefully I am on the right track here. Let me know if you still have issues. – Quixrick Mar 12 '14 at 18:36