There are literally hundreds of questions here on SE (and on the web in general) regarding this issue, and I tried a LOT, but I cannot find the ultimate catch-all regex.
Feel free to jump to the TL;DR version below...
I need to parse a string and catch all URLs.
I am using this now (the closest I got to working):
$content = preg_replace_callback( '/((http[s]?:|www[.])[^\s]*)/i', 'my_callback', $content );
The problem is that it does not catch ALL URLs:
http://designscrazed.com/personal-wordpress-blog-themes/ <-- OK
https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template <-- OK
www.tuicool.com/articles/rqAzU3 <-- OK
html5up.net/overflow/ <-- NOT WORKING
http://www.tuicool.com/articles/rqAzU3 <-- OK
http://live.btoa.com.au/spotfinder/docs/#ByVCPlik <-- OK
www.designrazzi.com/2013/free-css3-html5-templates/ <-- OK
themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ <-- NOT WORKING
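To make the misses above easy to reproduce, here is a minimal sketch that runs the first pattern over a few of the sample URLs and collects the ones it does not match. The variable names ($samples, $missed) are only for illustration. As far as I can tell, the two bare domains are skipped simply because the pattern requires a match to start with "http:", "https:" or "www.".

```php
<?php
// First pattern from above: a match must start with "http:", "https:" or "www.".
$pattern = '/((http[s]?:|www[.])[^\s]*)/i';

// A few of the sample URLs from the list above.
$samples = [
    'http://designscrazed.com/personal-wordpress-blog-themes/',
    'www.tuicool.com/articles/rqAzU3',
    'html5up.net/overflow/',
    'themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/',
];

// Collect every sample the pattern fails to match.
$missed = [];
foreach ($samples as $url) {
    if (preg_match($pattern, $url) !== 1) {
        $missed[] = $url;
    }
}

// Prints the two URLs that have neither a protocol nor a "www." prefix.
print_r($missed);
```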
I also tried without the www alternative:
$content = preg_replace_callback( '/(http[s]?:[^\s]*)/i', 'my_callback', $content );
and even
$content = preg_replace_callback( '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#i', 'my_callback', $content );
None of the three patterns works for URLs wrapped in an HTML link.
For example, in a link like
<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>
it will catch the URL almost correctly, but will leave the HTML part AFTER it:
http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>
producing
THIS WAS CAUGHT" target="_blank">SE</a>
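To make the HTML case reproducible, here is a minimal sketch that runs the first and the third pattern over the sample link. The body of my_callback below is a hypothetical stand-in (the real callback does more); it only swaps each match for a marker so the leftover markup is visible. The results in the comments are what I understand these patterns to produce on this input.

```php
<?php
// Hypothetical stand-in for my_callback: replaces every match with a marker.
function my_callback(array $m): string
{
    return 'THIS WAS CAUGHT';
}

// The sample link from above.
$html = '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

// First and third patterns, copied unchanged from above.
$first = '/((http[s]?:|www[.])[^\s]*)/i';
$third = '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#i';

$withFirst = preg_replace_callback($first, 'my_callback', $html);
$withThird = preg_replace_callback($third, 'my_callback', $html);

// First pattern: runs up to the next whitespace, so the closing quote is consumed too.
// <a href="THIS WAS CAUGHT target="_blank">SE</a>
echo $withFirst, "\n";

// Third pattern: stops at the closing quote, the rest of the tag is left in place.
// <a href="THIS WAS CAUGHT" target="_blank">SE</a>
echo $withThird, "\n";
```

The third pattern stops at the closing quote only because the double quote is not in its character class; the first one accepts anything that is not whitespace, so it takes the quote along with the URL.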
The TL;DR version:
I basically need a regex that cleanly catches ALL URLs in any of these variants:
http://www.example.com
http://example.com/
http://www.example.com/seconday/somepage#hashes?parameters
http://www.example.com/seconday/
http://www.example.com/seconday
http://example.com/seconday
http://example.com/seconday/
All of the above with http, https, or without a protocol prefix (e.g. example.com/seconday).
On top of that, all of those can be wrapped in HTML like
http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank" some_attribute='somevalue' >SE</a>
EDIT I (after comments)
I write "can" because some URLs are also "free standing", where methods like DOM parsing with DOMDocument or SimpleHTMLDOM would fail because the URLs are not inside an HTML <a> tag or do not have href attributes (as in the comment: "Think of parsing this very own page with this question itself. How can DOM parsing catch the URLs that are inside a <code> tag, for example, or in no particular tag at all?" – Obmerk Kronen, Mar 08 '14 at 07:03).
When parsing this very question, or the whole page it is on, parsing the DOM is suddenly not such a great option, is it? (Maybe I need to think of a combination of methods after all; it is a pretty common question/problem.)