Regex to find URL in string wrapped in HTML or not

Question

There are literally hundreds of questions here on SE (and on the web in general) regarding this issue, and I tried a LOT, but I cannot find the ultimate catch-all regex.

Feel free to jump to the TL;DR version below...

I need to parse a string to catch all URLS.

I am using this now (the closest I got to working):

$content = preg_replace_callback( '/((http[s]?:|www[.])[^\s]*)/i', 'my_callback', $content );

The problem is that it is not catching ALL URLs:

    http://designscrazed.com/personal-wordpress-blog-themes/ <-- OK
    https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template <-- OK
    www.tuicool.com/articles/rqAzU3   <-- OK
    html5up.net/overflow/   <-- NOT WORKING
    http://www.tuicool.com/articles/rqAzU3    <-- OK
    http://live.btoa.com.au/spotfinder/docs/#ByVCPlik   <-- OK
    www.designrazzi.com/2013/free-css3-html5-templates/    <-- OK
    themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/   <-- NOT WORKING

I also tried without the WWW

$content = preg_replace_callback( '/(http[s]?:[^\s]*)/i', 'my_callback', $content );

and even

 $content = preg_replace_callback( '#[-a-zA-Z0-9@:%_\+.~\#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~\#?&//=]*)?#i', 'my_callback', $content );

None of the three attempts works for URLs wrapped in an HTML link.

For example, in a link like

 <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

it will catch the URL almost correctly, but will leave the HTML part that comes AFTER it:

http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>

producing

THIS WAS CAUGHT" target="_blank">SE</a>
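The leftover markup follows from `[^\s]*`, which keeps consuming characters until the next whitespace, so the closing quote of the attribute ends up inside the match. A minimal sketch of one possible adjustment, assuming it is acceptable for a URL to stop at a quote or an angle bracket (the sample string is made up for illustration):

```php
<?php
// Hypothetical sample input: one URL inside an href, one free-standing.
$content = 'see <a href="http://example.com/page#top" target="_blank">SE</a> and www.example.org/x';

// The first attempt from above, unchanged.
$original = '/((http[s]?:|www[.])[^\s]*)/i';
// Same idea, but the character class also stops at quotes and angle brackets.
$narrowed = '/((https?:|www[.])[^\s"\'<>]*)/i';

preg_match_all($original, $content, $before);
preg_match_all($narrowed, $content, $after);

print_r($before[0]); // first match still carries the trailing double quote
print_r($after[0]);  // matches end where the attribute value ends
```

This only addresses the trailing HTML; URLs without `http` or `www` (such as `html5up.net/overflow/`) are still not matched by either pattern.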

The TL;DR version :

I basically need a regex that cleanly catches ALL URLs in these variants:

http://www.example.com
http://example.com/
http://www.example.com/seconday/somepage#hashes?parameters
http://www.example.com/seconday/
http://www.example.com/seconday
http://example.com/seconday
http://example.com/seconday/

All of the above with http, https or without protocol prefix ( e.g. example.com/seconday ).

On top of that - all of those can be wrapped in HTML like

<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank" some_attribute='somevalue' >SE</a>

EDIT I ( after comments)

I write "can" because some URLs are also "free standing", where methods like DOM parsing with DOMDocument or SimpleHTMLDOM would fail because the URLs are not inside an <a> tag or do not have href attributes (as in the comments: think of parsing this very page with this question itself. How can DOM parsing catch the URLs that are inside a <code> tag?)
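For what it is worth, the two approaches can be combined: DOM parsing for `href` attributes, plus a regex over the text nodes for everything else. The following is only a rough sketch under that assumption. `collect_urls()` is a hypothetical helper name, the text-node pattern is deliberately simple (it still requires `http(s)://` or `www.`), and the PHP DOM extension is assumed to be available.

```php
<?php
// collect_urls() is a hypothetical helper for illustration only.
function collect_urls(string $html): array
{
    libxml_use_internal_errors(true);           // tolerate sloppy markup
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>'); // wrapper keeps fragments parseable
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $urls  = [];

    // 1) Standard links: read the href attribute directly.
    foreach ($xpath->query('//a/@href') as $attr) {
        $urls[] = $attr->value;
    }

    // 2) Free-standing URLs: scan every text node, including text inside <code>.
    foreach ($xpath->query('//text()') as $node) {
        if (preg_match_all('~(?:https?://|www\.)[^\s"\'<>]+~i', $node->nodeValue, $m)) {
            $urls = array_merge($urls, $m[0]);
        }
    }

    // Drop duplicates such as a link whose text repeats its own href.
    return array_values(array_unique($urls));
}

print_r(collect_urls('<a href="http://example.com/a#x" target="_blank">SE</a> and <code>www.example.org/b</code>'));
```

Because the regex only ever sees text nodes, it never runs into attribute quotes or closing tags.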

There are literally hundreds of answers telling you not to use regular expressions to parse HTML. Why are you still trying to do it? — Nate C-K, Mar 08 '14 at 05:54
@NateC-K Yep, in the many questions I read, I saw that many claim parsing is better than regex. But I only have a string from the DB. Another point is that not all the URLs are inside HTML, and parsing the DOM makes those equally hard to find. Last issue: runtime speed is a bit of a concern. — Obmerk Kronen, Mar 08 '14 at 05:55
This is a good place to start: http://www.php.net/manual/en/domdocument.loadhtml.php — Nate C-K, Mar 08 '14 at 06:17
The reason you are not getting answers is because the way to do this (as stated) is to parse the Dom for anchors. `getElementsByTagName('a')` is gonna catch whatever crazy URIs are present. — ficuscr, Mar 08 '14 at 06:47
@ficuscr . `getElementsByTagName('a')` will only catch those URLs that are inside an HTML `<a>` tag (standard links). It will not catch the URLs that are "free" or inside a `<code>` tag, for example, or in no particular tag at all. Think of parsing this very question itself or the whole page it is on. Suddenly parsing the DOM is not really so great an option, is it? (Maybe I need to think of a combination of methods after all; it is a pretty common question/problem.) — Obmerk Kronen, Mar 08 '14 at 07:03
@NateC-K Can you please elaborate on how [DomDocument](http://www.php.net/manual/en/domdocument.loadhtml.php) can help in parsing "free standing" links that are not inside an `<a>` tag or have no `href` attribute? It might be the solution I am looking for. — Obmerk Kronen, Mar 08 '14 at 07:07
Right. Can you describe this regex rule in English? Sounds like you want to match URIs but don't have a good way to identify them. What. .. "Has a dot and more than 2 letters"? This is probably a good start http://stackoverflow.com/questions/6191720/regular-expression-to-match-generic-url Look at answer by Nicholas Carey — ficuscr, Mar 08 '14 at 07:08
@ficuscr . Thanks for the link, I will look into it :-) . In plain English the rule would be (I think) `"preceding protocol or not (with ://) followed by (repeatable (more or equal to 1 letter, followed by a dot followed by more or equal to 2 letters) followed by anything until a space )"` — Obmerk Kronen, Mar 08 '14 at 07:14
@ObmerkKronen: I don't think there's a way to write a perfect scanner to lift all HTTP Links out of free text. There are just too many ways that they can be expressed. Any dot or slash could potentially be an indicator that you're dealing with a URL. One helpful rule I can think of is to look for the typical top-level domains (.com, .org, etc.) and pick out strings that look like URLs and including those pieces. — Nate C-K, Mar 08 '14 at 17:54
One approach you might try is to use your regex to scan for any string that might be a URL and then analyze it with some other tests to see if it really is a URL (or part of it is). — Nate C-K, Mar 08 '14 at 17:56
Another possible approach is to pick out anything you think might be a URL, parse out what you think should be the domain, and try to resolve the domain using a DNS server. Then discard domain names that don't resolve. You might try this only with more dubious matches, like maybe just ones that don't have http:// at the beginning. — Nate C-K, Mar 08 '14 at 17:59
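The two-stage idea from the last comments (collect anything URL-shaped, then test the dubious candidates) could look roughly like the sketch below. `find_urls()` and the `$resolves` callback are hypothetical names introduced here. The candidate pattern is a loose reading of the plain-English rule above, and the resolver is injectable so the function can be tried without touching the network.

```php
<?php
/**
 * Stage 1: collect anything shaped like [scheme://]host[/path].
 * Stage 2: keep scheme-less candidates only if $resolves accepts their host.
 * find_urls() and $resolves are hypothetical names for this sketch.
 */
function find_urls(string $text, callable $resolves): array
{
    $pattern = '~(?:https?://)?(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}(?:/[^\s"\'<>]*)?~i';
    preg_match_all($pattern, $text, $m);

    $urls = [];
    foreach ($m[0] as $candidate) {
        // Candidates that carry a scheme are trusted as-is.
        if (preg_match('~^https?://~i', $candidate)) {
            $urls[] = $candidate;
            continue;
        }
        // Dubious candidates: everything before the first slash is the host.
        $host = explode('/', $candidate, 2)[0];
        if ($resolves($host)) {
            $urls[] = $candidate;
        }
    }
    return $urls;
}

// A real resolver could wrap a DNS lookup (defined here, not called):
$dns = fn(string $host): bool => checkdnsrr($host, 'A');

// A stub resolver is enough to try the function offline:
$stub = fn(string $host): bool => $host === 'html5up.net';
print_r(find_urls('Read html5up.net/overflow/ and http://example.com/a but not file.txt here', $stub));
```

With the stub, `file.txt` is picked up as a candidate in stage 1 and discarded in stage 2. DNS lookups are slow, so restricting them to scheme-less matches, as suggested, keeps the cost down.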

Quixrick · Answer 1 · 2014-03-12T18:35:27.873

Okay, so I took a stab at this and came up with the following REGEX. I'm sure it's not going to catch everything, but it does seem to catch all of the URLs you've listed on this page. Here is an example:

// HERE IT IS LOOPING THROUGH AN ARRAY
$url_array = array('http://www.example.com', 'http://example.com/', 'http://www.example.com/seconday/somepage#hashes?parameters', 'http://www.example.com/seconday/', 'http://www.example.com/seconday', 'http://example.com/seconday', 'http://example.com/seconday/', 'http://designscrazed.com/personal-wordpress-blog-themes/', 'https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template', 'www.tuicool.com/articles/rqAzU3', 'html5up.net/overflow/', 'http://www.tuicool.com/articles/rqAzU3', 'http://live.btoa.com.au/spotfinder/docs/#ByVCPlik', 'www.designrazzi.com/2013/free-css3-html5-templates/', 'themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/', '<a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>');

$extension_array = array('com', 'net', 'org', 'biz');

foreach ($url_array AS $url) {

    print '<br>'.$url;
    if (preg_match('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url, $m)) {
        print "<pre><font color='orange'>"; print_r($m); print "</font></pre>";
    }

}
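One detail worth checking when a pattern is assembled from an array: alternation binds loosely, so `\.` followed directly by `com|net|org|biz` reads as "`.com`, or `net`, or `org`, or `biz`", and a plain word such as `magnet` can then match. A small sketch of one way to guard against that, assuming a hypothetical helper `build_url_pattern()` (not part of the code above) and a slightly simplified path part:

```php
<?php
// build_url_pattern() is a hypothetical helper, not part of the answer's code.
function build_url_pattern(array $extensions): string
{
    // Escape each extension, then wrap the alternation in its own group so the
    // leading "\." applies to every alternative, not only the first one.
    $tlds = implode('|', array_map(fn($e) => preg_quote($e, '~'), $extensions));

    // "\b" keeps an extension from matching only the start of a longer label.
    return '~(?:https?://)?(?:www\.)?[-A-Z0-9.]+\.(?:' . $tlds . ')\b(?:/[-A-Z0-9_.#?/]*)?~i';
}

$pattern = build_url_pattern(['com', 'net', 'org', 'biz']);
preg_match_all($pattern, 'a magnet, html5up.net/overflow/ and http://example.com', $m);
print_r($m[0]); // "magnet" is skipped; the two URLs are found
```

Printing the assembled pattern once (`echo $pattern;`) is a cheap way to see what the regex engine actually receives.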

Or here is the same thing but using a string of text like what you are actually working with:

$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ionsipn  http://www.example.com/seconday/somepage#hashes?parameters opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday wwwe http://example.com/seconday               http://example.com/seconday/ 00000002222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeorop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;sl2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ asdf themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l  www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';

$extension_array = array('com', 'net', 'org', 'biz');

if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~i', $url_string, $m)) {
    print "<pre><font color='red'>"; print_r($m); print "</font></pre>";
}

Also, you can add the pattern modifiers 'ms' if you are searching through multiple lines instead of a single line like I have in my example.

EDIT:

There is an error in the previous code where I am calling $url_string in my matching line, when I had named the variable $urls_as_string when I set the content. If you correct the variable name, it should work as expected.

Anyway, I took the code above and modified it to work with preg_replace_callback like you had requested. This seems to work with all of the URLs that you had listed. Check it out:

// CREATE THE STRING
$urls_as_string = 'asd a http://www.example.com w223 http://example.com/  ion
sipn  http://www.example.com/seconday/somepage#hashes?parameters 



opajiw348283 http://www.example.com/seconday/ 20923[\'#$%#$ http://www.example.com/seconday ww
we http://example.com/seconday               http://example.com/seconday/ 000000
02222 http://designscrazed.com/personal-wordpress-blog-themes/ +_)(&^&%$ https://creativemarket.com/nikokolev/7993-Kubrat-Responsive-Template oopeo
rop  www.tuicool.com/articles/rqAzU3 03083 2h1hh1`  html5up.net/overflow/ kksllkwpo2 http://www.tuicool.com/articles/rqAzU3  la;s
l2i2i3okn2 http://live.btoa.com.au/spotfinder/docs/#ByVCPlik black cat www.designrazzi.com/2013/free-css3-html5-templates/ as
df themeko.org/halsey-v1-1-9-ultimate-business-wordpress-theme/ l 
 www <a href="http://wordpress.stackexchange.com/questions/124977/how-to-add-qtranslate-multi-language-support-for-media/131971#131971" target="_blank">SE</a>';


// SET SOME DOMAIN EXTENSIONS
$extension_array = array('com', 'net', 'org', 'biz');



// CHECK TO SEE IF OUR REGEX IS WORKING ... PRINT OUT ALL OF THE MATCHES
if (preg_match_all('~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', $urls_as_string, $m)) {
    print_r($m);
}



// USE PREG_REPLACE_CALLBACK TO FORMAT THE URLS
$content = preg_replace_callback( '~(?:(?:http(?:s)?://)?(?:www\.)?[-A-Z0-9.]+\.(?:'.implode('|', $extension_array).')[-A-Z0-9_./]?(?:[-A-Z0-9#?/]+)?)~ims', 'my_callback', $urls_as_string);



// PRINT OUT THE FINISHED STRING
print "\n\n\n\nFINAL OUTPUT: \n".$content;



// THIS FUNCTION DOES A CRAPTASTIC JOB AT FORMATTING URLS
function my_callback($m) {

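    $url = $m[0];
    $href = $url;

    // Prepend a scheme when the match has none, so the link is not treated as relative
    if (!preg_match('~^http(s)?://~i', $url)) {
        $href = 'http://'.$url;
    }

    // Use the prefixed URL for href and the original match as the link text
    $url_formatted = '<a href="'.$href.'">'.$url.'</a>';

    return $url_formatted;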

}

Here is a working demo of the code

The callback function I wrote is pretty stupid, but I'm assuming you already have a function that you are going to use. This is just to demonstrate that it is doing what it's supposed to be doing. Hopefully this solution solves your problem. If not, let me know and I can work on it some more.

Hi.. Thanks for trying it! I have indeed tried your code, and it worked in the first version "as is", but somehow I could not manage to get it to work with `preg_replace_callback`. It always missed the first URL slash (`/`). Not sure why. Your later code (after the edit) did not return anything for me (always in `preg_replace_callback`). I also tried appending the `m` modifier for multiline. However, it seems to me that you are closer to cracking this problem than all the regexes I found loose on the net. — Obmerk Kronen, Mar 12 '14 at 01:22
Okay, I made an edit to my previous code to try and get you what you are wanting. Hopefully I am on the right track here. Let me know if you still have issues. — Quixrick, Mar 12 '14 at 18:36
