My code is:
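$rawhtml = file_get_contents("site url");
// Capture a URL, then require one following character that cannot be part of it.
$pat = '/((http|ftp|https):\/\/[\w#$&+,\/:;=?@.-]+)[^\w#$&+,\/:;=?@.-]/i';
preg_match_all($pat, $rawhtml, $matches1);
$links_array = array(); // initialise so the variable exists even when nothing matches
foreach ($matches1[1] as $plinks) {
    $links_array[] = $plinks;
}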
After testing several cases I noticed that the pattern has some gaps: a link is cut off at the first whitespace character.
For example, I have this URL as plain text in a variable:
$rawhtml = " http://www.filesonic.com/file/2185085531/TEST Voice 640-461 Test Cert Guide.epub
";
The result should be one complete link per line, for example:
http://www.filesonic.com/file/2185085481/TEST Voice (640)+461 Test Cert Guide.pdf
but the actual result is:
http://www.filesonic.com/file/2185085531/TEST
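A minimal, self-contained script that reproduces the truncation, using the same pattern and the sample text from above. The only assumption is that the sample string ends with a newline, as in the snippet:

```php
<?php
// Sample text from the question; the string ends with a newline.
$rawhtml = " http://www.filesonic.com/file/2185085531/TEST Voice 640-461 Test Cert Guide.epub\n";

// Same pattern as above. The character class contains no space,
// so the capture group ends at the first space in the file name.
$pat = '/((http|ftp|https):\/\/[\w#$&+,\/:;=?@.-]+)[^\w#$&+,\/:;=?@.-]/i';

preg_match_all($pat, $rawhtml, $matches1);

// Group 1 holds the captured URLs.
$links_array = $matches1[1];
print_r($links_array);
```

The array holds a single entry that stops at "TEST", because the space after it is the first character outside the allowed set.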
Sometimes the extracted links also contain a trailing , or ' or " character. How can I get rid of these?
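To make the trailing-character problem concrete, here is a rough sketch of a post-processing step, not a definitive fix. It assumes that , ' and " never legitimately end a link, and the sample values are hypothetical (example.com URLs made up for illustration):

```php
<?php
// Hypothetical extracted links, each ending in an unwanted character.
$links_array = array(
    'http://example.com/a,',
    "http://example.com/b'",
    'http://example.com/c"',
);

// rtrim() with a character list strips any run of those characters
// from the end of each string and leaves the rest untouched.
$cleaned = array_map(function ($link) {
    return rtrim($link, ",'\"");
}, $links_array);

print_r($cleaned);
```

This only cleans up the end of each link; it does not address the whitespace truncation described earlier.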