
My code is:

$rawhtml = file_get_contents( "site url" );

$pat = '/((http|ftp|https):\/\/[\w#$&+,\/:;=?@.-]+)[^\w#$&+,\/:;=?@.-]/i';

preg_match_all($pat, $rawhtml, $matches1);

// Group 1 holds each URL without the trailing delimiter character.
$links_array = array();
foreach ($matches1[1] as $plinks)
{
    $links_array[] = $plinks;
}

After testing several cases I noticed that the pattern has some "leaks": a link is cut off at the first whitespace character.

For example I have this text URL in a variable:

$rawhtml = " http://www.filesonic.com/file/2185085531/TEST Voice 640-461 Test Cert Guide.epub
";

The result should be one link per line:

http://www.filesonic.com/file/2185085481/TEST Voice (640)+461 Test Cert Guide.pdf

but the result is

http://www.filesonic.com/file/2185085531/TEST

Sometimes an extracted link also has a , or ' or " at the end. How can I get rid of these?

JJJ
Harvi
  • What you ask may well be impossible. How can the script know that in `"visit http://example.com now"` the space isn't part of the URL, but in `"download http://example.com/white space.pdf now"` it is? – JJJ Feb 03 '12 at 14:01
  • Technically the spaces don't belong in the URL. Working around that lack of proper syntax would only be a hack, and it is hard to fix without knowing the actual source page. But have you considered the alternative approach mentioned in the various other [extract links from html](http://stackoverflow.com/search?q=extract%20links%20from%20html%20php) questions? – mario Feb 03 '12 at 14:02
  • how to get rid of those commas, quotes or double quotes from the extracted links – Harvi Feb 03 '12 at 14:19

1 Answer


> how to get rid of those commas, quotes or double quotes from the extracted links

One could append a negative lookbehind such as (?<![,'"]) to keep the match from ending in one of those characters. But the real problem is the trailing character class, which you simply shouldn't use:

 [^\w#$&+,\/:;=?@.-]

That's what matches " and '.
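
As a rough sketch of both ideas together: the trailing class is dropped and the lookbehind is appended to the question's URL class. The sample string and the example.com/.org/.net URLs are made up for illustration, since the real page is not known.

```php
<?php
// Hypothetical sample input; the real page content is not known.
$rawhtml = 'See "http://example.com/a.pdf", or \'ftp://example.org/b?x=1\' and http://example.net/c,';

// Same URL character class as in the question, without the trailing
// negated class. The lookbehind rejects a match that ends in , ' or ".
$pat = '/(?:http|ftp|https):\/\/[\w#$&+,\/:;=?@.-]+(?<![,\'"])/i';

preg_match_all($pat, $rawhtml, $matches);

// With no capture group needed, the full matches are the links.
$links_array = $matches[0];

print_r($links_array);
```

Here the third URL is followed by a comma, which the character class would normally swallow; the lookbehind makes the regex engine give that one character back.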

As a hackish workaround to the other problem, the first character class could be augmented with a space.

 [\w#$&+,\/:;=?@. -]+
                 ▵

As said, that's probably not a good solution and might lead to other mismatches.
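
To illustrate the kind of mismatch meant here, the following sketch runs the space-augmented class against the two example strings from the comments on the question. The variable names are only for illustration.

```php
<?php
// The question's character class with a space added (the hackish workaround).
$pat = '/(?:http|ftp|https):\/\/[\w#$&+,\/:;=?@. -]+/i';

// Example strings taken from the comments on the question.
$samples = array(
    'visit http://example.com now',
    'download http://example.com/white space.pdf now',
);

$results = array();
foreach ($samples as $text) {
    if (preg_match($pat, $text, $m)) {
        // trim() drops a space the greedy match may leave at the end.
        $results[] = trim($m[0]);
    }
}

print_r($results);
// $results[0] is "http://example.com now"
// $results[1] is "http://example.com/white space.pdf now"
```

In both cases the word "now" is swallowed, because once spaces are allowed the pattern has no way to tell where the URL stops.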

mario