
I need help with a regex that will find a link that appears in different formats depending on how it was inserted into the HTML page.

I can read the pages into PHP, but I cannot come up with the right regex to find the URLs and isolate them.

I have a few examples of how they get inserted. Sometimes they are plain-text links, and sometimes they have an anchor tag wrapped around them. On the odd occasion, text that is not part of the link even gets appended without any spacing.

The Article ID and Article Key are different in every link. The Article Key, however, always ends with a digit. If this is possible I sure could use the help. Thanks

Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941

http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566    

<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392</a>

<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>

http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.

In the end I am just looking for the URL.

http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736
Tim

2 Answers

DO NOT USE A REGEX! Use an XML parser...

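// loadHTMLFile() is an instance method, so create the document first
$dom = new DOMDocument();
$dom->loadHTMLFile($pathToFile);
$finder = new DOMXPath($dom);
// select every anchor that has an href attribute
$anchors = $finder->query('//a[@href]');

foreach ($anchors as $anchor) {
  $href = $anchor->getAttribute('href');
  // $regexToMatchUrls matches only the URLs you are looking for
  if (preg_match($regexToMatchUrls, $href)) {
    // do stuff
  }
}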

So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. Then you can take action when a match occurs.
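
As a minimal sketch of the approach above, assuming the page is already in a string and that `$regexToMatchUrls` only needs to recognise the ArticleDetails links from the question. The pattern, the `findArticleLinks` helper name, and the sample HTML are illustrative assumptions, not part of the original answer:

```php
<?php
// Hypothetical pattern for the ArticleDetails URLs in the question.
// DOMDocument decodes &amp; to & in attribute values, so a plain & is enough here.
$regexToMatchUrls = '~^http://(?:www\.)?example\.com/ArticleDetails\.aspx\?ArticleID=[\w-]+&AidKey=[\w-]*\d$~i';

// Hypothetical helper: returns the href of every anchor whose URL matches $pattern.
function findArticleLinks(string $html, string $pattern): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally; real-world markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $finder = new DOMXPath($dom);
    $links = [];
    foreach ($finder->query('//a[@href]') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match($pattern, $href)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample input (assumed): one matching anchor and one unrelated anchor.
$html = '<p><a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a> '
      . '<a href="http://example.com/other">other</a></p>';

$links = findArticleLinks($html, $regexToMatchUrls);
print_r($links);
```

Because the parser has already isolated each `href`, the pattern can be anchored with `^` and `$` and does not need to cope with surrounding markup. Plain-text links outside anchor tags are not found this way.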

prodigitalson
  • I will check it out. By the way, does it also find plain-text links that have no HREF wrapped around them? – Tim Aug 04 '11 at 22:04
  • No, this doesn't. It is only for anchor tags. If you don't need to analyze the anchor then you could/should use a regex. I thought you needed only URLs within an `<a>` type of context and then matching a given pattern... My mistake :-) – prodigitalson Aug 04 '11 at 22:33
  • Just a note: you could also do that with XPath by searching for text nodes containing the anchor text, but that would probably be more trouble than just using a regex... So bottom line: if you need to search for things where the HTML/XML gives the actual context, then use an XML parser. If you're just searching for a particular text string pattern throughout the entire document, then use a regex. – prodigitalson Aug 04 '11 at 22:37

This regex works for me:

/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/g

UPDATE: I added a \d at the end of the regex so that the match always ends with a digit, which drops any text that runs directly into the URL.

/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)\d/g

To use it in PHP, replace the /g flag, which PHP's preg functions do not accept, with /.../msi

PHP Example in action: http://ideone.com/N0TKM
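
A minimal sketch of how the updated pattern might be applied in PHP, assuming the page text is already in a string. The `extractArticleUrls` helper name and the sample input are illustrative assumptions. It uses `preg_match_all` in place of the `/g` flag and decodes `&amp;` so that both link styles yield the same form of URL:

```php
<?php
// Hypothetical helper: returns the unique article URLs found anywhere in $text.
function extractArticleUrls(string $text): array
{
    // The trailing \d forces each match to end on a digit,
    // which trims letters glued directly onto the end of the URL.
    $pattern = '/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(&amp;|&)AidKey=([\d\w-]*)\d/i';
    preg_match_all($pattern, $text, $matches);

    // Index 0 holds the full matches; decode &amp; to & and drop duplicates.
    $urls = array_map('html_entity_decode', $matches[0]);
    return array_values(array_unique($urls));
}

// Sample input (assumed), built from three of the formats in the question.
$text = <<<'TXT'
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">This is a link description</a>
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
TXT;

$urls = extractArticleUrls($text);
print_r($urls);
```

One limitation of this sketch: if the text glued onto the URL itself contains a digit, the match will extend up to that digit, since the pattern only knows that the key ends with a number.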

scube
  • I did not have any luck at first. I did, however, when I changed /g to /im (/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/im) – Tim Aug 04 '11 at 23:54
  • Thanks, all tested and going smooth. – Tim Aug 05 '11 at 15:27