Hello, I want to extract links such as
<a href="/portal/clients/show/entityId/2121" >
and I want a regex that gives me /portal/clients/show/entityId/2121
The number at the end (2121) is different in other links.
Any ideas?

Do you want to extract '2121' from '/portal/clients/show/entityId/2121' using regex? – halocursed Oct 05 '09 at 12:11
No, I want to extract '/portal/clients/show/entityId/2121'. Another link can have a different number at the end instead of 2121. Any ideas? – streetparade Oct 05 '09 at 12:13
6 Answers
Simple PHP HTML Dom Parser example:
// Create DOM from string
$html = str_get_html($links);
//or
$html = file_get_html('http://www.example.com/'); // the URL needs a scheme, otherwise it is treated as a local file path
foreach($html->find('a') as $link) {
echo $link->href . '<br />';
}

But I only want to extract /portal/clients/show/entityId/4636, so this worked: '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i' – streetparade Oct 05 '09 at 12:26
Don't use regular expressions for processing XML/HTML. This can be done very easily using the built-in DOM parser:
$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a/@href');
for ($i = 0; $i < $nodeList->length; $i++) {
# Xpath query for attributes gives a NodeList containing DOMAttr objects.
# http://php.net/manual/en/class.domattr.php
echo $nodeList->item($i)->value . "<br/>\n";
}
Regex for parsing links is something like this:
'/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i'
Given how horrible that is, I would recommend using Simple HTML DOM to get the links at least. You could then check the links using a very basic regex on each link's href.
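As a rough sketch of that last step, assuming the href values have already been collected into an array by whichever parser you use: a very basic pattern can keep only the links shaped like the one in the question. The helper name filterEntityLinks and the sample $hrefs values are made up for illustration.

```php
<?php
// Hypothetical helper: keep only hrefs shaped like /portal/clients/show/entityId/<digits>.
function filterEntityLinks(array $hrefs): array
{
    // "#" is used as the delimiter so the slashes in the path need no escaping.
    $pattern = '#^/portal/clients/show/entityId/\d+$#';
    // preg_grep returns the matching entries with their keys; array_values reindexes them.
    return array_values(preg_grep($pattern, $hrefs));
}

// Sample input, invented for this example.
$hrefs = [
    '/portal/clients/show/entityId/2121',
    '/portal/clients/list',
    '/portal/clients/show/entityId/4636',
];

$entityLinks = filterEntityLinks($hrefs);
print_r($entityLinks);
```

Because the pattern only has to validate an already extracted attribute value, it stays short and does not need to cope with HTML quoting rules.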

This worked for me: $patterndocumentLinks = '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i'; Thank you. – streetparade Oct 05 '09 at 12:25
@streetparade You probably want to avoid including the quotes bounding the attribute values in your captured values, so adjust the regex capture parens accordingly: '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href="([^"]+)"|\'[^\']+\'|[^<>\s]+/i' – Adam Friedman Aug 28 '14 at 16:56
When "parsing" HTML I mostly rely on PHPQuery (http://code.google.com/p/phpquery/) rather than regex.

This is my solution:
<?php
// get links
$website = file_get_contents("http://www.example.com"); // download contents of www.example.com
// Use explicit "/" delimiters: in the pattern "<a href=...>" PHP would treat "<" and ">" as the delimiters themselves.
preg_match_all('/<a\s+href="(.+?)"/i', $website, $matches); // capture group 1 holds each href value without quotes
// output all matches
print_r($matches[1]);
?>
I recommend avoiding XML-based parsers, because you cannot always know whether the document/website is well-formed.
Best regards
Parsing links from HTML can be done using an HTML parser.
When you have all the links, simply get the index of the last forward slash, and you have your number. No regex needed.
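A minimal sketch of the last-slash idea, assuming PHP 7 or later and an href that has already been extracted by a parser. The helper name lastPathSegment is hypothetical.

```php
<?php
// Hypothetical helper: return everything after the last "/" in an href.
function lastPathSegment(string $href): string
{
    // strrpos gives the position of the last occurrence, or false if there is none.
    $pos = strrpos($href, '/');
    // No slash at all: treat the whole string as the segment.
    return $pos === false ? $href : substr($href, $pos + 1);
}

$id = lastPathSegment('/portal/clients/show/entityId/2121');
echo $id, "\n";
```

Note that this returns whatever follows the last slash, so an href ending in "/" yields an empty string. If the value must be numeric, a check such as ctype_digit($id) can be added.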
