Preg_match_all

Question

Hello, I want to extract links such as <a href="/portal/clients/show/entityId/2121" > and I want a regex which gives me /portal/clients/show/entityId/2121. The number at the end (2121) is different in other links. Any ideas?

do you want to extract '2121' from '/portal/clients/show/entityId/2121' using regex? — halocursed, Oct 05 '09 at 12:11
No, I want to extract '/portal/clients/show/entityId/2121'. Another link can have a different number at the end instead of 2121. Any ideas? — streetparade, Oct 05 '09 at 12:13

score 11 · karim79 · Answer 1 · 2009-10-05T12:29:32.263

Simple PHP HTML Dom Parser example:

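// Requires the Simple HTML DOM library (assumed here to be in simple_html_dom.php)
include 'simple_html_dom.php';

// Create DOM from string
$html = str_get_html($links);

// or load it from a URL (the scheme is required)
$html = file_get_html('http://www.example.com/');

foreach ($html->find('a') as $link) {
    echo $link->href . '<br />';
}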

edited Oct 05 '09 at 12:29

answered Oct 05 '09 at 12:19

karim79

this would give that as result " – streetparade Oct 05 '09 at 12:26
But I just want to extract /portal/clients/show/entityId/4636, so this worked: '/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i' – streetparade Oct 05 '09 at 12:26

score 6 · Answer 2 · edited May 23 '17 at 12:19

Don't use regular expressions for processing XML/HTML. This can be done very easily using the built-in DOM parser:

$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a/@href');
for ($i = 0; $i < $nodeList->length; $i++) {
    # Xpath query for attributes gives a NodeList containing DOMAttr objects.
    # http://php.net/manual/en/class.domattr.php
    echo $nodeList->item($i)->value . "<br/>\n";
}

score 1 · Accepted Answer · answered Oct 05 '09 at 12:20

Regex for parsing links is something like this:

'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'

Given how horrible that is, I would recommend using Simple HTML Dom for getting the links at least. You could then check links using some very basic regex on the link href.
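As a rough sketch of the "very basic regex on the link href" idea: assuming the hrefs have already been collected by a parser, preg_grep() can keep only those that look like the links from the question. The $hrefs array below is made-up sample data, and the pattern assumes the ID is always a run of digits at the end of the path.

```php
<?php
// Hypothetical sample data; in practice these would come from the HTML parser.
$hrefs = [
    '/portal/clients/show/entityId/2121',
    '/portal/clients/show/entityId/4636',
    '/portal/clients/list',
    'http://www.example.com/',
];

// Keep only hrefs of the form /portal/clients/show/entityId/<digits>.
// preg_grep() preserves the original keys, so reindex with array_values().
$clientLinks = array_values(
    preg_grep('~^/portal/clients/show/entityId/\d+$~', $hrefs)
);

print_r($clientLinks);
```

Using ~ as the delimiter avoids having to escape every forward slash in the path.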

answered Oct 05 '09 at 12:20

Yacoby

This worked for me: $patterndocumentLinks = '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i'; thank you – streetparade Oct 05 '09 at 12:25
@streetparade You probably want to avoid including the quotes bounding the attribute values in your captured values; thus, adjust the regex capture parens accordingly: '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href="([^"]+)"|\'[^\']+\'|[^<>\s]+/i' – Adam Friedman Aug 28 '14 at 16:56

score 1 · Answer 4 · answered Oct 05 '09 at 12:24

When "parsing" HTML I mostly rely on phpQuery (http://code.google.com/p/phpquery/) rather than regex.

answered Oct 05 '09 at 12:24

Max

score 1 · Answer 5 · 2013-11-03T18:31:44.177

This is my solution:

<?php
// get links
$website = file_get_contents("http://www.example.com"); // download contents of www.example.com
// capture the href value of each <a href="..."> tag; ~ is the pattern delimiter, \x22 = "
preg_match_all("~<a href=\x22(.+?)\x22~", $website, $matches);

// $matches[1] already holds only the captured href values, so no cleanup is needed
// output all matches
print_r($matches[1]);
?>

I recommend avoiding XML-based parsers, because you cannot always know whether the document/website is well formed.

Best regards

score 0 · Answer 6 · answered Oct 05 '09 at 12:10

Parsing links from HTML can be done using an HTML parser.

When you have all the links, simply get the index of the last forward slash; everything after it is your number. No regex needed.
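A minimal sketch of the last-slash approach, assuming the href has already been extracted (the $href value is simply the example from the question):

```php
<?php
// Example href taken from the question.
$href = '/portal/clients/show/entityId/2121';

// strrpos() returns the index of the last '/', or false if there is none.
$pos = strrpos($href, '/');

// Everything after the last slash is the ID; null if the href has no slash.
$id = ($pos === false) ? null : substr($href, $pos + 1);

echo $id, "\n";
```

Note that this does not check that the trailing part is actually numeric; ctype_digit($id) could be added if that matters.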

answered Oct 05 '09 at 12:10

Bart Kiers

hmm.. $html->find('href') or what? – streetparade Oct 05 '09 at 12:11
I don't know. Where does this find(...) come from? – Bart Kiers Oct 05 '09 at 12:42
