I'm kinda stuck here.
I have this pattern:
<a class="title" href="showthread.php?t=XXXXX" id="thread_title_XXX">DATADATA</a>
I know that all the data in my string (a web page) is stored in this format, with the 'unique signature' I just wrote. The number of X's is dynamic, probably somewhere between 2 and 12 digits (each X is a digit).
I can write a long expression to find the whole line, but I want to extract the data, not the whole thing.
How can I do it? An example would be appreciated.
Thank you!
1

Mark Segal
3 Answers
4
Forget about regular expressions; they're not meant to parse formats like HTML, especially when an actual parser for the format already exists.
Find the nodes using XPath:
$html = <<<EOT
<html>
Some html
<a class="title" href="showthread.php?t=XXXXX" id="thread_title_XXX">DATADATA</a>
</html>
EOT;
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[starts-with(@href, "showthread.php")]') as $node) {
// ...
}
Then extract the data using substr, strpos and parse_str:
$href = $node->getAttribute('href');
parse_str(substr($href, strpos($href, '?')+1), $query);
$t = $query['t'];
$id = $node->getAttribute('id');
$title = substr($id, strlen('thread_title_'));
$data = $node->nodeValue;
var_dump($t, $title, $data);
You get:
string(5) "XXXXX"
string(3) "XXX"
string(8) "DATADATA"

netcoder
I love virtually everything about this answer. I might perform the surgery on the extracted strings a little differently though. Probably better to use a url parser on the `href` value. My take on your snippet: https://3v4l.org/EGPlC Answers that avoid parsing html with regex should always have more upvotes than regex parsing answers. – mickmackusa Sep 17 '20 at 22:22
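The comment's suggestion of a URL parser for the `href` value could look roughly like the sketch below. It is not the code behind the linked snippet; `extractThreads` is a hypothetical helper name, the sample HTML is made up, and the extra `@class="title"` filter is an assumption based on the pattern in the question.

```php
<?php
// Hypothetical helper (not from the answer above): returns the thread ID
// and link text for every matching <a> element in an HTML string.
function extractThreads(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // since real-world pages are rarely valid HTML.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $threads = [];
    $query = '//a[@class="title"][starts-with(@href, "showthread.php")]';
    foreach ($xpath->query($query) as $node) {
        // parse_url isolates the query string ("t=45343");
        // parse_str turns it into an associative array.
        $queryString = parse_url($node->getAttribute('href'), PHP_URL_QUERY);
        parse_str((string) $queryString, $params);
        $threads[] = [
            't'    => $params['t'] ?? null,
            'data' => $node->textContent,
        ];
    }
    return $threads;
}

// Made-up sample input for illustration.
$html = '<html><body>'
      . '<a class="title" href="showthread.php?t=45343" id="thread_title_45343">First thread</a>'
      . '<a class="title" href="showthread.php?t=7777" id="thread_title_7777">Second thread</a>'
      . '</body></html>';
$threads = extractThreads($html);
print_r($threads);
```

Using `parse_url` with `PHP_URL_QUERY` avoids the manual `strpos`/`substr` arithmetic and keeps working if the link later gains a fragment or a different path.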
3
Try this:
$parsed_str = '<a class="title" href="showthread.php?t=45343" id="thread_title_XXX">DATADATA</a><a class="title" href="showthread.php?t=466666" id="thread_title_XXX">DATADATA</a> fasdfasdfsdfasd gfgfkgbc 04034kgs <fdfd> dfs</fdfa> <a class="title" href="showthread.php?t=7777" id="thread_title_XXX">DATADATA</a>';
preg_match_all('/\?t=(\d{2,12})/', $parsed_str, $result); // $result[1] holds the captured thread IDs
print_r($result);
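To make the shape of the result concrete, here is a minimal sketch of reading the captured IDs back out. The input string and the variable names `$html` and `$ids` are made up for illustration; with the default `PREG_PATTERN_ORDER` flag, index 0 holds the full matches and index 1 holds the first capture group.

```php
<?php
// Made-up sample input: two thread links separated by filler text.
$html = '<a class="title" href="showthread.php?t=45343" id="thread_title_1">A</a>'
      . ' filler '
      . '<a class="title" href="showthread.php?t=7777" id="thread_title_2">B</a>';

// Capture 2 to 12 digits following "?t=".
$count = preg_match_all('/\?t=(\d{2,12})/', $html, $result);

// $result[0] holds the full matches, $result[1] the captured digits.
$ids = $result[1];
print_r($ids);
```

Note that this only retrieves the thread IDs; the link text (the DATADATA part) needs a second capture group or an HTML parser.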

Andrej
2
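What exactly do you want to do? Get the XXXXX signature, or all the links?
Try this; it captures both the signature and the data:
<?php
$S = '<a class="title" href="showthread.php?t=1234567" id="thread_title_XXX">DATADATA</a>';
// [^>]* and (.*?) keep each match inside a single <a> element; \d+ captures the thread ID.
$pattern = '!<a[^>]*href="showthread\.php\?t=(\d+)"[^>]* id="[^"]*">(.*?)</a>!';
echo "<pre>";
print_r(preg_match($pattern, $S, $res)); // 1 on a match, 0 otherwise
print_r($res); // $res[1] is the thread ID, $res[2] is the link text
echo "</pre>";
?>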
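If the page contains several such links, a single `preg_match` call only returns the first one. A possible extension, sketched here under the assumption that the links never contain nested tags, uses `preg_match_all` with `PREG_SET_ORDER`. The sample string and the names `$page` and `$threads` are made up for illustration.

```php
<?php
// Made-up sample input with two links on one line.
$page = '<a class="title" href="showthread.php?t=45343" id="thread_title_45343">First</a>'
      . ' filler '
      . '<a class="title" href="showthread.php?t=7777" id="thread_title_7777">Second</a>';

// Group 1: the thread ID (2 to 12 digits). Group 2: the link text.
$pattern = '!<a[^>]*href="showthread\.php\?t=(\d{2,12})"[^>]*>(.*?)</a>!';

// PREG_SET_ORDER yields one array per match: [full match, group 1, group 2].
preg_match_all($pattern, $page, $matches, PREG_SET_ORDER);

$threads = [];
foreach ($matches as $m) {
    $threads[] = ['t' => $m[1], 'data' => $m[2]];
}
print_r($threads);
```

The non-greedy `(.*?)` and the `[^>]*` classes are what stop one match from swallowing the next link; with greedy `.*` both links would collapse into a single match.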

ZigZag