1

I'm kinda stuck here.
I have this pattern:
<a class="title" href="showthread.php?t=XXXXX" id="thread_title_XXX">DATADATA</a>
I know that all the data in my string (a web page) is stored in this format, and every entry carries the 'unique signature' shown above. The number of X's is dynamic, probably somewhere between 2 and 12 digits (each X is a digit).
I can write a long expression to match the whole line, but I want to extract just the data, not the whole thing.

How can I do it? An example would be appreciated.
Thank you!

Mark Segal

3 Answers

4

Forget about regular expressions. They're not meant to parse formats like HTML, especially when an actual parser for it already exists.

Find the nodes using XPath:

$html = <<<EOT

<html>
Some html
<a class="title" href="showthread.php?t=XXXXX" id="thread_title_XXX">DATADATA</a>
</html>

EOT;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[starts-with(@href, "showthread.php")]') as $node) {
    // ...
}

Then extract the data using substr, strpos and parse_str:

$href = $node->getAttribute('href');
parse_str(substr($href, strpos($href, '?')+1), $query);
$t = $query['t'];

$id = $node->getAttribute('id');
$title = substr($id, strlen('thread_title_'));

$data = $node->nodeValue;

var_dump($t, $title, $data);

You get:

string(5) "XXXXX"
string(3) "XXX"
string(8) "DATADATA"
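
As a variation on the extraction step, here is a minimal sketch that puts the two snippets together and uses parse_url() to isolate the query string instead of substr() and strpos(). The function name extractThreads, the array keys, and the numeric ids in the sample HTML are made up for illustration; they are not part of the original question.

```php
<?php
// Hypothetical helper (name and return shape are assumptions):
// returns one ['t' => ..., 'id' => ..., 'data' => ...] entry per matching link.
function extractThreads(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $threads = [];
    foreach ($xpath->query('//a[starts-with(@href, "showthread.php")]') as $node) {
        // parse_url() returns just the query part ("t=12345"), or null if there is none.
        $queryString = parse_url($node->getAttribute('href'), PHP_URL_QUERY);
        parse_str((string) $queryString, $query);

        $threads[] = [
            't'    => $query['t'] ?? null,
            'id'   => substr($node->getAttribute('id'), strlen('thread_title_')),
            'data' => $node->nodeValue,
        ];
    }
    return $threads;
}

// Sample input with made-up ids in place of the X placeholders.
$html = '<html><body><a class="title" href="showthread.php?t=12345" id="thread_title_678">DATADATA</a></body></html>';
print_r(extractThreads($html));
```

Returning an array keeps the parsing in one place, so the caller can loop over the results without touching the DOM.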
netcoder
  • I love virtually everything about this answer. I might perform the surgery on the extracted strings a little differently though. Probably better to use a url parser on the `href` value. My take on your snippet: https://3v4l.org/EGPlC Answers that avoid parsing html with regex should always have more upvotes than regex parsing answers. – mickmackusa Sep 17 '20 at 22:22
3

Try this:

$parsed_str = '<a class="title" href="showthread.php?t=45343" id="thread_title_XXX">DATADATA</a><a class="title" href="showthread.php?t=466666" id="thread_title_XXX">DATADATA</a> fasdfasdfsdfasd gfgfkgbc  04034kgs <fdfd> dfs</fdfa> <a class="title" href="showthread.php?t=7777" id="thread_title_XXX">DATADATA</a>';
// Capture the 2 to 12 digits that follow "?t="; the ids end up in $result[1].
preg_match_all('/\?t=(\d{2,12})/', $parsed_str, $result);
print_r($result);
Andrej
2

What exactly do you want to do? Get the XXXXX signature, or all the links?

Try this. It captures both the signature and the data:

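<?php
$S = '<a class="title" href="showthread.php?t=1234567" id="thread_title_XXX">DATADATA</a>';
// Group 1 is the thread number, group 2 is the link text.
// \d+, [^"]* and the lazy .*? keep the match from running past the end of the tag.
$pattern = '!<a[^>]*href="showthread\.php\?t=(\d+)"[^>]* id="[^"]*">(.*?)</a>!';

echo "<pre>";
print_r(preg_match($pattern, $S, $res)); // 1 if the pattern matched, 0 otherwise
print_r($res);
echo "</pre>";
?>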
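
For a page that contains several such links, a sketch along the same lines could use preg_match_all() with PREG_SET_ORDER. The sample string and the variable names below are made up for illustration, and the pattern assumes the attribute order shown in the question.

```php
<?php
// Made-up sample: two links with some unrelated text in between.
$page = '<a class="title" href="showthread.php?t=1234567" id="thread_title_11">First</a> text '
      . '<a class="title" href="showthread.php?t=89" id="thread_title_22">Second</a>';

// Group 1: the thread number (2 to 12 digits); group 2: the link text.
$pattern = '!<a[^>]*href="showthread\.php\?t=(\d{2,12})"[^>]*>(.*?)</a>!';

// PREG_SET_ORDER yields one [full match, t, data] array per link.
preg_match_all($pattern, $page, $matches, PREG_SET_ORDER);

$threads = [];
foreach ($matches as $m) {
    $threads[] = ['t' => $m[1], 'data' => $m[2]];
}
print_r($threads);
```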
ZigZag