PHP save inner html of p tag, only 1 p tag on page

Question

I am trying to get the inner html of a <p> tag and save it as a .txt file. It is a very simple page; there is only one <p> on it. I tried using getElementsByTagName('p') as per: Using PHP to get DOM Element. Unfortunately, it didn't work for me, but maybe I'm missing something. My code is:

<?php
$dataPage = file_get_contents('http://www.somedataurl.com');
$doc = new DOMDocument;
$doc->loadHTML($dataPage);

$dataNodeList = $doc->getElementsByTagName('p');
$dataNode = $dataNodeList->item(0);

function innerHTML($node) {
    return implode(array_map([$node->ownerDocument, "saveHTML"],
            iterator_to_array($node->childNodes)));
}

$theData = innerHTML($dataNode);

header('Content-Type: text/plain');
$filename = date('Y-m-d') . '.txt';
file_put_contents($filename, $theData);

The error log is giving me:

PHP Notice: Undefined property:: DOMNodeList (line 10)

PHP Notice: Undefined property:: DOMNodeList (line 11)

PHP Catchable fatal error (line 11)

These errors sound rather alarming, especially the last one.

Question: Is there a better tool I can use other than getElementsByTagName() since I am only dealing with one <p>? Or can this way work if I adjust a few things?

To directly answer your question; I'd say you're using the most efficient method right now. — Phil, May 16 '18 at 02:55
@Phil I know the `<p>` tags are there, but judging by "Undefined property" I think my script might not be finding them for some reason. My only other hunch was a data structure error, DOMNodeList vs. node – Arash Howaida, May 16 '18 at 03:05
How about doing some simple debugging? Eg `var_dump($dataNodeList->length)` or even `var_dump($doc->saveHTML())` to make sure you're getting the document you expect — Phil, May 16 '18 at 03:07
@Phil It's on a remote server and I have to use cron to run it. My IDE doesn't run PHP, and the debugging learning curve is tough with cron. I typically can only tell if it works by whether a txt file is created or not. I have had it working in the past using `xpath` for finding elements by class name, but this new page only has a `<p>` with no class. – Arash Howaida, May 16 '18 at 03:11
So write to the error log ~ http://php.net/manual/function.error-log.php — Phil, May 16 '18 at 03:11
Also, try running it locally in a docker container or VM or something. It's quite easy these days to get a local PHP development environment up and running — Phil, May 16 '18 at 03:21
How about even just using `curl` or Postman to check the HTML returned. Make sure the `<p>` tag you think is there isn't coming from JavaScript – Phil, May 16 '18 at 03:25
@Phil Well you won't believe this, but, as per your suggestion, when I added `error_log($dataNodeList->length, 0);` it worked, seemingly out of the blue. Text file created with the data. Error log read: 1, meaning the list had found the p. So weird. — Arash Howaida, May 16 '18 at 04:33
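
The debugging suggested in the comments can be sketched roughly as below. This is only an illustration, not the asker's final script: the helper name `firstParagraphInnerHtml` and the inline sample markup are hypothetical, and the fetched page is replaced by a literal string so the snippet runs on its own. It logs the node count with `error_log()` and returns `null` instead of passing a missing node on to the inner-HTML step.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <p>,
// or null when the markup is empty, unparseable, or has no <p>.
function firstParagraphInnerHtml(string $html): ?string
{
    if ($html === '') {
        error_log('Empty document, nothing to parse');
        return null;
    }

    $doc = new DOMDocument();

    // Keep libxml parse warnings out of the PHP error log for messy HTML.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        error_log('loadHTML() failed');
        return null;
    }

    $paragraphs = $doc->getElementsByTagName('p');
    error_log('Number of <p> elements found: ' . $paragraphs->length);

    // item(0) returns null when the list is empty.
    $node = $paragraphs->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child node to build the inner HTML.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Sample markup standing in for the fetched page (an assumption).
$sample = '<html><body><p>Hello <b>world</b></p></body></html>';
$theData = firstParagraphInnerHtml($sample);
```

In a cron job, the logged count shows whether the `<p>` was found at all, and the `null` return lets the script skip writing the file rather than fail on a missing node.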

score 0 · Answer 1 · answered May 16 '18 at 03:14

If there is only one `<p>` tag, I think you would be better off extracting its content with a regular expression.

Example:

preg_match("/<p>(.*?)<\/p>/is",$dataPage,$match);
print_r($match[1]);

Lane

That's assuming the `<p>` tag doesn't have any attributes and there are no newlines between the `<` and `>` characters of each opening and closing tag. Then there's this ~ https://stackoverflow.com/a/1732454/283366
– Phil May 16 '18 at 03:19
Also, OP's code should work just fine if there is in fact a `<p>` tag in the document
– Phil May 16 '18 at 03:22
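
To illustrate the attribute caveat raised in the comment above, here is a small sketch comparing the answer's pattern with a looser variant. The sample markup and the looser pattern are assumptions made for illustration, not something from the original answer. Even the looser pattern stays fragile for nested or malformed markup, so the DOM approach in the question is likely the more robust choice.

```php
<?php
// Hypothetical markup: the <p> carries an attribute and spans several lines.
$html = "<p class=\"data\">\nline one\nline two</p>";

// The pattern from the answer requires a bare "<p>", so it does not match here.
$strict = preg_match('/<p>(.*?)<\/p>/is', $html, $strictMatch);

// A looser variant that allows attributes after the tag name.
$loose = preg_match('/<p\b[^>]*>(.*?)<\/p>/is', $html, $looseMatch);

var_dump($strict);  // number of matches for the strict pattern
var_dump($loose);   // number of matches for the looser pattern
```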
