I am trying to get the inner HTML of a <p> tag and save it as a .txt file. It is a very simple page with only one <p> on it. I tried using getElementsByTagName('p'), as suggested in "Using PHP to get DOM Element". Unfortunately, it didn't work for me, but maybe I'm missing something. My code is:

<?php
$dataPage = file_get_contents('http://www.somedataurl.com');
$doc = new DOMDocument;
$doc->loadHTML($dataPage);

$dataNodeList = $doc->getElementsByTagName('p');
$dataNode = $dataNodeList->item(0);

function innerHTML($node) {
    return implode(array_map([$node->ownerDocument, "saveHTML"],
            iterator_to_array($node->childNodes)));
}

$theData = innerHTML($dataNode);

header('Content-Type: text/plain');
$filename = date('Y-m-d') . '.txt';
file_put_contents($filename, $theData);

The error log is giving me:

PHP Notice: Undefined property:: DOMNodeList (line 10)

PHP Notice: Undefined property:: DOMNodeList (line 11)

PHP Catchable fatal error (line 11)

These errors sound rather alarming, especially the last one.

Question: Is there a better tool I can use other than getElementsByTagName() since I am only dealing with one <p>? Or can this way work if I adjust a few things?

Phil
Arash Howaida
  • Have you confirmed that your script finds any `<p>` tags? – Phil May 16 '18 at 02:53
  • To directly answer your question; I'd say you're using the most efficient method right now. – Phil May 16 '18 at 02:55
  • @Phil I know the `<p>` tags are there, but judging by "Undefined property" I think my script might not be finding them for some reason. My only other hunch was a data structure error, DOMNodeList vs. node. – Arash Howaida May 16 '18 at 03:05
  • How about doing some simple debugging? Eg `var_dump($dataNodeList->length)` or even `var_dump($doc->saveHTML())` to make sure you're getting the document you expect – Phil May 16 '18 at 03:07
  • @Phil It's on a remote server and I have to use cron to run it. My IDE doesn't run PHP, and the debugging learning curve is tough with cron. I can typically only tell whether it works by whether a txt file is created. I have had it working in the past using `xpath` to find elements by class name, but this new page only has a `<p>` with no class. – Arash Howaida May 16 '18 at 03:11
  • So write to the error log ~ http://php.net/manual/function.error-log.php – Phil May 16 '18 at 03:11
  • Also, try running it locally in a docker container or VM or something. It's quite easy these days to get a local PHP development environment up and running – Phil May 16 '18 at 03:21
  • How about even just using `curl` or Postman to check the HTML returned? Make sure the `<p>` tag you think is there isn't coming from JavaScript. – Phil May 16 '18 at 03:25
  • @Phil Well you won't believe this, but, as per your suggestion, when I added `error_log($dataNodeList->length, 0);` it worked, seemingly out of the blue. Text file created with the data. Error log read: 1, meaning the list had found the p. So weird. – Arash Howaida May 16 '18 at 04:33
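
The debugging advice in the comments above can be sketched roughly as follows. This is a minimal sketch, not the asker's final script: the inline `$sample` string stands in for the page fetched with file_get_contents(), and the helper name `firstParagraphHtml` is hypothetical. It logs the number of <p> elements found and checks for a missing node before reading it, which is useful when the script only runs under cron.

```php
<?php
// Hypothetical helper: returns the inner HTML of the first <p>,
// or null if the document contains no <p> element.
function firstParagraphHtml(string $html): ?string
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $paragraphs = $doc->getElementsByTagName('p');

    // Write the match count to the error log so a cron run leaves a trace.
    error_log('p elements found: ' . $paragraphs->length);

    // item(0) returns null when the list is empty, so check before using it.
    $node = $paragraphs->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child node of the <p> to build its inner HTML.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Stand-in for the real page content (an assumption for illustration).
$sample = '<html><body><p>12.5, <b>13.1</b></p></body></html>';
$result = firstParagraphHtml($sample);
```

If `$result` is null, the fetched document did not contain a <p> at parse time, which would point to the page content (for example, markup generated by JavaScript) rather than to getElementsByTagName().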

1 Answer

If there is only one <p> tag, I think you would be better off extracting its content with a regular expression.

Example:

preg_match("/<p>(.*?)<\/p>/is",$dataPage,$match);
print_r($match[1]);
Lane
  • That's assuming the `<p>` tag doesn't have any attributes and there are no newlines between the `<` and `>` characters of each opening and closing tag. Then there's this ~ https://stackoverflow.com/a/1732454/283366 – Phil May 16 '18 at 03:19
  • Also, OP's code should work just fine if there is in fact a `<p>` tag in the document. – Phil May 16 '18 at 03:22
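
The limitation raised in the comments on this answer can be illustrated with a short sketch. The sample markup is an assumption chosen for illustration (a <p> carrying a class attribute); it compares the answer's pattern against the DOM approach from the question.

```php
<?php
// Sample markup is an assumption for illustration: the <p> has an attribute.
$html = '<html><body><p class="data">42</p></body></html>';

// The pattern from the answer only matches a bare "<p>" opening tag,
// so it finds nothing here and preg_match() returns 0.
$matched = preg_match("/<p>(.*?)<\/p>/is", $html, $match);

// The DOM parser locates the element regardless of its attributes.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$node = $doc->getElementsByTagName('p')->item(0);
$text = $node !== null ? $node->textContent : null;
```

With this input, `$matched` is 0 while `$text` holds the paragraph's text, which is why the DOM-based code in the question is the more robust choice when the markup is not fully under your control.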