I am using the PHP Simple HTML DOM parser to scrape website data, but I am unable to extract the data I want. I have searched Google and read the documentation but could not solve the issue. The markup I am trying to scrape looks something like this:

<div id="section1">
   <h1>Some content</h1>
   <p>Some content</p>
   ............
    <<Not fixed number of element>>
   ............
   <script> <<Some script>></script>
   <video>
     <source src="www.exmple.com/34/exmple.mp4">
   </video>
</div>

I tried it with JavaScript, and this works:

document.getElementById("section1").getElementsByTagName("source")[0].getAttribute("src");

But when I try the same with the PHP DOM parser, I get no data. Here is my code:

require $_SERVER['DOCUMENT_ROOT'] . '/../lib/simplehtmldom/simple_html_dom.php';

$html_content = get($url); // get() is a custom cURL function that returns the website content.
$obj_content = str_get_html($html_content);
$linkURL = $obj_content->getElementById('section1')->find('source', 0)->getAttribute('src');
var_dump($linkURL);

This results in an empty string. I also tried changing the code in various ways, but every attempt came back blank. However, if I var_dump $obj_content, I get a lot of DOM elements.

I tried to follow these Stack Overflow posts, which are similar to my problem, but they did not help:

  1. How do I get the HTML code of a web page in PHP?
  2. PHP Simple HTML DOM
  3. PHP Simple HTML DOM Parser Call to a member function children() on a non-object
  4. And their manual http://simplehtmldom.sourceforge.net/manual.htm

Can anyone please help me?

Thank you

user7747472
  • Is that part of the HTML added dynamically after page load? – WillardSolutions Aug 13 '18 at 16:48
  • No, the page loads once. Nothing is added dynamically after that. – user7747472 Aug 13 '18 at 16:52
  • So if you var_dump whatever is returned from your cURL request, do you see this source tag with a value in the src attribute? – WillardSolutions Aug 13 '18 at 16:55
  • If I var_dump the cURL response, I see the complete page. – user7747472 Aug 13 '18 at 16:57
  • OK then - look at the HTML from the var_dump, find the #section1 > source[0] path, and see if there's a value in the src attribute. – WillardSolutions Aug 13 '18 at 17:11
  • Can you share the URL you're trying with? – Nigel Ren Aug 13 '18 at 17:55
  • This works: `$dom->getElementById('section1')->find('video', 0)->find('source', 0)->getAttribute('src');` The key is to find the parent ` – drew010 Aug 13 '18 at 18:04
  • @WillardSolutions, you were correct. The source file URL that I am trying to fetch is actually injected by the JS script that sits above the video tag. By extracting the content of the script tag and stripping it down, I got the URL I wanted. – user7747472 Aug 21 '18 at 08:29

1 Answer

The code snippet is fine as it is. The problem was that the URL I was targeting was not present in the page at load time. It was added by the <script> tag after the page had loaded.
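
For anyone hitting the same problem, here is a minimal sketch of pulling the URL out of the script text instead of the <source> tag. The real script body is not shown here, so the sample markup, the `player` variable, the example URL, and the helper name `findVideoUrlInScripts` are all assumptions for illustration. It also uses PHP's built-in DOMDocument rather than Simple HTML DOM, only so that the snippet runs without extra libraries.

```php
<?php
// Hypothetical sample standing in for the cURL response. The shape of the
// script body is an assumption; adapt the pattern to the real page.
$html = <<<'HTML'
<div id="section1">
  <h1>Some content</h1>
  <script>var player = { file: "http://www.example.com/34/example.mp4" };</script>
  <video></video>
</div>
HTML;

/**
 * Return the first .mp4 URL found in any <script> inside the element with
 * the given id, or null when nothing matches.
 * Note: $containerId is placed into the XPath as-is, so it must not
 * contain double quotes.
 */
function findVideoUrlInScripts(string $html, string $containerId): ?string
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $scripts = $xpath->query(sprintf('//*[@id="%s"]//script', $containerId));

    foreach ($scripts as $script) {
        // Look for an absolute http(s) URL ending in .mp4 in the script text.
        if (preg_match('~https?://[^\s"\']+\.mp4~i', $script->textContent, $m)) {
            return $m[0];
        }
    }
    return null;
}

var_dump(findVideoUrlInScripts($html, 'section1'));
```

A regular expression is a blunt tool here: if the script builds the URL from several parts or fetches it at runtime, this approach will not find it and a headless browser would probably be needed instead.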

Thank you @WillardSolutions

user7747472