
I've written a PHP script to parse the link for each state listed under the heading "High School Directory by State" in a table at this URL. My first function, fetch_item_links(), extracts those links correctly. What I want to do now is pass those URLs to fetch_info() so that it parses the red-colored link from each target page.

The second function also works flawlessly when I test it with any individual URL, such as this one.

However, when I try to run the whole script, I don't get any output. No error either.

This is my try so far:

<?php
$url = 'http://www.directoryofschools.com/high-schools/US.html';
$prefix = 'http://www.directoryofschools.com';

function fetch_item_links($link,$base)
{
    $html_doc = new DOMDocument();
    @$html_doc->loadHtmlFile($link);    
    $content_xpath = new DOMXPath($html_doc);
    $item_row = $content_xpath->query('//*[@class="online_college_list"]//tr//td//a[@title]');
    $packtBook = array();
    for ($i=0; $i <$item_row->length; $i++){
        $title = $item_row->item($i)->getAttribute('href') . "<br/>";
        $string = $base . str_replace("..", "", $title);
        $packtBook[] = $string;
    }
    return $packtBook;
}

function fetch_info($link)
{
    $html_doc = new DOMDocument();
    @$html_doc->loadHtmlFile($link);    
    $content_xpath = new DOMXPath($html_doc);
    $item_row = $content_xpath->query('//*[@class="online_college_list"]//tr//td//a[@title]');
    for ($i=0; $i <$item_row->length; $i++){
        $title = $item_row->item($i)->getAttribute('href') . "<br/>";
        echo $title;
    }
}
$items = fetch_item_links($url,$prefix);
foreach($items as $file){
    fetch_info($file);
}
?>

How can I make my script functional?

robots.txt
  • You probably have some errors with the `loadHtmlFile()` function but you cannot see them because you use the [error control operator `@`](https://www.php.net/manual/en/language.operators.errorcontrol.php). Try to remove it and see which errors are displayed. – Ugo T. May 29 '19 at 16:54

1 Answer


You're appending <br/> to the URL in fetch_item_links, which means you won't be able to load it via loadHtmlFile(). Change the line to

$title = $item_row->item($i)->getAttribute('href');

In fact, in both places, it might be better to remove the <br/>, and only append it to the string when you're echoing it.
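As a minimal sketch of that suggestion, the snippet below keeps each URL clean while collecting it and appends `<br/>` only when printing. The helper name `extract_links()` and the inline sample HTML are assumptions made for illustration so the example runs without a network request; the XPath expression and the `str_replace("..", "", ...)` step are taken from the question.

```php
<?php
// Sketch: build clean URLs, and add "<br/>" only at output time.
// extract_links() and $sample are hypothetical, used here for illustration.
function extract_links(string $html, string $base): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $links = [];
    // Same XPath as in the question: anchors with a title inside the table.
    foreach ($xpath->query('//*[@class="online_college_list"]//tr//td//a[@title]') as $a) {
        $href = $a->getAttribute('href');               // no "<br/>" appended here
        $links[] = $base . str_replace('..', '', $href); // turn "../x" into base + "/x"
    }
    return $links;
}

// Stand-in for the state table on the real page.
$sample = '<table class="online_college_list"><tr><td>'
        . '<a title="Alabama" href="../high-schools/AL.html">Alabama</a>'
        . '</td></tr></table>';

$links = extract_links($sample, 'http://www.directoryofschools.com');

foreach ($links as $link) {
    echo $link . "<br/>\n"; // presentation markup is added only when echoing
}
```

Because the returned strings are plain URLs, each one could then be handed directly to a loader such as `loadHTMLFile()`.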

aynber
  • Right you are @aynber. Your suggestion seemed to have fixed the issue already. – robots.txt May 29 '19 at 16:57
  • A quick question: when I remove the `@` from `@$html_doc` as suggested by @Ugo T., I get the desired links along with warnings such as `Warning: DOMDocument::loadHTMLFile(): Opening and ending tag mismatch: tr and tbody in http://www.directoryofschools.com/high-schools/US.html, line: 54 in C:\xampp\htdocs\PHP\test.php on line 8`. How can I get rid of these warnings as well? Thanks in advance @aynber. – robots.txt May 30 '19 at 16:36
  • That's a good question. It sounds like the HTML on the other side is not well formed, so it's a bit tricky. Try this link: https://stackoverflow.com/questions/1148928/disable-warnings-when-loading-non-well-formed-html-by-domdocument-php and use `libxml_use_internal_errors(true);` – aynber May 30 '19 at 16:41
  • Wish I could upvote your solution several times. Yes, it solved the issue as well. – robots.txt May 30 '19 at 17:01
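
The `libxml_use_internal_errors(true)` approach mentioned in the comments can be sketched roughly as follows. The malformed sample markup is an assumption chosen to resemble the warning quoted above. Which parse problems get recorded, and how many, depends on the installed libxml version, so the example only reports the count.

```php
<?php
// Sketch: have libxml collect parse problems internally instead of
// emitting PHP warnings, without resorting to the "@" operator.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Deliberately malformed sample (stray </tbody>), used for illustration only.
$doc->loadHTML('<table><tr><td><a title="x" href="/a.html">x</a></td></tbody></table>');

$errors = libxml_get_errors();          // array of LibXMLError objects, possibly empty
libxml_clear_errors();                  // free the internal error buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

$xpath = new DOMXPath($doc);
$count = $xpath->query('//a[@title]')->length;

echo "links found: {$count}, parse problems recorded: " . count($errors) . "\n";
```

Unlike `@`, this keeps the parse problems available for inspection through `libxml_get_errors()` instead of discarding them silently.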