I'm working on a PHP parser that parses my school's HTML 'groups' pages. Each page has a unique URL based on the name of the course and several other variables, and consists of a bunch of HTML <table> elements.

Loading the HTML from the URL works fine until it comes across a ) in the file's content. Then it just stops loading and only stores what it has read so far. Obviously, I didn't create the HTML being loaded, and there is no way I can prevent such characters from appearing in it.

However, it works fine when I run it locally using MAMP. I looked for answers, but haven't found anything that solved my problem.

How can I escape these characters before loading it?

My current PHP:

$dom = new DOMDocument;
libxml_use_internal_errors(true); // the HTML I parse contains a lot of unclosed tags; this keeps the errors from being displayed on the page
$dom->loadHTMLFile('http://isarog.hhs.nl/Web_Site/HHS/ICTM/Public/Iris_Roster/Timetables/11_2/11_2-CMD-4vt-p2.html');

echo $dom->getElementsByTagName('html')->item(0)->nodeValue;
Joey
  • AFAIK parentheses have no meaning in HTML. Are you sure there isn't something else? If you create an identical page without the parenthesis and load that instead, does it work? – Damien Pirsy Nov 21 '11 at 23:13
  • I have not tried that, but when I echo the `<html>` tag's `nodeValue`, it shows everything up until that parenthesis comes into play. – Joey Nov 21 '11 at 23:15
  • Well, try that. What's there beyond the parenthesis? – Damien Pirsy Nov 21 '11 at 23:17
  • The HTML URL: http://isarog.hhs.nl/Web_Site/HHS/ICTM/Public/Iris_Roster/Timetables/11_2/11_2-CMD-4vt-p2.html A fragment of where the problem happens: `.. Senad Mato):evic ..`. When printed out, it displays: `.. Senad Mato`. I'll try the identical page now. – Joey Nov 21 '11 at 23:21
  • Don't you see there's a character between "Mato" and the parenthesis? Use the right encoding while loading the file; that could be what's causing the problems. – Damien Pirsy Nov 21 '11 at 23:23
  • I tried the identical file and the parenthesis is definitely what is causing it to not load properly. When removed, it works fine. I don't see a character between "Mato" and the parenthesis. – Joey Nov 21 '11 at 23:29
  • I do, though I cannot see what character it is; and not only there. Try again, not by removing the parenthesis but by deleting "Mato)" and retyping it: the parenthesis should work fine then. Either the parenthesis is not the "regular" one, or the "o" isn't, or there is another character in between. – Damien Pirsy Nov 21 '11 at 23:31
  • I've rewritten it and it seems to work fine now. Is there any way I can solve this problem with PHP without manually rewriting any HTML? – Joey Nov 21 '11 at 23:38

1 Answer

This question solved my problem: Remove control characters from php String

Apparently there was an invisible character in my HTML input that was causing the load function to stop reading. The following cleared it all up:

$str = file_get_contents('http://isarog.hhs.nl/Web_Site/HHS/ICTM/Public/Iris_Roster/Timetables/11_2/11_2-CMD-4vt-p2.html');
$str = mb_convert_encoding($str, 'utf-8', mb_detect_encoding($str));

// Strip ASCII control characters, but keep tab, LF and CR so that line breaks inside the markup survive
$str = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/', '', $str);

$dom = new DOMDocument;
libxml_use_internal_errors(true); // ignore all those markup syntax errors
$dom->loadHTML($str);
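
To confirm which invisible character is actually present before stripping anything, something along these lines may help. It is only a sketch: `findControlBytes` is a hypothetical helper name, and the `0x1A` byte in the sample string is made up for illustration, since the real offending byte in the timetable page was never identified. The helper reports each ASCII control byte (other than tab, LF and CR) together with its byte offset.

```php
<?php
// Hypothetical helper (not part of the original answer): list every ASCII
// control byte in a string, other than tab, LF and CR, with its byte offset.
function findControlBytes(string $html): array
{
    $found = [];
    $pattern = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/';
    // PREG_OFFSET_CAPTURE makes each match a [matchedText, byteOffset] pair.
    if (preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$byte, $offset]) {
            $found[] = [
                'offset' => $offset,
                'hex'    => sprintf('0x%02X', ord($byte)),
            ];
        }
    }
    return $found;
}

// Made-up sample: an invisible 0x1A byte hidden in front of the ")".
$sample = "Senad Mato\x1A):evic";
foreach (findControlBytes($sample) as $hit) {
    echo "control byte {$hit['hex']} at offset {$hit['offset']}\n";
}
```

Running it on the string returned by `file_get_contents()` should show where the parser is likely to give up, which makes it easier to check that the cleanup removes only what it needs to.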
Joey