0

I'm trying to scrape a webpage using PHP Simple HTML DOM Parser (phpsimpledom).

$html = str_get_html('<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div>');
$name = $html->find('div[class="u"]', 0)->innertext;
$age = $html->find('div[class="u"]', 1)->innertext;

I tried my best to get the text from each class="u" element, but it didn't work because the first <div class="u"> is missing its closing </div> tag. Can anyone help me with this?

Barmar

1 Answer

1

You can find a token near the spot where the tag should have been closed and normalize the HTML with a string replacement. For example, you can replace each </a> tag with </a></div>:

$html = str_replace('</a>', '</a></div>', $html);

Or, if the page contains other </a> tags that should not be changed, replace only </a><div class="u"> with </a></div><div class="u">:

$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);

There may be another problem: whitespace between the tags keeps the search string from matching, so the replacement does nothing. To solve this, first remove the whitespace between the tags and then do the replacement.

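include 'simple_html_dom.php'; // PHP Simple HTML DOM Parser; adjust the path to your copy

$html = '<div class="namepageheader"> 
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div> ' ;
// Remove whitespace between tags so the replacement below can match.
$html = preg_replace('~>\\s+<~m', '><', $html);
// str_replace returns the new string, so assign the result.
$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);
// Parse the repaired string before querying it.
$dom = str_get_html($html);
$name = $dom->find('div[class="u"]', 0)->innertext;
$age = $dom->find('div[class="u"]', 1)->innertext;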
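The whitespace removal and the replacement can also be folded into one regular expression that tolerates whitespace between </a> and the next <div class="u">. The following is only a minimal sketch under assumptions: it uses PHP's built-in DOMDocument and DOMXPath instead of PHP Simple HTML DOM Parser so that it runs without extra files, and it relies on the markup having exactly the shape shown in the question.

```php
<?php
// Sample markup from the question: the first <div class="u"> is never closed.
$html = '<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div>';

// Match </a> plus any following whitespace, but only when the next thing is
// <div class="u"> (lookahead, so that tag itself is left untouched).
$fixed = preg_replace('~</a>\s*(?=<div class="u">)~', '</a></div>', $html);

// Assumption: parse with the built-in DOM extension rather than
// PHP Simple HTML DOM Parser, which the question uses.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($fixed);
libxml_clear_errors();

// Select every <div class="u"> in document order.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@class="u"]');

$name = trim($nodes->item(0)->textContent);
$age  = trim($nodes->item(1)->textContent);

echo $name, "\n";
echo $age, "\n";
```

Because the lookahead does not consume the following tag, the rest of the document keeps its original whitespace, which can make the repaired HTML easier to compare with the source when debugging.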

  • A word of caution -- this solution would work but only for the example given. Unless @user16202411's problem is really predictable, trying to clean it in this way could be prohibitively difficult to debug and maintain. – Mark Jul 21 '21 at 16:37
  • Yes, I use this as a quick way to fix and crawl pages that have invalid HTML. There are other ways, such as using Tidy or the approach described on this page: [link](https://stackoverflow.com/questions/3810230/close-open-html-tags-in-a-string) – sama latifi Jul 21 '21 at 20:38