
I want to extract data from some HTML pages that I have and store it in a database.

Each HTML file contains a list of blogs, organized like this:

 <div class="breadlist"></div>    

    <h3 class="list"><a href="http://test1.com">Title 1</a></h3>
    <p><strong>Description:</strong> Description 1.<br>
    <strong>Author:</strong> Author1<br>
    <strong>XML:</strong> <a href="http://test1.com/feed">Title 1</a><br>
    <strong>Language:</strong> Language1</p>

    <h3 class="list"><a href="http://test2.com">Title 2</a></h3>
    <p><strong>Description:</strong>Description 2. <br>
    <strong>Author:</strong> Author1<br>
    <strong>XML:</strong> <a href="http://test2.com/feed">Title 2</a>  
    <strong>Language:</strong> Español</p>

<div class="breadlist"></div>

In this example there are 2 blogs, but sometimes there are 10 or even 100. Each file has a different number of entries. I would like to get this data for every blog:

Website Address, Title, Description, Author, Feed, Language.

I tried to do it with PHP Simple HTML DOM Parser, but today is my first time using it and I couldn't get anywhere. I imagine I have to loop over something, but I don't know how. Does anybody have an idea how to do this with PHP? Thanks!

----EDIT---- This is what I've tried so far:

$str = <<<HTML
<div class="breadlist"></div>    

    <h3 class="list"><a href="http://test1.com">Title 1</a></h3>
    <p><strong>Description:</strong> Description 1.<br>
    <strong>Author:</strong> Author1<br>
    <strong>XML:</strong> <a href="http://test1.com/feed">Title 1</a><br>
    <strong>Language:</strong> Language1</p>

    <h3 class="list"><a href="http://test2.com">Title 2</a></h3>
    <p><strong>Description:</strong>Description 2. <br>
    <strong>Author:</strong> Author1<br>
    <strong>XML:</strong> <a href="http://test2.com/feed">Title 2</a>  
    <strong>Language:</strong> Español</p>

<div class="breadlist"></div>
HTML;

$html = str_get_html($str);

foreach ($html->find('h3[class=list]') as $title) {
    echo "Title: " . $title->innertext . "<br />";
}
foreach ($html->find('h3[class=list] a') as $address) {
    echo "Address: " . $address->href . "<br />";
}
foreach ($html->find('p') as $description) {
    echo "Description: " . $description->childNodes(3)->plaintext . "<br />"; // doesn't work
}
foreach ($html->find('p a') as $feed) {
    echo "Feed: " . $feed->href . "<br />";
}
foreach ($html->find('h3[class=list] a') as $language) {
    echo "Language: " . $language->innertext . "<br />"; // doesn't work
}
raygo

2 Answers


Use strip_tags:

echo strip_tags($html_text);

If the data is always in the same order in your HTML code, it may be sufficient.
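As a rough sketch of how that could look for one entry: the sample block and the `$fields` array below are only illustrative, and the regular expression assumes the four labels always appear in the order shown in the question. Note that strip_tags() drops attributes along with the tags, so the feed URL in the `href` is lost and only the link text survives.

```php
<?php
// Hypothetical sample: one <p> block copied from the question's markup.
$html_text = '<p><strong>Description:</strong> Description 1.<br>
<strong>Author:</strong> Author1<br>
<strong>XML:</strong> <a href="http://test1.com/feed">Title 1</a><br>
<strong>Language:</strong> Language1</p>';

// strip_tags() removes the markup and leaves only the text content.
$text = strip_tags($html_text);

// Split the plain text on the labels, relying on their fixed order.
// The "s" flag lets "." match the line breaks between fields.
$fields = [];
if (preg_match('/Description:(.*?)Author:(.*?)XML:(.*?)Language:(.*)/su', $text, $m)) {
    $fields = [
        'description' => trim($m[1]),
        'author'      => trim($m[2]),
        'feed_title'  => trim($m[3]), // link text only; the href is gone
        'language'    => trim($m[4]),
    ];
}

print_r($fields);
```

If the feed URL or the website address is needed, this approach is probably not enough on its own, because both live in `href` attributes.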

Jocelyn

I couldn't find a way to do it directly, so I ran a find-and-replace on the HTML to restructure it into a form that PHP Simple HTML DOM Parser could handle.
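For anyone who would rather not edit the files first, here is a minimal sketch of a different approach. It swaps Simple HTML DOM Parser for PHP's built-in DOM extension (DOMDocument and DOMXPath), and it is not what I actually ran. The function name `extractBlogs` and the array keys are made up for illustration, and it assumes every `h3.list` is followed by the `<p>` that describes it, as in the sample from the question.

```php
<?php
// Sample markup from the question (the second entry lacks a <br> after the feed link).
$str = <<<HTML
<div class="breadlist"></div>
<h3 class="list"><a href="http://test1.com">Title 1</a></h3>
<p><strong>Description:</strong> Description 1.<br>
<strong>Author:</strong> Author1<br>
<strong>XML:</strong> <a href="http://test1.com/feed">Title 1</a><br>
<strong>Language:</strong> Language1</p>
<h3 class="list"><a href="http://test2.com">Title 2</a></h3>
<p><strong>Description:</strong>Description 2. <br>
<strong>Author:</strong> Author1<br>
<strong>XML:</strong> <a href="http://test2.com/feed">Title 2</a>
<strong>Language:</strong> Español</p>
<div class="breadlist"></div>
HTML;

// Hypothetical helper: returns one associative array per blog entry.
function extractBlogs(string $html): array
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc = new DOMDocument();
    // The XML prolog is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $xpath = new DOMXPath($doc);

    $blogs = [];
    foreach ($xpath->query('//h3[@class="list"]') as $h3) {
        $link = $xpath->query('a', $h3)->item(0);
        // Assumption: the details are in the first <p> after the heading.
        $p = $xpath->query('following-sibling::p[1]', $h3)->item(0);
        if ($link === null || $p === null) {
            continue; // skip entries that do not match the expected layout
        }
        $feed = $xpath->query('a', $p)->item(0);
        $blog = [
            'address' => $link->getAttribute('href'),
            'title'   => trim($link->textContent),
            'feed'    => $feed !== null ? $feed->getAttribute('href') : null,
        ];
        // Each label is a <strong>; its value is the text node right after it.
        foreach ($xpath->query('strong', $p) as $strong) {
            $label = strtolower(rtrim(trim($strong->textContent), ':'));
            $value = $strong->nextSibling;
            if ($label !== 'xml' && $value instanceof DOMText) {
                $blog[$label] = trim($value->textContent);
            }
        }
        $blogs[] = $blog;
    }
    return $blogs;
}

$blogs = extractBlogs($str);
foreach ($blogs as $blog) {
    echo $blog['title'] . ' => ' . $blog['address'] . "\n";
}
```

Pairing each heading with its following paragraph in a single loop keeps the six fields of one blog together, which the separate foreach loops in the question cannot do. Each element of `$blogs` could then be written to the database, for example with a PDO prepared statement.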

raygo