scrape data using regex and simplehtmldom

Question

I am trying to scrape some data from this site: http://laperuanavegana.wordpress.com/. Specifically, I want the title of each recipe and its ingredients. The ingredients are located between two specific keywords. I am trying to get this data using regex and simplehtmldom, but it shows the full HTML text, not just the ingredients. Here is my code:

<?php

include_once('simple_html_dom.php');
$base_url = "http://laperuanavegana.wordpress.com/";

traverse($base_url);


function traverse($base_url)
{
    
    $html = file_get_html($base_url);
    $k1="Ingredientes";
    $k2="Preparación";
    preg_match_all("/$k1(.*)$k2/s",$html->innertext,$out);
    echo $out[0][0];
}

?>

There are multiple ingredient lists on this page and I want all of them, which is why I am using preg_match_all(). It would be helpful if anybody could spot the bug in this code. Thanks in advance.

score 4 · Answer 1 · edited May 23 '17 at 11:55

When you are already using an HTML parser (even a poor one like SimpleHtmlDom), why are you trying to mess up things with Regex then? That's like using a scalpel to open up the patient and then falling back to a sharpened spoon for the actual surgery.

Since I strongly believe no one should use SimpleHtmlDom because it has a poor codebase and is much slower than libxml based parsers, here is how to do it with PHP's native DOM extension and XPath. XPath is effectively the Regex or SQL for X(HT)ML documents. Learn it, so you will never ever have to touch Regex for HTML again.

$dom = new DOMDocument;
// Collect libxml parse warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com/2011/06/11/ensalada-tibia-de-quinua-mango-y-tomate/');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('content');
// Both XPath expressions are evaluated relative to the #content element
$recipe['title'] = $xpath->evaluate('string(div/h2/a)', $contentDiv);
foreach ($xpath->query('div/div/ul/li', $contentDiv) as $listNode) {
    $recipe['ingredients'][] = $listNode->nodeValue;
}
print_r($recipe);

This will output:

Array
(
    [title] => Ensalada tibia de quinua, mango y tomate
    [ingredients] => Array
        (
            [0] => 250gr de quinua cocida tibia
            [1] => 1 mango grande
            [2] => 2 tomates
            [3] => Unas hojas de perejil
            [4] => Sal
            [5] => Aceite de oliva
            [6] => Vinagre balsámico
        )

)

Note that we are not parsing http://laperuanavegana.wordpress.com/ but the actual blog post. The main URL will change content whenever the blog owner adds a new post.

To get all the Recipes from the main page, you can use

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com');
libxml_clear_errors();
$contentDiv = $dom->getElementById('content');
$xp = new DOMXPath($dom);
$recipes = array();
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
    echo
        ($node->nodeName === 'a') ? "\n# " : '- ',
        $node->nodeValue,
        PHP_EOL;
}

This will output

# Ensalada tibia de quinua, mango y tomate
- 250gr de quinua cocida tibia
- 1 mango grande
- 2 tomates
- Unas hojas de perejil
- Sal
- Aceite de oliva
- Vinagre balsámico

# Flan de lúcuma
- 1 lúcuma grandota o 3 pequeñas
- 1/2 litro de leche de soja evaporada
…

and so on
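If the recipes should end up in an array rather than being echoed, the same union query can be grouped by title. The following is only a sketch: the inline markup is a made-up stand-in that mirrors the structure the XPath above expects, and the `$sample` and `$current` names are assumptions for illustration.

```php
<?php
// Hypothetical markup mirroring the structure the XPath expects
// (#content > div > h2 > a for titles, #content > div > div > ul > li for ingredients).
$sample = <<<HTML
<html><body><div id="content">
  <div><h2><a href="#">Ensalada tibia</a></h2>
       <div><ul><li>1 mango grande</li><li>2 tomates</li></ul></div></div>
  <div><h2><a href="#">Flan</a></h2>
       <div><ul><li>Sal</li></ul></div></div>
</div></body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($sample);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$contentDiv = $xp->query('//div[@id="content"]')->item(0);

$recipes = array();
$current = null; // title of the recipe the following <li> nodes belong to
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
    if ($node->nodeName === 'a') {
        // A title link starts a new recipe
        $current = trim($node->nodeValue);
        $recipes[$current] = array();
    } elseif ($current !== null) {
        // A list item is an ingredient of the most recent title
        $recipes[$current][] = trim($node->nodeValue);
    }
}
print_r($recipes);
```

This relies on the union query returning nodes in document order, so every ingredient follows the title it belongs to.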

Thanks for your valuable suggestion. I will learn it ASAP. But I really need to do this using regex, because I have two keywords, "Ingredientes" and "Preparación", between which the ingredients reside. Would you please tell me a way to do this? @Chronial has already answered my question. I need some more details, and I have mentioned that in the previous comment. – Quazi Marufur Rahman, Aug 13 '11 at 16:38
@qmaruf why do you need to use the keywords when the above code gives you the ingredients already? – Gordon, Aug 13 '11 at 16:44
This is something like a project, and using regex is the requirement, so I am bound to use it. I will learn DOMDocument ASAP. – Quazi Marufur Rahman, Aug 13 '11 at 16:49
@qmaruf no offense, but then it's a stupid project. You do not need Regex for this task. You can get any information you want from the document faster and more reliably with DOM and XPath. Tell whoever made Regex a requirement for this project that it is the wrong tool for the job. Regex does not understand HTML. You are reinventing the wheel by teaching Regex to understand HTML. Parsing HTML is a solved problem. You use an HTML/XML parser for that. – Gordon, Aug 13 '11 at 16:58

score 3 · Accepted Answer · answered Aug 13 '11 at 15:51

You need to add a question mark after the `.*`. It makes the quantifier ungreedy; otherwise it will take everything from the first $k1 to the last $k2 on the page. If you add the question mark, each match ends at the next $k2.

preg_match_all("/$k1(.*?)$k2/s",$html->innertext,$out);
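A minimal sketch of how the result can be inspected, assuming a made-up sample string in place of `$html->innertext`. The return value of `preg_match_all()` is the number of matches, `$out[0]` holds the full matches including both keywords, and `$out[1]` holds only what the parentheses captured. The `preg_quote()`, `strip_tags()` and `trim()` calls are additions for illustration and not part of the pattern above.

```php
<?php
// Hypothetical sample text standing in for $html->innertext.
$text = "Ingredientes <ul><li>1 mango</li></ul> Preparación mezclar. "
      . "Ingredientes <ul><li>Sal</li></ul> Preparación servir.";

// Escape the keywords in case they ever contain regex metacharacters.
$k1 = preg_quote("Ingredientes", "/");
$k2 = preg_quote("Preparación", "/");

// Returns the number of full pattern matches (or false on error).
$count = preg_match_all("/$k1(.*?)$k2/s", $text, $out);

// $out[1] contains only the text between the keywords; strip the
// remaining tags and surrounding whitespace for readability.
$between = array_map(function ($s) {
    return trim(strip_tags($s));
}, $out[1]);

echo $count, PHP_EOL;
print_r($between);
```

With the greedy `(.*)` the same input would yield a single match spanning from the first keyword to the last one, which is the behaviour described in the question.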

Chronial


Thanks. Is it possible to know how many matches have been found? Also, this regex returns all the text including $k1 and $k2, but I want only the text between them. – Quazi Marufur Rahman Aug 13 '11 at 16:00
Well, just look at the content of $out and you will find out. You can print array contents with [`print_r`](http://www.php.net/manual/en/function.print-r.php) and count array elements with [`count()`](http://www.php.net/manual/en/function.count.php). – Chronial Aug 13 '11 at 16:21
Would you please help a bit more? I want to get all the ingredients on that site, so I have to traverse the whole site. I can follow all the links from the first page recursively to do so, but that will cause a problem if there are back links. Would you please solve this problem? – Quazi Marufur Rahman Aug 13 '11 at 17:51
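
The back-link concern in the last comment is commonly handled by remembering which URLs have already been visited. This was not answered in the thread, so the following is only a sketch under stated assumptions: the `crawl()` function, the `$getLinks` callback and the in-memory `$links` graph are hypothetical, and a real crawler would fetch each page and extract its links instead.

```php
<?php
// Hypothetical link graph standing in for real pages; "/a" links back to "/".
$links = array(
    '/'  => array('/a', '/b'),
    '/a' => array('/', '/b'),
    '/b' => array(),
);

// Breadth-first traversal that visits every reachable URL exactly once.
// $getLinks is an assumed callback returning the links found on a page.
function crawl($start, callable $getLinks)
{
    $visited = array();
    $queue = array($start);
    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already processed, so back links cannot cause a loop
        }
        $visited[$url] = true;
        foreach ($getLinks($url) as $next) {
            if (!isset($visited[$next])) {
                $queue[] = $next;
            }
        }
    }
    return array_keys($visited);
}

$order = crawl('/', function ($url) use ($links) {
    return isset($links[$url]) ? $links[$url] : array();
});
print_r($order);
```

Using a queue instead of recursion also avoids deep call stacks on sites with long link chains.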
