I have the following HTML:

<div class="ti"><div class="pic">
        <a href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a> (98)
    </div></div><div class="ti"><div class="pic">
        <a href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a> (6044)
    </div></div>

How can I use preg_match_all in PHP to get

  1. /categories/rr/1.html

  2. http://www.erty.com/images/440f2d2a.jpg

  3. Ind

  4. 98

for all entries.

I tried

preg_match_all('|[^<div class="ti"><div class="pic">].*?[^<\/div><\/div>]+|',
$test_html,
$out, PREG_PATTERN_ORDER);

But it's not working.

Blaze Mathew

3 Answers

Never try to parse HTML with RegExp.

Since your HTML is probably also well-formed XML, try this:

// Single quotes, so the double quotes inside the markup need no escaping.
$html = '<div class="ti"><div class="pic"><a href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a></div></div><div class="ti"><div class="pic"><a href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a></div></div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
// Hand the parsed tree to SimpleXML for easier traversal.
$sxml = simplexml_import_dom($doc);

Or, if you're scraping a website, you'd be better off using jQuery-style selectors in a Node.js app.
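
To get from the SimpleXML object to the four requested values, one option is to run XPath on the imported tree. This is only a minimal sketch: it assumes the markup from the question (including the trailing counts) is in `$html`, and the `$rows` variable and its key names are illustrative rather than part of the answer above.

```php
<?php
// Sample markup from the question, including the "(98)" style counts.
$html = '<div class="ti"><div class="pic">
        <a href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a> (98)
    </div></div><div class="ti"><div class="pic">
        <a href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a> (6044)
    </div></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);

$rows = [];
foreach ($sxml->xpath('//div[@class="pic"]') as $pic) {
    // The count sits in the div's own text; pull the digits out of "(98)".
    preg_match('/\((\d+)\)/', (string) $pic, $m);
    $rows[] = [
        'href'  => (string) $pic->a['href'],
        'src'   => (string) $pic->a->img['src'],
        'name'  => (string) $pic->a->span,
        'count' => isset($m[1]) ? (int) $m[1] : null,
    ];
}

print_r($rows);
```

Each element of `$rows` then holds the link, the image URL, the name and the count for one entry.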

napolux

That's not a job for regular expressions. PHP has built-in classes for parsing HTML that allow you to query nodes through the DOM.

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$pics = $xpath->query('//div[@class="ti"]/div[@class="pic"]');

$data = [];
foreach ($pics as $pic) {
    $data[] = [
        'href' => $pic->getElementsByTagName('a')->item(0)->getAttribute('href'),
        'src' => $pic->getElementsByTagName('img')->item(0)->getAttribute('src'),
        'content' => trim($pic->textContent)
    ];
}

print_r($data);

Output:

Array
(
    [0] => Array
        (
            [href] => /categories/rr/1.html
            [src] => http://www.erty.com/images/440f2d2a.jpg
            [content] => Ind (98)
        )

    [1] => Array
        (
            [href] => /categories/ert/1.html
            [src] => http://www.erty.com/images/4123d2b.jpg
            [content] => Wes (6044)
        )
        )

)
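
The question asks for the name and the count as separate values, while the text content above holds them together ("Ind (98)"). A possible way to split them is to evaluate XPath expressions relative to each `pic` node. This is a sketch under the assumption that the count is always the text node right after the `<a>` element; the `$entries` variable and its key names are made up for illustration.

```php
<?php
$html = '<div class="ti"><div class="pic">
        <a href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a> (98)
    </div></div><div class="ti"><div class="pic">
        <a href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a> (6044)
    </div></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$entries = [];
foreach ($xpath->query('//div[@class="ti"]/div[@class="pic"]') as $pic) {
    // Relative expressions, evaluated with $pic as the context node.
    // The first text node after the <a> looks like " (98)".
    $tail = $xpath->evaluate('string(a/following-sibling::text()[1])', $pic);
    $entries[] = [
        'href'  => $xpath->evaluate('string(a/@href)', $pic),
        'src'   => $xpath->evaluate('string(a/img/@src)', $pic),
        'name'  => $xpath->evaluate('string(a/span)', $pic),
        // Strip everything that is not a digit, then cast.
        'count' => (int) preg_replace('/\D+/', '', $tail),
    ];
}

print_r($entries);
```

Using `evaluate()` with `string(...)` returns plain strings directly, so there is no need to index into node lists.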
revo
Use lazy quantifiers (.*?) between the captures. With a greedy .* and the s flag, a single match stretches across both entries and mixes their values.

$regex = '/href="(.*?)".*?src="(.*?)".*?alt="(.*?)".*?\((\d+)\)/s';

$string = '
<div class="ti"><div class="pic">
        <a href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a> (98)
    </div></div><div class="ti"><div class="pic">
        <a href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a> (6044)
    </div></div>
';

preg_match_all($regex, $string, $matches);

print_r($matches);

OUTPUT:

Array
(
    [0] => Array
        (
            [0] => href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a> (98)
            [1] => href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a> (6044)
        )

    [1] => Array
        (
            [0] => /categories/rr/1.html
            [1] => /categories/ert/1.html
        )

    [2] => Array
        (
            [0] => http://www.erty.com/images/440f2d2a.jpg
            [1] => http://www.erty.com/images/4123d2b.jpg
        )

    [3] => Array
        (
            [0] => Ind
            [1] => Wes
        )

    [4] => Array
        (
            [0] => 98
            [1] => 6044
        )

)
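
If one row per entry is more convenient than one array per capture group, a variant with named groups and PREG_SET_ORDER might look like the sketch below. The group names and the `$rows` variable are illustrative choices, and the pattern still assumes the attributes appear in the order href, src, alt, so it remains brittle if the markup changes.

```php
<?php
$string = '
<div class="ti"><div class="pic">
        <a href="/categories/rr/1.html"><img src="http://www.erty.com/images/440f2d2a.jpg" alt="Ind"> <span>Ind</span></a> (98)
    </div></div><div class="ti"><div class="pic">
        <a href="/categories/ert/1.html"><img src="http://www.erty.com/images/4123d2b.jpg" alt="Wes"> <span>Wes</span></a> (6044)
    </div></div>
';

// [^"]* keeps each capture inside its own attribute value;
// the lazy .*? gaps stop at the nearest following attribute or count.
$regex = '/href="(?<href>[^"]*)".*?src="(?<src>[^"]*)".*?alt="(?<name>[^"]*)".*?\((?<count>\d+)\)/s';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($regex, $string, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = [
        'href'  => $m['href'],
        'src'   => $m['src'],
        'name'  => $m['name'],
        'count' => (int) $m['count'],
    ];
}

print_r($rows);
```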
Octavian