Getting the href attribute and text of certain kind of links

Question

Of these four links:

<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Albano_Y_Romina_Power.html">Albano Y Romina Power</a><br>
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Armando_Manzanero.html">Armando Manzanero</a><br>

<a name="inicio21" href="musica-Merengue-de-Banda_Cuisillos.html">
<img border="0" src="imagenes/flech.gif" width="6" height="8">Banda Cuisillos</a><br>

<a href="Musica-Baladas-Alternativas.html">Baladas Alternativas</a><br>

I'm trying to capture the href value and the link text of the first three, leaving out the fourth. In other words, I'm trying to get this:

escuchar-baladas-de-Albano_Y_Romina_Power.html    Albano Y Romina Power
escuchar-baladas-de-Armando_Manzanero.html    Armando Manzanero
musica-Merengue-de-Banda_Cuisillos.html    Banda Cuisillos

I was trying to take advantage of the fact that the first three have imagenes/flech.gif, so that I could leave out the fourth. The problem is that imagenes/flech.gif isn't always in the same position relative to the link. Here is my attempt to solve it, where I get as far as the href but it also includes the fourth link.

Thanks for any help

Obligatory [answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454), but if you just need to parse those 4 links as they appear assuming they are never going to change I imagine an answer will come up here soon. — MattSizzle, Apr 26 '14 at 02:14
@Tuga the two first are in the same order just as in the link i put — user2495207, Apr 26 '14 at 03:17

Pedro Lobito · Accepted Answer · 2014-04-26T03:40:50.923

You should use an HTML parser rather than a regex. Try this:

<?php

$html = <<< EOF
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Albano_Y_Romina_Power.html">Albano Y Romina Power</a><br>
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Armando_Manzanero.html">Armando Manzanero</a><br>

<a name="inicio21" href="musica-Merengue-de-Banda_Cuisillos.html">
<img border="0" src="imagenes/flech.gif" width="6" height="8">Banda Cuisillos</a><br>

<a href="Musica-Baladas-Alternativas.html">Baladas Alternativas</a><br>
EOF;


$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about sloppy HTML quiet without using @
$dom->loadHTML($html);

# Iterate over all the <a> tags
foreach ($dom->getElementsByTagName('a') as $link) {

    $url  = $link->getAttribute('href');
    $text = preg_replace('/[\r\n]/', '', $link->nodeValue); // remove line breaks

    // Print the link only if its text does not contain any of the banned phrases
    if (!preg_match('/(Baladas Alternativas|another text to filter)/', $text)) {
        echo $url . " " . $text . "\n";
    }

}
?>

DEMO
http://ideone.com/5QX83x

RESOURCES
http://htmlparsing.com/php.html
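
The question asks to filter on the arrow image rather than on the link text. A minimal sketch of that idea follows, assuming the arrow is either inside the link or is the nearest element before it (as in the sample). `extractArrowLinks` and `$arrowSrc` are illustrative names, not part of the answer above:

```php
<?php

/**
 * Return [href, text] pairs for the links marked with the arrow image.
 * The function name and the $arrowSrc parameter are hypothetical.
 */
function extractArrowLinks(string $html, string $arrowSrc = 'imagenes/flech.gif'): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Assumption: a link counts as "marked" if the arrow image is inside it,
    // or if the arrow is its nearest preceding sibling element.
    // $arrowSrc must not contain a double quote, or the XPath string breaks.
    $query = sprintf(
        '//a[@href][.//img[@src="%1$s"] or preceding-sibling::*[1][self::img and @src="%1$s"]]',
        $arrowSrc
    );

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $links[] = [$a->getAttribute('href'), trim($a->textContent)];
    }
    return $links;
}

$html = <<<EOF
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Albano_Y_Romina_Power.html">Albano Y Romina Power</a><br>
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Armando_Manzanero.html">Armando Manzanero</a><br>

<a name="inicio21" href="musica-Merengue-de-Banda_Cuisillos.html">
<img border="0" src="imagenes/flech.gif" width="6" height="8">Banda Cuisillos</a><br>

<a href="Musica-Baladas-Alternativas.html">Baladas Alternativas</a><br>
EOF;

foreach (extractArrowLinks($html) as [$href, $text]) {
    echo $href . "    " . $text . "\n";
}
```

With the sample markup this should list the first three links and skip Baladas Alternativas. The XPath relies on the arrow being the element immediately before the link when it is not inside it; a different layout would need a different predicate.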

@user2495207 if the html changes you still get the results :) If my answer helped you, please consider accepting it as the correct answer, tks! — Pedro Lobito, Apr 26 '14 at 12:43

Sirius_Black · Answer 2 · 2014-04-29T21:47:12.097


This code will get the first three links:

$a='<img border="0" src="imagenes/flech.gif" width="6" height="8"><a href="escuchar-baladas-de-Albano_Y_Romina_Power.html">Albano Y Romina Power</a><br><img border="0" src="imagenes/flech.gif" width="6" height="8"><a href="escuchar-baladas-de-Armando_Manzanero.html">Armando Manzanero</a><br><a name="inicio21" href="musica-Merengue-de-Banda_Cuisillos.html"><img border="0" src="imagenes/flech.gif" width="6" height="8">Banda Cuisillos</a><br><a href="Musica-Baladas-Alternativas.html">Baladas Alternativas</a><br>';

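// Group 1 captures the href, the optional non-capturing group skips a leading <img> tag,
// and group 2 captures the link text.
preg_match_all('/<a.*?href="(.+?)">(?:<img.*\d+">)?(.+?)<\/a>/', $a, $match);

// Print only the first three links; the fourth match (index 3) is ignored.
for ($i = 0; $i < 3; $i++) {
    echo $match[1][$i] . "  " . $match[2][$i] . "<br>";
}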
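
If a regex is still preferred, one way to avoid relying on match positions is to capture the arrow image optionally before or inside the link, and keep only the matches where it was found. This is a sketch under the assumption that the markup always looks like the sample. `extractArrowLinksRegex` is a hypothetical name, and the pattern is not the one from the code above:

```php
<?php

/**
 * Regex-only variant: return [href, text] pairs for links that have the
 * arrow image right before them or as their first child.
 * Brittle by nature; it assumes double-quoted href values and plain-text link labels.
 */
function extractArrowLinksRegex(string $html): array
{
    $arrow = '<img[^>]*flech\.gif[^>]*>';

    // Group 1: optional arrow before the link. Group 2: href value.
    // Group 3: optional arrow inside the link. Group 4: link text.
    $pattern = '~(' . $arrow . '\s*)?<a\b[^>]*href="([^"]+)"[^>]*>\s*(' . $arrow . ')?\s*([^<]+)</a>~i';

    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Optional groups that did not take part in the match come back as empty strings.
        if ($m[1] !== '' || $m[3] !== '') {
            $links[] = [$m[2], trim($m[4])];
        }
    }
    return $links;
}

$html = <<<EOF
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Albano_Y_Romina_Power.html">Albano Y Romina Power</a><br>
<img border="0" src="imagenes/flech.gif" width="6" height="8">

<a href="escuchar-baladas-de-Armando_Manzanero.html">Armando Manzanero</a><br>

<a name="inicio21" href="musica-Merengue-de-Banda_Cuisillos.html">
<img border="0" src="imagenes/flech.gif" width="6" height="8">Banda Cuisillos</a><br>

<a href="Musica-Baladas-Alternativas.html">Baladas Alternativas</a><br>
EOF;

foreach (extractArrowLinksRegex($html) as [$href, $text]) {
    echo $href . "  " . $text . "<br>";
}
```

Because `\s*` also matches line breaks, this variant accepts the multi-line markup from the question as well as the single-line string used above. It remains sensitive to any change in attribute quoting or nesting, which is the usual argument for a parser.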

edited Apr 29 '14 at 21:47

answered Apr 26 '14 at 03:21


But the `$match[2][2]` is `Banda Cuisillos` instead of `Banda Cuisillos`. Thanks anyway – user2495207 Apr 26 '14 at 12:37
I know you already chose an answer, but check my edited code – Sirius_Black Apr 28 '14 at 02:19
`$match[2][2]` says `Baladas Alternativas` instead of `Banda Cuisillos` – user2495207 Apr 28 '14 at 21:00
Code edited; I didn't see that it was returning the other band – Sirius_Black Apr 29 '14 at 21:51
