HTML parsing with regexp

Question

I have this string

<td align='right'>
    <a class='texto'>119</a>
</td>
<td align='right'>
    <a class='texto'>SITEX (ST)</a>
</td>
<td align='right' onmouseover="ponerdc(this,'0001','CARNE EDUARDO','PRO PINT','LAS HERAS 3252 (7600) MAR DEL PLATA SUR - Buenos Aires - 054','LAS HERAS 3252 (7600) MAR DEL PLATA SUR - Buenos Aires - 054','20-17179119-8','02234942484','I; IVA INSCRIPTO RET GAN: 21%; IIBB: No',1,'152');" onmouseout="sacardc();">
    <a class='texto'>0001</a>
</td>
<td align='right'>
    <a class='texto'>Costoya, Karina</a>
</td>
<td align='right'>
    <a class='texto'>152</a>
</td>
<td align='right'>
    <a class='texto' onmouseover="ponerobs(this,'1 Productos');" onmouseout="sacarobs();">6</a>
</td>
<td align='right'>
    <a class='texto'>$ 1,493.58</a>
</td>
<td align='right' >
    <a class='texto' onmouseover="ponerobs('STOCK');" onmouseout="sacarobs();">06/01/2014&nbsp;12:20</a></td><td align='right'><a class='texto'><b>Pendiente</b></a>
</td>
<td align='right'>
    <a class='texto'><b>Pendiente</b></a>
</td>

I'm trying to get only the row data, i.e. only "119", "SITEX (ST)", "0001", etc.

I tried this

foreach ($tabla_data as $line){
    // each "line" is like the string example above
    $line = str_replace("</tr>", "", $line);
    if (preg_match("/<td/", $line)){
        $line = preg_replace("/<td([^>.]|\.|\,)*>(.)*<\/td>/", "($1)($2)       -       ", $line);
        echo $line."\n\n\n";
    }
}

But it doesn't work as expected...

The right output should be

119


SITEX (ST)


0001


Costoya, Karina


152


0


$0.00


Pendiente


Pendiente


Pendiente

Why don't you try an XML parser instead? It would be easier than using regexp. – Carlos487, Jan 06 '14 at 16:59
[Don't use regex to parse HTML](http://stackoverflow.com/q/1732348/497418). PHP has built-in DOM parsing capabilities; make use of them. – zzzzBov, Jan 06 '14 at 16:59
You didn't search SO with the title of your question, did you? You would have seen that regular expressions aren't a parsing tool. – Denys Séguret, Jan 06 '14 at 16:59
I'll just leave this here: [You can't parse non-regular languages with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – rockerest, Jan 06 '14 at 16:59
HTML is not a regular language, so it can't be parsed reliably with a regex. [Use an HTML parser instead](https://eval.in/86307). – Amal Murali, Jan 06 '14 at 17:00
OK, I can't know everything. I thought it would be easier to parse HTML with regexp. – Nelson Galdeman Graziano, Jan 06 '14 at 17:01
@NelsonGaldemanGraziano: Unfortunately, it's not. Using regex here is actually harder than the alternatives :-) – gen_Eric, Jan 06 '14 at 17:02
People are slating you because you didn't do a search first, but don't worry: I got blasted once for trying to do something dumb. Until you know a better way, REGEX that sh*t! – GrahamTheDev, Jan 06 '14 at 17:02
@NelsonGaldemanGraziano: It is okay to be wrong. Now, have a look at this question: [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php). – Amal Murali, Jan 06 '14 at 17:03
Basically HTML is XML, so you can use an XML parser as well. HTML parsers are used for more advanced stuff, like building a DOM tree. – Carlos487, Jan 06 '14 at 17:03

score 2 · Accepted Answer · answered Jan 06 '14 at 17:04

As you have had quite a ribbing, here is what you should look at:

http://simplehtmldom.sourceforge.net/

It is a GREAT tool that lets you use nested selectors (think jQuery) to pinpoint what you are looking for with ease!

GrahamTheDev

I'd use the [`DOMDocument`](http://php.net/dom) class instead. (available natively) – Amal Murali Jan 06 '14 at 17:14
Marked as useful answer - but I do find simpleHTMLDom parser easier to use! – GrahamTheDev Jan 06 '14 at 17:25
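
For reference, here is a minimal sketch of the native `DOMDocument` approach suggested in the comments. It is not taken from the answer itself. It assumes PHP 7 or later, that the cells are wrapped in a `<table><tr>` as they would be on the real page, and that the values you want are the text of each `<a class='texto'>` directly inside a `<td>`. The function name `extractCellTexts` and the shortened sample markup are made up for illustration.

```php
<?php
// Shortened sample based on the cells in the question (an assumption:
// the real page wraps them in a table row).
$html = <<<'HTML'
<table><tr>
<td align='right'>
    <a class='texto'>119</a>
</td>
<td align='right'>
    <a class='texto'>SITEX (ST)</a>
</td>
<td align='right' >
    <a class='texto' onmouseover="ponerobs('STOCK');" onmouseout="sacarobs();">06/01/2014&nbsp;12:20</a></td><td align='right'><a class='texto'><b>Pendiente</b></a>
</td>
</tr></table>
HTML;

// Hypothetical helper: returns the visible text of every <a class='texto'>
// that is a direct child of a <td>.
function extractCellTexts(string $html): array
{
    $doc = new DOMDocument();

    // Scraped HTML is rarely valid; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query("//td/a[@class='texto']") as $anchor) {
        // textContent flattens nested tags such as <b>;
        // &nbsp; arrives as U+00A0, so turn it into a plain space.
        $text = str_replace("\u{00A0}", ' ', $anchor->textContent);
        $texts[] = trim($text);
    }

    return $texts;
}

print_r(extractCellTexts($html));
```

Because the parser builds a tree, the event-handler attributes and the nested `<b>` tags need no special handling, which is the part the regex attempt struggles with. If the real page contains non-ASCII characters, the encoding would also need to be declared to the parser.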
