How to extract text using preg_match()?

Question

Possible Duplicate:
How to parse and process HTML with PHP?

I have the following text stored in a variable $new:

<div class="img">
<span style="float:left; color:#666;">1.&nbsp;&nbsp;</span>
<a href="/Books/info/J-R-R-Tolkien/The-Lord-of-the-Rings/0618640150.html?utm_term=lord+of+the+ring_1_1">
<img src="http://cdn-img-b-tata.infibeam.net/img/6a53fabc/157/0/9780618640157.jpg?wid=90&hei=113" width="90" height="113" border="0">
</a>
</div>
<span class="title">
<h2 class="simple"><a href="/Books/info/J-R-R-Tolkien/The-Lord-of-the-Rings/0618640150.html?utm_term=lord+of+the+ring_1_1"><em>Lord</em> of the <em>Rings</em></a></h2>
&nbsp;By
<a href="/Books/search?author=J R R Tolkien" style="font-size:12px; text-decoration:none;">J R R Tolkien</a>
<span style="color:#666666; font-size:11px;">[Paperback 2005, 50th Edition]</span>
</span>
<div class="price" style="line-height:30px;margin-top:0px;">

I need to extract the text starting from 1.&nbsp up to <div. I've tried every solution I could think of, but without success.

Billy is right. Especially since preg_match does not return the matched text itself: it returns 1, 0, or false, and the match is only available through its third argument. Seems like you haven't tried a lot. – Jerska, Aug 06 '12 at 13:48
Parsing HTML with REGEX? s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g — Dejan Marjanović, Aug 06 '12 at 13:49
Regex is the wrong tool for the job; use an HTML parser. Can someone please link to that really old, mad-dog thread all about this subject? (The one that has been edited a thousand times by people trying to make it more or less sensible according to their taste.) – Billy Moon, Aug 06 '12 at 13:50
Probably related: http://stackoverflow.com/a/1732454/1428773. Parsing HTML with regex is not a good idea. You should try turning `$new` into more machine-readable data, such as an array decoded from `JSON`. That is easier for scripts to work with, and you can build the output you want from it. Or hack something together with several [`explode()`](http://php.net/manual/en/function.explode.php) calls. – Whisperity, Aug 06 '12 at 13:54

score 1 · Answer 1 · answered Aug 06 '12 at 13:47

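This should work:

$ret = preg_replace('#^.*?1\.&nbsp(.+?)<div.*$#is', '$1', $new);

with $new containing all your HTML. The leading ^.*? and the trailing .*$ make the pattern cover the whole string, so only the captured part is left in $ret; without them, preg_replace would keep everything before 1.&nbsp and after <div.
Still, regexes aren't the only way to achieve what you want, and certainly not the best one.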
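Since the question asks about preg_match() specifically, here is a minimal sketch of the same idea with a capture group. The markup is a shortened, hypothetical stand-in for the contents of $new, and $matches is just an illustrative variable name.

```php
<?php
// Shortened, hypothetical stand-in for the markup stored in $new.
$new = <<<'HTML'
<div class="img">
<span style="float:left; color:#666;">1.&nbsp;&nbsp;</span>
</div>
<span class="title">
<h2 class="simple"><a href="/Books/info/0618640150.html"><em>Lord</em> of the <em>Rings</em></a></h2>
&nbsp;By
<a href="/Books/search?author=J R R Tolkien">J R R Tolkien</a>
</span>
<div class="price" style="line-height:30px;margin-top:0px;">
HTML;

// (.+?) is lazy, so it stops at the first "<div" that follows "1.&nbsp".
// The s modifier lets "." match newlines; i makes the match case-insensitive.
if (preg_match('#1\.&nbsp(.+?)<div#is', $new, $matches) === 1) {
    $ret = $matches[1];   // text of the first capture group
} else {
    $ret = null;          // no match, or a PCRE error
}
```

preg_match() only reports whether the pattern matched; the extracted text arrives in the array passed as the third argument, which is why $matches[1] is read here. The result is still raw markup, not clean text.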

Jerska

Hmm, if that works perfectly, I wonder why a simple, offset-based string search followed by a substring operation didn't do the job in the first place. There is more to string operations than regex. – hakre Aug 06 '12 at 13:58
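The offset-based approach mentioned in this comment could look roughly like the sketch below. extractBetween() is a hypothetical helper, not a built-in function, and $sample is a made-up, shortened piece of markup.

```php
<?php
// Hypothetical helper: returns the text between two markers,
// or null if either marker is missing.
function extractBetween($haystack, $startMarker, $endMarker)
{
    $start = strpos($haystack, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);                 // skip past the start marker
    $end = strpos($haystack, $endMarker, $start);   // search only after the start marker
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

// Made-up sample input.
$sample = "<span>1.&nbsp;&nbsp;</span>\n<a href=\"#\">Lord of the Rings</a>\n<div class=\"price\">";
$ret = extractBetween($sample, '1.&nbsp', '<div');
```

Note the strict === false checks: strpos() can legitimately return 0, which a loose comparison would mistake for "not found".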

score 1 · Answer 2 · edited May 23 '17 at 12:02

The simple answer is: YOU DON'T. EVER. HTML is not a regular language, therefore regular expressions CANNOT PARSE HTML. You need to use an HTML parser, which PHP provides as the DOM extension.

For more information on why regular expressions don't work with HTML, read this thread. The pony. He comes.
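As a rough illustration of the DOM route, the sketch below pulls the title and author out of markup shaped like the question's. It assumes the DOM extension is installed; the markup is a shortened, hypothetical copy of $new, and the two XPath expressions are guesses based on that sample rather than anything the original page guarantees.

```php
<?php
// Shortened, hypothetical stand-in for the markup stored in $new.
$new = <<<'HTML'
<div class="img">
<span style="float:left; color:#666;">1.&nbsp;&nbsp;</span>
</div>
<span class="title">
<h2 class="simple"><a href="/Books/info/0618640150.html"><em>Lord</em> of the <em>Rings</em></a></h2>
&nbsp;By
<a href="/Books/search?author=J R R Tolkien">J R R Tolkien</a>
<span style="color:#666666; font-size:11px;">[Paperback 2005, 50th Edition]</span>
</span>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world HTML fragments are rarely well-formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($new);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) returns the text content of the first matching node.
$title  = trim($xpath->evaluate('string(//h2[@class="simple"]/a)'));
$author = trim($xpath->evaluate('string(//a[starts-with(@href, "/Books/search?author=")])'));
```

The queries address elements by class and href rather than by position in the string, so they keep working if whitespace or attribute order in the markup changes.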

edited May 23 '17 at 12:02

Community

answered Aug 06 '12 at 13:55

Matt

Pony, he is tired of constantly coming. T͎̹̪̤̤͌ͭ͗ͭ̌̌͊ḧ́̆̓ͯ̄͑̑͂͛͏͖̼͓̤̺ͅḙ̦͖̥ͩ͠͝r͓̈̈́ͯͩ̋ẹ͇͕̖̎ͭͨ́͘ ̸̫͖̞̝͌͂̈̓ͯ́͒̀i͇̙̇̉ͩͦ̿̓͆̾ͤ͝s̻̄̎̆̽̃ͬͧͩ͛͟͡ ̢̢͈͔̺̺͕͕́̉̓ͯͮ̒ͨ̀ṉ̡̩͙͇ͬ͌́͡o̧̖̹͙͛ͣ̋́̄̕ ̴̥̠̻̬̳̻̻̙̯̏̑̏ȟ͚͙̯̮ͨ͑͟o͓̻͖͙̞̎͗ͥͫ́͝ͅp̢͔͓̫̈́̎̆͋̍e̛̛͎̜̲̠̹̯̎̄ͧ̊̆͋͢!̧̹̤̞̟͎̱͖̦ͧ̔ͩ – Dejan Marjanović Aug 06 '12 at 13:57

score 0 · Answer 3 · answered Aug 06 '12 at 13:52

If that's really all the code, this should suffice:

$text = strip_tags($new);
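One caveat: strip_tags() removes the tags but leaves entities such as &nbsp; in place. A minimal sketch of a follow-up cleanup, using a made-up one-line sample in $html, might be:

```php
<?php
// Made-up sample markup.
$html = '<h2 class="simple"><a href="#"><em>Lord</em> of the <em>Rings</em></a></h2>&nbsp;By <a href="#">J R R Tolkien</a>';

$text = strip_tags($html);                               // tags gone, "&nbsp;" still present
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');  // "&nbsp;" becomes U+00A0
$text = str_replace("\xC2\xA0", ' ', $text);             // non-breaking space -> plain space
$text = trim(preg_replace('/\s+/', ' ', $text));         // collapse runs of whitespace
```

This yields plain text for the whole fragment; it does not by itself limit the result to the part between 1.&nbsp and <div.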

dualed
