Extracting text from html?

Question

I have a string as below

<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>

I want to extract the text from the above HTML as Hello World, this is StackOverflow's question details page. Notice that I want to remove the &nbsp; as well.

How can we achieve this in PHP? I tried a few functions (strip_tags, html_entity_decode, etc.), but they all fail in some conditions.

Please help, Thanks!

Edit: the code I am trying is below, but it's not working :( It leaves the &nbsp; and &#39; type of characters.

$TMP_DESCR = trim(strip_tags($rs['description']));

As @jakenoble says, it would help if you posted your sample code, output, and errors. — diagonalbatman, Feb 02 '11 at 11:46
If the shown string is part of a full HTML page or a larger snippet containing additional markup, please see [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Feb 02 '11 at 11:47
@Gordon its not a big html, I just want to do it with simple methods :( — djmzfKnm, Feb 02 '11 at 11:54

score 1 · Accepted Answer · answered Feb 02 '11 at 12:06

Below worked for me...had to do a str_replace on the non-breaking space though.

$string = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";
echo htmlspecialchars_decode(trim(strip_tags(str_replace('&nbsp;', '', $string))), ENT_QUOTES);

answered Feb 02 '11 at 12:06

Aaron W.

Yes, that's working for me as well. If there is no solution for `&nbsp;` then it's fine, we can go with the replace. Thanks for the help! – djmzfKnm Feb 02 '11 at 12:11

score 0 · Answer 2 · answered Feb 02 '11 at 11:45

strip_tags() will get rid of the tags, and trim() should get rid of the whitespace. I'm not sure if it will work with non-breaking spaces though.
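To illustrate the doubt about non-breaking spaces: by default trim() only strips ASCII whitespace and the NUL byte, so a decoded non-breaking space (U+00A0) survives it. Below is a minimal sketch, assuming PHP 7+ and UTF-8 strings; trimWithNbsp is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: strips whitespace and U+00A0 from both ends of a UTF-8 string.
function trimWithNbsp(string $text): string
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    $trimmed = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);

    // preg_replace() returns null on invalid UTF-8; fall back to the input then.
    return $trimmed ?? $text;
}

$nbsp = "\u{00A0}";                      // non-breaking space, bytes C2 A0 in UTF-8
$text = $nbsp . "Hello World" . $nbsp;

var_dump(trim($text) === $text);         // bool(true): trim() left both U+00A0 in place
var_dump(trimWithNbsp($text));           // string(11) "Hello World"
```

Passing "\xC2\xA0" as trim()'s second argument is tempting, but trim() works byte by byte, so it could also cut into other multibyte characters at the ends of the string; that is why the sketch uses a regular expression instead.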

answered Feb 02 '11 at 11:45

sevenseacat

score 0 · Answer 3 · edited Feb 02 '11 at 11:57

First, you'll have to call trim() on the HTML to remove the white space. http://php.net/manual/en/function.trim.php

Then strip_tags, then html_entity_decode.

So: html_entity_decode(strip_tags(trim($html)));
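One caveat with this order: html_entity_decode() turns &nbsp; into a real non-breaking space (U+00A0), which trim() does not remove, so trimming has to happen after decoding and has to know about that character. A rough sketch of the same pipeline with that adjustment, assuming PHP 7+ and UTF-8 input; htmlToText is a hypothetical helper name.

```php
<?php
// Hypothetical helper: tags out, entities decoded, then a Unicode-aware trim.
function htmlToText(string $html): string
{
    // 1. Remove the markup first, so decoded "&lt;" text is never mistaken for a tag.
    $text = strip_tags($html);

    // 2. Decode entities; ENT_QUOTES is needed for &#39; to become an apostrophe.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 3. Trim whitespace including U+00A0 (what &nbsp; decodes to).
    $trimmed = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text);

    // preg_replace() returns null on invalid UTF-8; fall back to the untrimmed text.
    return $trimmed ?? $text;
}

echo htmlToText("<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>");
// Hello World, this is StackOverflow's question details page
```

Non-breaking spaces in the middle of the text are left alone here; if those should become ordinary spaces, a str_replace("\u{00A0}", ' ', $text) after step 2 would be one way to do it.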

edited Feb 02 '11 at 11:57

djmzfKnm

answered Feb 02 '11 at 11:46

Rui Jiang

lonesomeday · Answer 4 · 2011-02-02T12:01:46.370

Probably the nicest and most reliable way to do this is with genuine (X|HT)ML parsing functions like the DOMDocument class:

<?php

$str = "<p>&nbsp;Hello World, this is StackOverflow&#39;s question details page</p>";

$dom = new DOMDocument;
$dom->loadXML(str_replace('&nbsp;', ' ', $str));

echo trim($dom->firstChild->nodeValue);
// "Hello World, this is StackOverflow's question details page"

This is probably slight overkill for this problem, but using the proper parsing functionality is a good habit to get into.

Edit: You can reuse the DOMDocument object, so you only need two lines within the loop:

$dom = new DOMDocument;
while ($rs = mysql_fetch_assoc($result)) { // or whatever
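    $dom->loadHTML(str_replace('&nbsp;', ' ', $rs['description']));
    // loadHTML() wraps the fragment in <html><body> and adds a doctype, so firstChild
    // is the doctype node; read the text from the document element instead
    $TMP_DESCR = trim($dom->documentElement->textContent);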

    // do something with $TMP_DESCR
}

This seems a long method, and as I am running it in a loop, I think it will be expensive. — djmzfKnm, Feb 02 '11 at 11:55
