0

Possible Duplicate:
How to parse and process HTML with PHP?

Can somebody help me find a way to process text that mixes HTML and regular text? For example:

This is my awesome <b>text</b>. Now <a href="http://google.com">starts</a> a new line...

<img src="http://example.com/image.png"/><br>
<br>
I push news to http://twitter.com .

This should become:

This is my awesome <b>text</b>. Now <a href="http://google.com">starts</a> a new line...<br>
<br>
<img src="http://example.com/image.png"/><br>
<br>
I push news to <a href="http://twitter.com">twitter.com</a> .

I'm mainly looking for a magic regex replace function. At the moment I do:

$text = preg_replace("@(src|href)=\"https?://@i",'\\1="', $description);
$text = nl2br(preg_replace("@(((f|ht)tp:\/\/)[^\"\'\>\s]+)@",'<a href="\\1" target="_blank">\\1</a>', $text));
Laoneo
  • read this thread: http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php – Caspar Kleijne Oct 19 '12 at 12:42
  • I don't want to extract information from it. I want to convert some text which is not HTML code into HTML. For example links should be converted into clickable links but when they are already in a tag it should be ignored... changing the title – Laoneo Oct 19 '12 at 12:47
  • You need to parse the HTML before you can modify it like this. – Caspar Kleijne Oct 19 '12 at 13:52
  • Is the input text always going to look like what you posted, or may it change? – PatomaS Oct 19 '12 at 14:06
  • It will change. Sometimes the text is like what the user posted here http://g4j.laoneo.net/support/forum/11-com-gcalendar/16965-imporve-parser-to-support-images.html#17688 and sometimes it is only text without any HTML content. At the moment the product does simple link extraction, but as soon as a sdf comes it fails. I've updated the question with what I have so far... – Laoneo Oct 19 '12 at 14:11
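
The "parse first, then modify" suggestion from the comments could look roughly like the sketch below. It is not code from the thread: the function name linkifyTextNodes and the wrapper id "wrap" are made up for illustration, and the URL pattern is a deliberately simple assumption (it will, for example, swallow a full stop that directly follows a URL). The idea is to let DOMDocument parse the mixed input and then linkify only text nodes that have no <a> ancestor, so URLs inside href and src attributes or inside existing links are never touched.

```php
<?php
// Hypothetical helper: turn bare http(s) URLs into links, but only in text
// nodes that are not already inside an <a> element.
function linkifyTextNodes(string $html): string
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $doc = new DOMDocument();
    // The wrapper div lets us read the fragment back out afterwards; the XML
    // encoding hint makes libxml treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Attributes are not text nodes, so src="..." and href="..." are skipped.
    $textNodes = $xpath->query('//div[@id="wrap"]//text()[not(ancestor::a)]');
    foreach (iterator_to_array($textNodes) as $node) {
        // Split into plain text (even indexes) and URLs (odd indexes).
        $parts = preg_split('@(https?://[^\s<>"\']+)@i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                $link = $doc->createElement('a');
                $link->setAttribute('href', $part);
                // Link text is the URL without its scheme, as in the question.
                $link->appendChild($doc->createTextNode(preg_replace('@^https?://@i', '', $part)));
                $fragment->appendChild($link);
            } elseif ($part !== '') {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the wrapper, not the html/body shell.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Line-break handling (for example nl2br) would still be applied as a separate step; note that serializing through DOMDocument may normalize the markup, such as writing <img .../> as <img ...>.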

2 Answers

3

nl2br does the trick nicely.

$text = file_get_contents('filename.html'); // read the source text
echo nl2br($text);                          // inserts <br /> before each newline

It was designed for exactly this: inserting HTML line breaks before the newlines in a string.

If you're worried about double \ns or <br /> elements that are already present, you have to devise a scheme, either for the input text (if you have control over it) or for preprocessing.

One option is to replace all \n\n with \n and all <br />\n with \n before applying nl2br.
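
A minimal sketch of that preprocessing, assuming the input uses \n line endings: the helper name normalizeBreaks is hypothetical, and matching the <br> and <br/> spellings in addition to <br /> is an added assumption. Be aware that collapsing blank lines discards empty lines the author may have wanted to keep.

```php
<?php
// Hypothetical helper: clean up existing breaks, then let nl2br add them once.
function normalizeBreaks(string $text): string
{
    // An existing break followed by a newline becomes a plain newline,
    // so nl2br does not produce a second <br /> for the same line end.
    $text = str_replace(["<br />\n", "<br/>\n", "<br>\n"], "\n", $text);
    // Collapse runs of newlines into one; a single str_replace("\n\n", "\n")
    // pass would leave "\n\n" behind for three or more newlines in a row.
    $text = preg_replace('/\n{2,}/', "\n", $text);
    return nl2br($text);
}
```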

Mihai Stancu
0

You can try this:

// $text holds your source text
$text = preg_replace(
    array('/\n/m', '/<br><br>/m', '/<br>$/'),
    array("\n<br>", "<br>", ''),
    $text
);

bye

PatomaS
  • Using regular expressions to replace things that simple string replacement would be able to do is pretty wasteful with resources. – Mihai Stancu Oct 19 '12 at 14:49
  • True, although it depends a lot on the number of transformations to be made. In any case, I went for this approach because the provided example seems random, at least to me, and there may be more information to parse or complex cases that come up after he tries the answers given; in that case preg may offer more flexibility. But of course, all of that is just in my mind, and I may be completely wrong about what I did. – PatomaS Oct 19 '12 at 15:07
  • About the edit, I posted it the way I did because I think it's easier to understand. Still, your edit is OK. – PatomaS Oct 19 '12 at 15:10
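
For comparison, the point about plain string replacement could be sketched as follows. The function name breaksWithStrReplace is made up, and this is only an approximation of the preg_replace version above: it does the same three steps in the same order, but removes a trailing <br> only when it is the very last thing in the string.

```php
<?php
// Hypothetical helper: the same three steps without regular expressions.
function breaksWithStrReplace(string $text): string
{
    // str_replace applies the pairs in order, each to the result of the last:
    // first add <br> after every newline, then merge doubled <br> tags.
    $text = str_replace(array("\n", "<br><br>"), array("\n<br>", "<br>"), $text);
    // Drop one trailing <br>, if the string ends with it.
    if (substr($text, -4) === '<br>') {
        $text = substr($text, 0, -4);
    }
    return $text;
}
```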