
Can anyone tell me why this regex is causing a PHP segmentation fault?

$text = preg_replace('~[\s\r\n]+(?=(?:(?!<tr).)*<\/tr>)~is', ' ', $text);

I need to remove line breaks (\n\r) in tr elements. Maybe there is a better regex to do that or maybe there is a non-regex solution?

UPDATE:

I need to remove line breaks only inside tr elements. Other line breaks should be left untouched.

UPDATE2:

I am not parsing HTML with regex. I am getting email body (it can be huge html without tables, it can be plain text already), removing line breaks in tr's, stripping HTML tags and using plain text.

UPDATE3:

Please do not answer "use parser" or downvote. I don't think it suits this case very well, and if I am wrong, please explain why I am wrong. I will really appreciate it. Thank you.

Lukas Ignatavičius
  • `maybe there is non-regex solution?` Sure, use a HTML parser. – Toto Jan 09 '14 at 13:13
  • [Obligatory...](http://stackoverflow.com/a/1732454/1223693) do **not** use regex to parse HTML. – tckmn Jan 09 '14 at 13:15
  • @pregmatch ty for the suggestion, but I am parsing emails so lynx is not an option – Lukas Ignatavičius Jan 09 '14 at 13:28
  • @DoorknobofSnow I don't use regex to parse HTML. I am using it to remove line breaks inside `<tr>` tags, and then I am stripping HTML tags and using the plain text. – Lukas Ignatavičius Jan 09 '14 at 13:33
  • The seg fault probably is caused by stack overflow, and I guess it is due to `(?! – nhahtdh Jan 09 '14 at 13:39
  • @LukasIgnatavičius That's parsing html and removing portions of it.. still very much parsing and susceptible to the dangers of it. – Mike B Jan 09 '14 at 13:42
  • @MikeB I am removing `\n\r` and I think these are not part of HTML – Lukas Ignatavičius Jan 09 '14 at 14:11
  • @LukasIgnatavičius So? Your regex depends on the html being semantically valid and regular.. which HTML is not. Did you read the [famous diatribe](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) which comically details all the reasons why this is not necessarily the way to go? I'm starting to feel like bobince, repeating myself over and over. – Mike B Jan 09 '14 at 14:14
  • @MikeB If regex is such a bad solution to this problem could you please give me better way to achieve same result `$text = $email_body;` `$text = strip_tags(preg_replace_callback('##is',function($m){return preg_replace('/[\r\n]+/',' ',$m[0]);},$text));`? P.S. email body can already be plain text. – Lukas Ignatavičius Jan 09 '14 at 15:26
  • @LukasIgnatavičius People have given you every resource you need to convince you that it's possible and even PREFERRED over regex. There's volumes of questions with people asking how to parse html with regex and answers detailing how to use dom parsers instead. This is the 3rd time I've repeated myself. I'm not writing your code for you to convince you of something that shouldn't require much convincing. – Mike B Jan 09 '14 at 16:41
  • @MikeB I just don't think I should use an HTML parser if the HTML tags aren't important to my task, with one exception: my plain text should not break inside a tr element. Should I load a huge string into a parser just to remove a few \n characters from it? And that string may already contain no HTML. I can write the code myself, I am just asking for an idea of how to do it using your suggested method. And btw, doesn't HTML ignore \n characters (except in inputs)? – Lukas Ignatavičius Jan 09 '14 at 21:01

1 Answer


I think preg_replace_callback() would be the best tool for the job. Try this:

$text = preg_replace_callback('#<tr.+?</tr>#is',
                              function($m){return preg_replace('/[\r\n]+/',' ',$m[0]);},
                              $text);
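
Here is a minimal, self-contained sketch of the same idea, in case it helps to test it on a small input first. The function name `collapse_tr_line_breaks` and the sample string are just placeholders I made up for illustration; the pattern and the callback are the ones shown above. The lazy `.+?` keeps each match confined to a single row, so line breaks outside of rows are left alone, and text without any `<tr` passes through unchanged. I also added a null check, on the assumption that you would rather keep the original text than lose it if PCRE reports an error on a very large body.

```php
<?php
// Hypothetical helper name; the regex and callback come from the answer above.
function collapse_tr_line_breaks(string $text): string
{
    $result = preg_replace_callback(
        '#<tr.+?</tr>#is',
        function (array $m): string {
            // Only the matched <tr>...</tr> block is rewritten:
            // each run of \r and \n becomes a single space.
            return preg_replace('/[\r\n]+/', ' ', $m[0]);
        },
        $text
    );

    // preg_replace_callback() returns null when PCRE reports an error,
    // so fall back to the untouched input in that case (an assumption
    // about the desired behaviour, not part of the original snippet).
    return $result === null ? $text : $result;
}

// Made-up sample: line breaks inside and outside of a table row.
$sample = "Hello\nWorld\n<table>\n<tr>\n<td>a</td>\r\n<td>b</td>\n</tr>\n</table>\nBye";

echo strip_tags(collapse_tr_line_breaks($sample));
```

Since you strip the tags afterwards anyway, the `strip_tags()` call is simply applied to the result, as in the one-liner you posted in the comments.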
r3mainer