
I have a database full of titles and descriptions from RSS feed items that come from different sources and in different languages.

This question is not about white spaces, but about keeping words and punctuation.

I'm trying to keep ONLY words, together with punctuation such as ' " , . ; ( ) ! ?, and to remove tabs, double spaces, new lines, etc.

I have a partially working solution, but in my database I still see line breaks between paragraphs and empty lines. I also remove tags because I want to keep only the text.

$onlywords = strip_tags(html_entity_decode($insUrlsOk['rss_summary'])); // html_entity_decode because sometimes it's &lt; instead of <
$onlywords = trim($onlywords); // works partially: I still have paragraph breaks and empty lines
$onlywords = preg_replace('/[^\w\s]+/u',' ',$onlywords); // keeps ONLY words from any language but also removes punctuation
$onlywords = str_replace('  ',' ',$onlywords);

I think my preg pattern '/[^\w\s]+/u' needs to be a bit more refined...

I'm also open to other solutions, as long as they are short, stay within a few lines of code, and need no extra plugins installed on the server.

Thanks.

  • Possible duplicate of [Remove multiple whitespaces](https://stackoverflow.com/questions/2326125/remove-multiple-whitespaces) – mickmackusa Jul 15 '17 at 23:48

1 Answer


trim() only removes whitespace at the beginning and end of the string, so it won't get rid of paragraphs.

Newlines and tabs are included in \s, so the preg_replace() keeps them. Use preg_replace instead of str_replace to turn all sequences of whitespace into a single space:

$onlywords = preg_replace('/\s+/', ' ', $onlywords);
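
To make the difference concrete, here is a minimal sketch on a made-up sample string (the variable names and the text are illustrative, not taken from the feed data). It uses the `+` quantifier so that a lone tab or newline is also replaced by a space; a `{2,}` quantifier would leave single whitespace characters untouched.

```php
<?php
// Hypothetical sample text, not from the original database.
$sample = "  First line.\n\n\tSecond   line.\r\nThird line.  ";

// trim() strips whitespace only at the two ends of the string,
// so the line breaks and the tab in the middle are still there.
$trimmed = trim($sample);

// \s+ matches every run of spaces, tabs and line breaks,
// including a run of length one, and replaces it with one space.
$collapsed = trim(preg_replace('/\s+/', ' ', $sample));

var_dump($trimmed === "First line.\n\n\tSecond   line.\r\nThird line."); // bool(true)
echo $collapsed, "\n"; // First line. Second line. Third line.
```

The final trim() is still useful here, because the collapse leaves one space at each end when the input starts or ends with whitespace.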
Barmar
  • Hi, thanks. It worked ;). I also changed my regex pattern to `$onlywords = preg_replace('/[^\w.,%!?]+/u',' ',$onlywords);`. The `[^`... means "anything but..." and `\w` matches word characters. I added the `.` `,` ... right after the `\w`. –  Jul 15 '17 at 19:06
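
Putting the pieces of this thread together, the whole cleanup might be sketched as below. This is only one possible arrangement, assuming PHP 7 or later and UTF-8 input; the function name `cleanFeedText()` and the sample string are hypothetical, and the punctuation list is the one from the question rather than the shorter one in the comment above. Because whitespace is not in the keep-list, tabs and line breaks are collapsed in the same pass and no separate whitespace step is needed.

```php
<?php
// Hypothetical helper; the name is not from the original post.
function cleanFeedText(string $html): string
{
    // Decode entities first so that &lt;p&gt; becomes <p> and can be stripped.
    $text = strip_tags(html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8'));

    // Replace every run of characters that is neither a word character
    // nor one of ' " , . ; ( ) ! ? with a single space.
    // preg_replace() returns null on invalid UTF-8 when /u is set,
    // so fall back to the tag-stripped text in that case.
    $clean = preg_replace('/[^\w\'",.;()!?]+/u', ' ', $text) ?? $text;

    // Remove the single space that may remain at either end.
    return trim($clean);
}

echo cleanFeedText("&lt;p&gt;Hello,\n\n\tworld! (it's   \"fine\") &amp; done.&lt;/p&gt;"), "\n";
// Hello, world! (it's "fine") done.
```

Note that any character outside the keep-list, such as the `&` in the sample, is dropped along with the whitespace around it, so the list should be extended if symbols like `%` or `-` matter for the stored text.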