Inserting spaces between UTF8 words - what is wrong with my 2 simple regexes?

Question

I have a web page where users can leave comments in Russian (UTF-8), like this one:

Хорошо, четко , уверено!Удачи!(БОРИС)

Some users abuse it for fun by leaving out the spaces between words:

НеСпитьсяЖукуНиЗимою,НиЛетом,лучшеПитатьсяСолнечнымСветом, лучшеСидетьЗаИгорнымСтолом,иНаслаждатьсяКаждымВистом, лучшеНоситьЗолотыеОдежды,искритьсяВсегда,неТеряяНадежды,лучшеПустьДругОстаетсяБезВзятки,ведьНевозможноЖукуЖитьБезЛапки!

which results in very wide HTML-table rows, breaking my layout.

I'm trying to counter this by finding comments that contain a run of 60 or more non-space characters and inserting a single space after punctuation characters (such as commas), using this piece of PHP code:

            if (preg_match('/\S{60,}/u', $about) == 1) {
                    error_log('Splitting comment: ' . $about);
                    $about = preg_replace('/(\p{P}+\s*)/u', '$1 ', $about);
                    error_log('===Result comment: ' . $about);
            }

However, this doesn't work and has at least 2 problems:

1. Every comment is matched, even a short one like the one at the top.
2. The `\s*` doesn't seem to be greedy: "comma and space" is replaced by "comma space space" for some reason.

Here is an excerpt of my log file:

[04-Jun-2012 09:50:10] Splitting comment: Хорошо, четко , уверено!Удачи!(БОРИС)
[04-Jun-2012 09:50:10] ===Result comment: Хорошо,  четко ,  уверено! Удачи!( БОРИС)

[04-Jun-2012 09:50:10] Splitting comment: НеСпитьсяЖукуНиЗимою,НиЛетом,лучшеПитатьсяСолнечнымСветом,
лучшеСидетьЗаИгорнымСтолом,иНаслаждатьсяКаждымВистом, лучшеНоситьЗолотыеОдежды,искритьсяВсегда,неТеряяНадежды,лучшеПустьДругОстаетсяБезВзятки,ведьНевозможноЖукуЖитьБезЛапки!(nusja)
[04-Jun-2012 09:50:10] ===Result comment: НеСпитьсяЖукуНиЗимою, НиЛетом, лучшеПитатьсяСолнечнымСветом,
 лучшеСидетьЗаИгорнымСтолом, иНаслаждатьсяКаждымВистом,  лучшеНоситьЗолотыеОдежды, искритьсяВсегда, неТеряяНадежды, лучшеПустьДругОстаетсяБезВзятки, ведьНевозможноЖукуЖитьБезЛапки!( nusja)

I tried doubling the backslashes - this hasn't changed anything.

I'm using the stock PHP on the latest CentOS Linux 5.x and 6.x:

# php -v
PHP 5.3.3 (cli) (built: May  7 2012 17:58:11)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2010 Zend Technologies

@Pateman: It's [not Unicode-aware](http://stackoverflow.com/questions/3825226/multi-byte-safe-wordwrap-function-for-utf-8) however. — mario, Jun 04 '12 at 15:04
Yes, I get `PHP Warning: json_encode(): Invalid UTF-8 sequence in argument` later - so wordWrap() seems to break the Unicode chars :-( — Alexander Farber, Jun 04 '12 at 15:14

score 3 · Accepted Answer · answered Jun 04 '12 at 15:03

Try wordwrap(). It's a built-in function and I think that it should help you.

If you're looking for a reg-exp solution, there's one in the comments - look here. It's UTF-8 safe, so it should work for your Russian site.
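
As a rough illustration of what a UTF-8-aware regex approach could look like (this is a sketch, not the code behind the link above): the function name `break_long_runs` and the default width of 40 are assumptions made for this example. It relies on the `/u` modifier so that the quantifier counts code points rather than bytes.

```php
<?php
// Hypothetical helper, named for this example only.
// Inserts a space after every $width consecutive non-whitespace characters.
function break_long_runs($text, $width = 40)
{
    $width = max(1, (int) $width);

    // With /u the pattern works on UTF-8 code points, so a multi-byte
    // Cyrillic letter is never cut in half.
    // The lookahead (?=[^\s]) skips runs that end exactly at $width,
    // so no trailing space is added there.
    return preg_replace('/([^\s]{' . $width . '})(?=[^\s])/u', '$1 ', $text);
}
```

For example, `break_long_runs($about, 40)` would leave ordinary comments untouched and only split runs longer than 40 characters. Note that it counts code points, so a base letter and a following combining mark could still end up on opposite sides of an inserted space.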

Pateman


Doesn't work: `$about = wordwrap($about, 40, ' ', true);` gives me **null** and a PHP Warning: `json_encode(): Invalid UTF-8 sequence in argument` – Alexander Farber Jun 04 '12 at 15:17

Try the UTF-8 version that I provided in the second link. :) – Pateman Jun 04 '12 at 15:19

Great - you have answered my 1st question as `[^\s]` seems to work better than `\S` (is it a bug in PHP 5.3?). And I realized the 2nd regex is better written as `$about = preg_replace('/(\p{P}+)\s*/u', '$1 ', $about);` :-) – Alexander Farber Jun 04 '12 at 15:29
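
Putting the two changes from the comment above together, the corrected snippet might look roughly like this. It is a sketch under assumptions: the wrapper function `space_after_punctuation` and its `$maxRun` parameter are hypothetical names introduced here, while the threshold of 60 and both patterns come from the question and the comment.

```php
<?php
// Hypothetical wrapper around the two regexes discussed above.
function space_after_punctuation($about, $maxRun = 60)
{
    // Act only when some run of non-whitespace characters is too long.
    // [^\s] is used instead of \S, as reported to work in the comment.
    if (preg_match('/[^\s]{' . (int) $maxRun . ',}/u', $about) !== 1) {
        return $about;
    }

    // Only the punctuation is captured. Whitespace that follows is matched
    // outside the group, so it is replaced by exactly one space instead of
    // being kept and then getting another space appended.
    return preg_replace('/(\p{P}+)\s*/u', '$1 ', $about);
}
```

With the original pattern `(\p{P}+\s*)` the existing space was part of `$1`, which would explain the "comma space space" output in the log. This version still adds a space after an opening parenthesis and after punctuation at the very end of the comment, so an additional `rtrim()` or a narrower character class may be wanted.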

score 1 · Answer 2 · answered Jun 04 '12 at 15:12

You could place the comment in a span or div element.

<td><span class="comment">лучшеПитатьсяСолнечнымСветом</span></td>

And put the following definition in your CSS file:

span.comment
{
    display: inline-block; /* max-width and overflow have no effect on a plain inline span */
    max-width: 300px;
    overflow: hidden;
}

This will simply hide anything that's wider than you intended.

Waut


Good suggestion, but I'd prefer not to hide anything and just break the few words that are too long. Also, I wonder what is wrong with my regexes, as I tend to use regexes often (mostly in Perl, but I have to use PHP this time). – Alexander Farber Jun 04 '12 at 15:15
