25

How can I trim a string(6) " page", where the first whitespace is a 0xc2a0 non-breaking space?

I've tried trim() and preg_match('/^\s*(.*)\s*$/u', $key, $m);.

Another question: How can I reliably copy these characters? They seem to be converted to "normal" spaces, which makes it hard to debug.

miken32
Markus Hedlund
  • What do you mean with 'copy'? Copy what from where? From web page to IDE/editor? – Anti Veeranna Nov 12 '10 at 17:05
  • @antil i would say from browser. i have a similar problem and chrome, firefox etc, just show it as a space in utf-8 and  when in ISO-8859-1 –  Jan 08 '11 at 16:44

7 Answers

52
preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$str);
bcosca
  • 2
    Sadly, before I made my way down to your answer, I found it myself. But you forgot the `u` modifier, and `s` makes little sense. Working code: `preg_replace('/^\p{Z}+|\p{Z}+$/u', '', $key)`. – Markus Hedlund Nov 12 '10 at 17:59
  • 2
    Updated to `preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u'`. – Markus Hedlund Nov 12 '10 at 18:11
  • @Znarkus: Does PHP not support, or at least not encourage, using non-insane-mode on regexes via `/x` or `(?x)`? – tchrist Nov 13 '10 at 13:57
  • If I want to remove all white spaces, do I also need to call trim or is this enough? – Adam Feb 22 '19 at 15:44
  • 1
    @Adam: The expression will trim ALL leading/trailing whitespaces without the need for trim(). – bcosca Feb 24 '19 at 23:29
  • This works for me. I needed to remove some strange "ZWSP" unicode characters and this snippet worked. I'm somewhat familiar with regex, but can somebody explain why this works? I don't understand the idea of matching the same char pattern with similar quantifiers and a simple OR condition. – fnagel Apr 07 '22 at 21:46
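On fnagel's question, a minimal sketch of why the two-branch pattern works: `^[\pZ\pC]+` can only match a run at the very start of the string, `[\pZ\pC]+$` can only match a run at the very end, and preg_replace() replaces every match of either branch. So at most two runs are removed and interior whitespace is left alone. The sample string and the helper name `show_matches` are made up for illustration.

```php
<?php
// Hypothetical sample: NBSP + ZWSP, text with an inner space, then space + NBSP.
$str = "\u{00A0}\u{200B}my page \u{00A0}";
$pattern = '/^[\pZ\pC]+|[\pZ\pC]+$/u';

// Return each match as hex bytes so invisible characters become visible.
function show_matches(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m);
    return array_map('bin2hex', $m[0]);
}

// First match is the leading run, second match is the trailing run.
print_r(show_matches($pattern, $str));

// Both runs are replaced with the empty string; the inner space survives.
$trimmed = preg_replace($pattern, '', $str);
var_dump($trimmed === 'my page');
```

The inner space between "my" and "page" is never matched because neither anchor can be satisfied at that position.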
6

PCRE Unicode properties can be used to achieve this.

Here is the code that I played with and seems to do what you want:

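<?php
// Trim leading/trailing Unicode separators (\pZ) and "other" characters (\pC).
// The lazy (.*?) keeps the inner text; /s lets "." match newlines inside it.
// Using * instead of + also handles strings padded on one side only.
function unicode_trim ($str) {
    return preg_replace('/^[\pZ\pC]*(.*?)[\pZ\pC]*$/us', '$1', $str);
}

// NBSP (bytes 0xC2 0xA0) on both sides of the payload
$key = chr(0xc2) . chr(0xa0) . '#page#' . chr(0xc2) . chr(0xa0);

var_dump(unicode_trim($key));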

Result

[~]> php e.php
string(6) "#page#"

Explanation:

\p{xx} a character with the xx property
\P{xx} a character without the xx property

If xx has only one character, then {} can be dropped, e.g. \p{Z} is the same as \pZ

Z stands for all separators, C stands for all "other" characters (for example control characters)
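To make the property classes concrete, here is a small sketch that checks a few characters against \pZ and \pC. The helper name `char_class` and the sample characters are assumptions chosen for illustration, not part of the answer itself.

```php
<?php
// Classify a single UTF-8 character as separator (Z), other (C), or neither (-).
function char_class(string $ch): string
{
    if (preg_match('/^\pZ\z/u', $ch) === 1) {
        return 'Z';
    }
    if (preg_match('/^\pC\z/u', $ch) === 1) {
        return 'C';
    }
    return '-';
}

$samples = [
    'SPACE'            => ' ',
    'NO-BREAK SPACE'   => "\u{00A0}",
    'TAB'              => "\t",
    'ZERO WIDTH SPACE' => "\u{200B}",
    'LETTER a'         => 'a',
    'NUMBER SIGN'      => '#',
];

foreach ($samples as $name => $ch) {
    printf("%-16s %s\n", $name, char_class($ch));
}
```

The two spaces come out as Z, the tab and the zero-width space as C, and the letter and '#' as neither, so a trim built on [\pZ\pC] leaves ordinary punctuation alone.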

sanmai
Anti Veeranna
  • That would delete any non letter character (#!%123...). – Vincent Savard Nov 12 '10 at 17:06
  • Which trailing character did it not strip? It definitely does strip trailing 0xc2a0 - as you can see from the output I pasted above. But I have updated the regexp a few times, so maybe you tried with an earlier one. – Anti Veeranna Nov 12 '10 at 18:00
  • Normal spaces, think it was the latest version. Thanks a lot though, very helpful! – Markus Hedlund Nov 12 '10 at 18:13
  • @Anti Veeranna: Which version of PCRE does PHP use? I had the impression it was a bit behind, but I may be wrong. – tchrist Nov 13 '10 at 13:54
5

The existing solutions mention only \pZ characters. However, there are six Unicode whitespace characters that fall outside the purview of that property:

% unichars '\p{WhiteSpace}' '\PZ'
 --    9 0009 CHARACTER TABULATION
 --   10 000A LINE FEED (LF)
 --   11 000B LINE TABULATION
 --   12 000C FORM FEED (FF)
 --   13 000D CARRIAGE RETURN (CR)
 --  133 0085 NEXT LINE (NEL)

Those six are all of type \pC, and in particular, type \p{Cc}. However there are fifty-nine non-whitespace characters that are also \p{Cc}:

% unichars '\P{WhiteSpace}' '\p{Cc}' | wc -l
      59

The simple version of my own test for whether something is a printable character or not is simply [\pZ\pC]; that’s what unichars uses, for example.
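The same point can be checked from PHP. As a sketch: each of the six characters listed above fails \pZ but matches \p{Cc}, which is why a trim pattern built on \pZ alone leaves a trailing newline or NEL in place. The variable names and the sample string are illustrative assumptions.

```php
<?php
// The six White_Space characters outside \pZ, keyed by code point label.
$six = [
    'U+0009' => "\u{0009}",
    'U+000A' => "\u{000A}",
    'U+000B' => "\u{000B}",
    'U+000C' => "\u{000C}",
    'U+000D' => "\u{000D}",
    'U+0085' => "\u{0085}",
];

foreach ($six as $label => $ch) {
    $inZ  = preg_match('/^\pZ\z/u', $ch) === 1;
    $inCc = preg_match('/^\p{Cc}\z/u', $ch) === 1;
    printf("%s  Z=%s  Cc=%s\n", $label, var_export($inZ, true), var_export($inCc, true));
}

// Hypothetical input ending in NEL + LF.
$s = "page\u{0085}\n";
$onlyZ = preg_replace('/^\pZ+|\pZ+$/u', '', $s);           // \pZ alone: nothing to trim
$zAndC = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $s); // Z plus C: both removed

var_dump($onlyZ === $s);
var_dump($zAndC === 'page');
```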

A more careful test would consider whether something should take up 0, 1, or 2 print positions. That requires considering whether it’s a combining Mark, which is property \pM, and also whether it has the half-width or full-width properties. For example:

% uniprops ff5e ffeb
U+FF5E ‹～› \N{ FULLWIDTH TILDE }:
    \pS \p{Sm}
    All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
       CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
       Math_Symbol Print Symbol
U+FFEB ‹￫› \N{ HALFWIDTH RIGHTWARDS ARROW }:
    \pS \p{Sm}
    All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded
       CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base Graph GrBase Math
       Math_Symbol Print Symbol

For those, you would need to use the non-binary East Asian Width property. These are applicable:

% uniprops -l | grep -i width
Block:Halfwidth_And_Fullwidth_Forms
InHalfwidthAndFullwidthForms
East_Asian_Width:A
East_Asian_Width=Ambiguous
East_Asian_Width:Ambiguous
East_Asian_Width:F
East_Asian_Width=Fullwidth
East_Asian_Width:Fullwidth
East_Asian_Width:H
East_Asian_Width=Halfwidth
East_Asian_Width:Halfwidth
East_Asian_Width=Neutral
East_Asian_Width:Na
East_Asian_Width=Narrow
East_Asian_Width:Narrow
East_Asian_Width:Neutral
East_Asian_Width:W
East_Asian_Width=Wide
East_Asian_Width:Wide

Those have abbreviations like \p{Ea=F} and \p{Ea=H}. There are a bunch of these:

% uninames '(FULL|HALF)WIDTH' | wc -l
     454

Of course, you mustn’t go on names for these things, but on properties:

% unichars '[\p{Ea=F}\p{Ea=H}]' | wc -l
     227
% unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}]' | wc -l
     338
% unichars '[\p{Ea=F}\p{Ea=H}\p{Ea=Na}\pM]' | wc -l
    1488

To show you how many, many properties these things truly have, here’s the full property dump of three different characters, run against Unicode 5.2:

% uniprops -ga NEL "COMBINING TILDE" ff5e 
U+0085 ‹U+0085› \N{ NEXT LINE (NEL) }:
    \s \v \R \pC \p{Cc}
    All Any Assigned InLatin1 C Other Cc Cntrl Common Zyyy Control Pat_WS Pattern_White_Space PatWS Space SpacePerl VertSpace
       White_Space WSpace
    Age:1.1 Bidi_Class:B Bidi_Class=Paragraph_Separator Bidi_Class:Paragraph_Separator Bc=B Block:Latin_1
       Block=Latin_1_Supplement Block:Latin_1_Supplement Blk=Latin1 General_Category=Other Canonical_Combining_Class:0
       Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR Canonical_Combining_Class:NR
       General_Category=Control Script=Common Decomposition_Type:None Dt=None East_Asian_Width=Neutral East_Asian_Width:Neutral
       General_Category:C General_Category:Cc General_Category:Cntrl General_Category:Control Gc=Cc General_Category:Other Gc=C
       Grapheme_Cluster_Break:CN Grapheme_Cluster_Break=Control Grapheme_Cluster_Break:Control GCB=CN Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
       Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:Next_Line Lb=NL
       Line_Break:NL Line_Break=Next_Line Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
       Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
       Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
       Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:SE Sentence_Break=Sep Sentence_Break:Sep SB=SE Word_Break:Newline WB=NL
       Word_Break:NL Word_Break=Newline
U+0303 ‹̃› \N{ COMBINING TILDE }:
    \w \pM \p{Mn}
    All Any Assigned InCombiningDiacriticalMarks Case_Ignorable CI Dia Diacritic M Mn Gr_Ext Grapheme_Extend Graph GrExt
       ID_Continue IDC Inherited Zinh Mark Nonspacing_Mark Print Qaai Word XID_Continue XIDC
    Age:1.1 Bidi_Class:Nonspacing_Mark Bc=NSM Bidi_Class:NSM Bidi_Class=Nonspacing_Mark Block:Combining_Diacritical_Marks
       Canonical_Combining_Class:230 Canonical_Combining_Class=Above Canonical_Combining_Class:A
       Canonical_Combining_Class:Above Ccc=A Decomposition_Type:None Dt=None East_Asian_Width:A East_Asian_Width=Ambiguous
       East_Asian_Width:Ambiguous Ea=A General_Category:M General_Category=Mark General_Category:Mark Gc=M General_Category:Mn
       General_Category=Nonspacing_Mark General_Category:Nonspacing_Mark Gc=Mn Grapheme_Cluster_Break:EX
       Grapheme_Cluster_Break=Extend Grapheme_Cluster_Break:Extend GCB=EX Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Script=Inherited
       Joining_Group:No_Joining_Group Jg=NoJoiningGroup Joining_Type:T Joining_Type=Transparent Joining_Type:Transparent Jt=T
       Line_Break:CM Line_Break=Combining_Mark Line_Break:Combining_Mark Lb=CM NFC_Quick_Check:M NFC_Quick_Check=Maybe
       NFC_Quick_Check:Maybe NFCQC=M NFKC_Quick_Check:M NFKC_Quick_Check=Maybe NFKC_Quick_Check:Maybe NFKCQC=M
       Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1
       In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1
       Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Inherited Sc=Zinh Script:Qaai Script:Zinh
       Sentence_Break:EX Sentence_Break=Extend Sentence_Break:Extend SB=EX Word_Break:Extend WB=Extend
U+FF5E ‹～› \N{ FULLWIDTH TILDE }:
    \pS \p{Sm}
    All Any Assigned InHalfwidthAndFullwidthForms Changes_When_NFKC_Casefolded CWKCF Common Zyyy Sm S Gr_Base Grapheme_Base
       Graph GrBase Math Math_Symbol Print Symbol
    Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral Bidi_Class:Other_Neutral Bc=ON Block:Halfwidth_And_Fullwidth_Forms
       Canonical_Combining_Class:0 Canonical_Combining_Class=Not_Reordered Canonical_Combining_Class:Not_Reordered Ccc=NR
       Canonical_Combining_Class:NR Script=Common Decomposition_Type:Non_Canon Decomposition_Type=Non_Canonical
       Decomposition_Type:Non_Canonical Dt=NonCanon Decomposition_Type:Wide Dt=Wide East_Asian_Width:F
       East_Asian_Width=Fullwidth East_Asian_Width:Fullwidth Ea=F General_Category:Math_Symbol Gc=Sm General_Category:S
       General_Category=Symbol General_Category:Sm General_Category=Math_Symbol General_Category:Symbol Gc=S
       Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
       Hangul_Syllable_Type=Not_Applicable Hangul_Syllable_Type:Not_Applicable Hst=NA Joining_Group:No_Joining_Group
       Jg=NoJoiningGroup Joining_Type:Non_Joining Jt=U Joining_Type:U Joining_Type=Non_Joining Line_Break:ID
       Line_Break=Ideographic Line_Break:Ideographic Lb=ID Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1
       Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2
       In=3.2 Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
       Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX Sentence_Break:XX Sentence_Break=Other Word_Break:Other
       WB=XX Word_Break:XX Word_Break=Other

Pretty stunning, eh?

If you’ve read this far and would like to know where to get the three Unicode utilities illustrated above, uniprops, unichars, and uninames, please send me mail, because the current links aren’t working right now.

tchrist
  • 1
    There's also some stuff in the other/format block - mongolian vowel separator, zero-width (space/joiner/non-joiner), word joiner, and zero width non-breaking space. Whether or not they're splittable should lead to massive arguments. – John Haugeland Aug 17 '15 at 20:39
2

Perhaps something from the multibyte string set of functions? http://php.net/manual/en/function.mb-ereg.php Can't see mb_trim, but there is a set of MB safe regex functions.

G

dartacus
1

This page may help:

http://nadeausoftware.com/articles/2007/9/php_tip_how_strip_punctuation_characters_web_page

philfreo
0

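Here is the only solution that worked for me, because the string sometimes contains UTF-8 spaces:

// Trim leading and trailing Unicode separators and control characters
$stringg = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u','',$stringg);
// Caution: this second call removes ALL remaining whitespace, including spaces inside the string
$stringg = preg_replace('/\s+/u', '', $stringg);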
T.Todua
-1

None of the answers above actually worked to remove trailing whitespace in UTF-8 strings.

This solution found here works perfectly and is the shortest:

trim($str, "\t\n\r\0\x0B\xC2\xA0");
Vadim Podlevsky
  • Your answer is **NOT** Unicode/UTF-8 compliant: you'll remove any byte with a value of 0xC2 **or** 0xA0 or ..., not the sequence of the two successive bytes 0xC2 then 0xA0 (the way U+00A0 is encoded in UTF-8). This might result (besides without assignment in your answer) in corrupted UTF-8 strings! – julp Oct 03 '19 at 19:41
  • I get your point. Yes, this might potentially break UTF-8 strings, because their chars are multibyte. But in my tests I have encountered 0 errors with this function on Russian-Ukrainian-English texts. Maybe I'm just lucky, but that worked for me. – Vadim Podlevsky Oct 03 '19 at 21:10
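julp's point can be reproduced with a short sketch. trim() treats its second argument as a set of single bytes, so a character at the edge of the string whose UTF-8 encoding ends in byte 0xA0 (or starts with 0xC2) gets cut in half. The sample word is an assumption, picked because 'à' (U+00E0) encodes as 0xC3 0xA0.

```php
<?php
$str = "voil\u{00E0}";                        // "voilà", ends in bytes c3 a0
$bad = trim($str, "\t\n\r\0\x0B\xC2\xA0");    // byte-wise trim strips the lone a0

echo bin2hex($str), "\n";                     // 766f696cc3a0
echo bin2hex($bad), "\n";                     // 766f696cc3

// preg_match() with /u returns false when the subject is not valid UTF-8.
var_dump(preg_match('//u', $bad));            // bool(false)

// The property-based pattern works on whole code points instead:
// append a real NBSP and trim it off without touching the final letter.
$good = preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str . "\u{00A0}");
var_dump($good === $str);                     // bool(true)
```

Text that never has such a character at either end will not trigger the problem, which may be why it went unnoticed in the tests mentioned above.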