396

I sometimes want to match whitespace but not newline.

So far I've been resorting to [ \t]. Is there a less awkward way?

Borodin
JoelFan

7 Answers

535

Use a double-negative:

/[^\S\r\n]/

That is, not any of: not-whitespace (the capital S complements \s), carriage return, or newline. Distributing the outer not (i.e., the complementing ^ in the character class) with De Morgan's law, this is equivalent to “whitespace, and not carriage return, and not newline.” Including both \r and \n in the pattern correctly handles all of Unix (LF), classic Mac OS (CR), and DOS-ish (CR LF) newline conventions.

No need to take my word for it:

#! /usr/bin/env perl

use strict;
use warnings;

use 5.005;  # for qr//

my $ws_not_crlf = qr/[^\S\r\n]/;

for (' ', '\f', '\t', '\r', '\n') {
  my $qq = qq["$_"];
  printf "%-4s => %s\n", $qq,
    (eval $qq) =~ $ws_not_crlf ? "match" : "no match";
}

Output:

" "  => match
"\f" => match
"\t" => match
"\r" => no match
"\n" => no match

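Note that on Perl versions before v5.18 the pattern also excludes vertical tab, because \s did not match it there; from v5.18 on, \s (and so this pattern) matches vertical tab.

Before you object too harshly, note that the Perl documentation uses the same technique. A footnote in the “Whitespace” section of perlrecharclass reads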

Prior to Perl v5.18, \s did not match the vertical tab. [^\S\cK] (obscurely) matches what \s traditionally did.

The same section of perlrecharclass also suggests other approaches that won’t offend language teachers’ opposition to double-negatives.

Outside locale and Unicode rules or when the /a switch is in effect, “\s matches [\t\n\f\r ] and, starting in Perl v5.18, the vertical tab, \cK.” Discard \r and \n to leave /[\t\f\cK ]/ for matching whitespace but not newline.
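As a minimal sketch of that ASCII-only class in use (the variable names and the sample string are made up for illustration), the following collapses runs of whitespace inside lines while leaving the line endings alone:

```perl
use strict;
use warnings;

# ASCII whitespace minus \r and \n: tab, form feed, vertical tab (\cK), space.
my $ascii_ws_not_nl = qr/[\t\f\cK ]/;

# Sample input: a CR LF line followed by an LF line.
my $text = "a \t b\r\nc\f\x0B d\n";

# Collapse each run of non-newline whitespace to a single space.
(my $collapsed = $text) =~ s/$ascii_ws_not_nl+/ /g;

# Make the line endings visible before printing.
(my $visible = $collapsed) =~ s/\r/\\r/g;
$visible =~ s/\n/\\n/g;
print "$visible\n";
```

The \r and \n characters pass through untouched, so the file's newline convention is preserved.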

If your text is Unicode, use code similar to the sub below to construct a pattern from the table in the aforementioned documentation section.

sub ws_not_nl {
  local($_) = <<'EOTable';
0x0009        CHARACTER TABULATION   h s
0x000a              LINE FEED (LF)    vs
0x000b             LINE TABULATION    vs  [1]
0x000c              FORM FEED (FF)    vs
0x000d        CARRIAGE RETURN (CR)    vs
0x0020                       SPACE   h s
0x0085             NEXT LINE (NEL)    vs  [2]
0x00a0              NO-BREAK SPACE   h s  [2]
0x1680            OGHAM SPACE MARK   h s
0x2000                     EN QUAD   h s
0x2001                     EM QUAD   h s
0x2002                    EN SPACE   h s
0x2003                    EM SPACE   h s
0x2004          THREE-PER-EM SPACE   h s
0x2005           FOUR-PER-EM SPACE   h s
0x2006            SIX-PER-EM SPACE   h s
0x2007                FIGURE SPACE   h s
0x2008           PUNCTUATION SPACE   h s
0x2009                  THIN SPACE   h s
0x200a                  HAIR SPACE   h s
0x2028              LINE SEPARATOR    vs
0x2029         PARAGRAPH SEPARATOR    vs
0x202f       NARROW NO-BREAK SPACE   h s
0x205f   MEDIUM MATHEMATICAL SPACE   h s
0x3000           IDEOGRAPHIC SPACE   h s
EOTable

  my $class;
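  # Capture the code point and the full name, including any "(LF)"-style
  # abbreviation, up to the h/v classification column.
  while (/^0x([0-9a-f]{4})\s+([A-Z() -]+?)\s+(?:h s|vs)/mg) {
    my($hex,$name) = ($1,$2);
    # Skip the line terminators: LF, CR, NEL, and the two SEPARATORs.
    next if $name =~ /\b(?:CR|LF|NEL|SEPARATOR)\b/;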
    $class .= "\\N{U+$hex}";
  }

  qr/[$class]/u;
}

Other Applications

The double-negative trick is also handy for matching alphabetic characters. Remember that \w matches “word characters”: alphabetic characters, digits, and underscore. We ugly-Americans sometimes want to write it as, say,

if (/[A-Za-z]+/) { ... }

but a double-negative character-class can respect the locale:

if (/[^\W\d_]+/) { ... }

Expressing “a word character but not digit or underscore” this way is a bit opaque. A POSIX character class communicates the intent more directly:

if (/[[:alpha:]]+/) { ... }

or with a Unicode property as szbalint suggested

if (/\p{Letter}+/) { ... }
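
To see the difference between these spellings, here is a small sketch. The sample word "Łódź" is an arbitrary choice, written with escapes so the script itself stays ASCII; because the string contains code points above 0xFF, Perl applies Unicode rules to \W.

```perl
use strict;
use warnings;

# "Lodz" with its Polish diacritics, spelled with escapes.
my $word = "\x{141}\x{F3}d\x{17A}";

# Each pattern captures its first run of "letters" in the word.
my ($ascii_only) = $word =~ /([A-Za-z]+)/;
my ($double_neg) = $word =~ /([^\W\d_]+)/;
my ($property)   = $word =~ /(\p{Letter}+)/;

printf "ASCII class: %d char(s); double negative: %d; \\p{Letter}: %d\n",
  length($ascii_only), length($double_neg), length($property);
```

The ASCII-only class stops at the lone "d", while the other two patterns take the whole word.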
Greg Bacon
  • 4
    Clever, but the behavior is very surprising, and I don't see how it's less awkward. – Qwertie Aug 12 '10 at 16:04
  • 8
    @Qwertie: what's surprising? Less awkward than what? – ysth Aug 12 '10 at 16:06
  • How can I nest this expression within another one? E.g. replace "`\s`" with it in `/(\+|0|\()[\d()\s-]{6,20}\d/g`? Thx – Pingui Aug 17 '14 at 17:52
  • 3
    This will certainly meet the needs of the OP and virtually everyone else who searches out this question (English speakers, anyway). But it's still a bad answer. There's simply no excuse for using this solution when `\h` is available. – Alan Moore Jul 06 '16 at 11:44
  • 1
    In Python, make sure you use this with `flags=re.UNICODE`. – Carson Ip Jun 10 '19 at 04:17
  • 1
    VSCode Find and Replace doesn't support \h probably because it's something other than PCRE, but this [nice] answer worked for me, thanks. – aderchox Apr 05 '20 at 18:13
  • If anyone is using VBScript.regexp and is fed a weird mix of spaces, you may wish to list out all the spaces instead with `/[\x09\x20\xA0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u202F\u205F\u3000]/`, as the proposed solution, which should work, didn't work for me – R. Wang Oct 12 '22 at 02:56
231

Perl versions 5.10 and later support the subsidiary vertical and horizontal whitespace character classes, \v and \h, as well as the generic whitespace character class \s.

The cleanest solution is to use the horizontal whitespace character class \h. This will match tab and space from the ASCII set, non-breaking space from extended ASCII, or any of these Unicode characters:

U+0009 CHARACTER TABULATION
U+0020 SPACE
U+00A0 NO-BREAK SPACE (not matched by \s)

U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

The vertical space pattern \v is less useful, but matches these characters

U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0085 NEXT LINE (not matched by \s)

U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR

There are seven vertical whitespace characters which match \v and eighteen horizontal ones which match \h. \s matches twenty-three of these characters.

Every whitespace character is either vertical or horizontal, with no overlap, but \h and \v are not both subsets of \s: \h also matches U+00A0 NO-BREAK SPACE, and \v also matches U+0085 NEXT LINE, neither of which is matched by \s unless Unicode rules are in effect.
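
A quick way to check the classification yourself is a loop like the sketch below. The choice of sample characters and the variable names are mine, not part of the documentation.

```perl
use strict;
use warnings;

# A few characters from the lists above, built with chr() by code point.
my @samples = (
  [ 'TAB'            => chr(0x09)   ],
  [ 'SPACE'          => chr(0x20)   ],
  [ 'LINE FEED'      => chr(0x0A)   ],
  [ 'NO-BREAK SPACE' => chr(0xA0)   ],
  [ 'EM SPACE'       => chr(0x2003) ],
  [ 'LINE SEPARATOR' => chr(0x2028) ],
);

# Report whether each character is horizontal or vertical whitespace.
for my $sample (@samples) {
  my ($name, $char) = @$sample;
  printf "%-15s \\h:%-3s \\v:%s\n", $name,
    ($char =~ /\h/ ? 'yes' : 'no'),
    ($char =~ /\v/ ? 'yes' : 'no');
}
```

Each sample should report "yes" in exactly one of the two columns, which is what makes \h suitable for "whitespace but not newline".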

Borodin
  • 10
    `\h` works only in languages that support `PCRE`. – Avinash Raj Sep 21 '14 at 17:01
  • 18
    @AvinashRaj: This question is about Perl, which certainly supports PCRE – Borodin Sep 21 '14 at 22:36
  • @AleksandrDubinsky this blank POSIX notation `[[:blank:]]` will work on most of the languages. – Avinash Raj Dec 26 '14 at 04:17
  • 2
    @AvinashRaj: Except that `[[:blank:]]` doesn't match no-break space -- ` ` or `"\xA0"` – Borodin Jan 19 '15 at 16:51
  • 6
    Wanna mention that `\h` worked perfectly for my use case which was doing a find/replace in Notepad++ on 1 or more contiguous non-new-line spaces. Nothing else (simple) worked. – squidbe Mar 10 '15 at 20:35
  • 1
    [ICU has `\h`](http://userguide.icu-project.org/strings/regexp), so this is pretty standard. – adib Dec 13 '15 at 09:09
  • 2
    @Borodin POSIX `blank` should match `NO-BREAK SPACE` in any engine that supports Unicode regular expressions. It is defined in [Annex C: Compatibility Properties of Unicode Regular Expressions](http://www.unicode.org/reports/tr18/#blank) – Aleksandr Dubinsky Feb 03 '16 at 17:53
  • 9
    What makes Perl's `\h` slightly non-standard is its inclusion of `MONGOLIAN VOWEL SEPARATOR`. Unicode does not consider it whitespace. For that reason, Perl `\h` differs from POSIX `blank` (`[[:blank:]]` in Perl, `\p{Blank}` in Java) and Java 8 `\h`. Admittedly, it's an edge case. – Aleksandr Dubinsky Feb 03 '16 at 18:07
  • 1
    For more information on what Unicode considers whitespace (and what it doesn't), see the table in https://en.wikipedia.org/wiki/White-space_character – Aleksandr Dubinsky Feb 03 '16 at 18:08
  • A table of which regex engines support `\h` and POSIX `blank`: http://www.regular-expressions.info/refcharclass.html – Aleksandr Dubinsky Feb 03 '16 at 19:24
  • @AleksandrDubinsky: It looks like Perl has been fixed with regard to including `MONGOLIAN VOWEL SEPARATOR` in `\h`. See my revised solution above. I can't see it mentioned in any of the fixes so I can't offer a safe version number, but I will keep looking – Borodin Jul 06 '16 at 12:10
  • Why do I get `Unrecognized escape \h passed through` when I try to use this? – Marcus Aug 28 '18 at 14:15
  • I'm on Perl 5.16.3, regarding `Unrecognized escape \h passed through`. Why? – Marcus Aug 28 '18 at 14:24
  • In atom editor, \h+ can match spaces correctly, but I cannot replace it with a comma for example somehow. Used [^\S\r\n]+ from the answer below eventually. – fstang Mar 29 '19 at 10:53
  • `bad escape \h ` on python :( – john k Dec 15 '21 at 20:39
63

A variation on Greg’s answer that includes carriage returns too:

/[^\S\r\n]/

This regex is safer than /[^\S\n]/ with no \r. My reasoning is that Windows uses \r\n for newlines, and Mac OS 9 used \r. You’re unlikely to find \r without \n nowadays, but if you do find it, it couldn’t mean anything but a newline. Thus, since \r can mean a newline, we should exclude it too.
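
To illustrate the difference, here is a small sketch that strips non-newline whitespace from a Windows-style line. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;

# Trailing space and tab, then a CR LF line ending.
my $dos_line = "value \t\r\n";

# Without \r in the class, the carriage return counts as
# "whitespace but not newline" and gets removed too.
(my $lf_only = $dos_line) =~ s/[^\S\n]+//g;

# With \r excluded as well, the CR LF pair survives intact.
(my $cr_and_lf = $dos_line) =~ s/[^\S\r\n]+//g;

# Print both results with the line endings made visible.
for my $result ($lf_only, $cr_and_lf) {
  (my $visible = $result) =~ s/\r/\\r/g;
  $visible =~ s/\n/\\n/g;
  print "$visible\n";
}
```

The first substitution silently turns the CR LF ending into a bare LF, which is the kind of corruption the \r in the class avoids.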

Community
Rory O'Kane
  • 1
    +1 [Greg's solution](http://stackoverflow.com/a/3469155/175071) ended up corrupting my text, yours worked fine. – Timo Huovinen Jan 31 '14 at 10:46
  • You might be surprised at how many programs still use "\r" for line endings. It sometimes took me a while to figure out that my problem was that the file used these. Or that it used the MacRoman character encoding... – mivk Feb 13 '14 at 20:20
  • 7
    Looks like @Greg first had it "wrong", changed it, and did not credit you. That's why I'm upvoting here. – Andre Elrico Mar 31 '20 at 10:08
18

The regex below matches whitespace characters other than the newline character.

(?:(?!\n)\s)

If you also want to exclude carriage return, add \r next to \n in a character class inside the negative lookahead.

(?:(?![\n\r])\s)

Add + after the non-capturing group to match one or more white spaces.

(?:(?![\n\r])\s)+

I don't know why you people failed to mention the POSIX character class [[:blank:]], which matches any horizontal whitespace (spaces and tabs). This POSIX character class works in BRE (Basic Regular Expressions), ERE (Extended Regular Expressions), and PCRE (Perl Compatible Regular Expressions).
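
As a rough sketch in Perl (the sample string and variable names are illustrative only), the lookahead form and the POSIX class can be compared on ASCII input like this:

```perl
use strict;
use warnings;

# Sample input: a CR LF line followed by an LF line.
my $text = "a \t b\r\nc  d\n";

# Negative lookahead: any \s character that is neither \n nor \r.
(my $via_lookahead = $text) =~ s/(?:(?![\n\r])\s)+/_/g;

# POSIX class: horizontal whitespace (space and tab for ASCII input).
(my $via_blank = $text) =~ s/[[:blank:]]+/_/g;

print $via_lookahead eq $via_blank ? "same result\n" : "different results\n";
```

On this input both substitutions leave the line endings alone. They can differ on other input, since \s also covers form feed and vertical tab, which [[:blank:]] does not.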

Avinash Raj
17

What you are looking for is the POSIX blank character class. In Perl it is referenced as:

[[:blank:]]

in Java (don't forget to enable UNICODE_CHARACTER_CLASS):

\p{Blank}

Compared to the similar \h, POSIX blank is supported by a few more regex engines (reference). A major benefit is that its definition is fixed in Annex C: Compatibility Properties of Unicode Regular Expressions and standard across all regex flavors that support Unicode. (In Perl, for example, \h chooses to additionally include the MONGOLIAN VOWEL SEPARATOR.) However, an argument in favor of \h is that it always detects Unicode characters (even if the engines don't agree on which), while POSIX character classes are often by default ASCII-only (as in Java).

But the problem is that even sticking to Unicode doesn't solve the issue 100%. Consider the following characters, which are not considered whitespace in Unicode:

U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE

The aforementioned Mongolian vowel separator isn't included for what is probably a good reason. It, along with 200C and 200D, occurs within words (AFAIK), and therefore breaks the cardinal rule that all other whitespace obeys: you can tokenize with it. They're more like modifiers. However, ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NO-BREAK SPACE (if it is used as other than a byte-order mark) fit the whitespace rule in my book. Therefore, I include them in my horizontal whitespace character class.

In Java:

public static final String HORIZONTAL_WHITESPACE = "[\\p{Blank}\\u200B\\u2060\\uFEFF]";
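
For Perl readers, a rough counterpart of that constant might look like the sketch below. It assumes \h as the base class instead of \p{Blank}, and the variable names and sample string are hypothetical.

```perl
use strict;
use warnings;

# \h plus ZERO WIDTH SPACE, WORD JOINER, and ZERO WIDTH NO-BREAK SPACE.
my $horizontal_ws = qr/[\h\x{200B}\x{2060}\x{FEFF}]/;

# Words separated by two zero-width characters and one ordinary space.
my $text = "alpha\x{200B}beta\x{2060}gamma delta";

# Tokenize on runs of the extended horizontal whitespace class.
my @tokens = split /$horizontal_ws+/, $text;

print scalar(@tokens), " tokens: @tokens\n";
```

Splitting on the extended class tokenizes at the zero-width characters as well as at the ordinary space.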
Community
Aleksandr Dubinsky
  • You need to add the appropriate regexp compile flags to the Java compilation, and be running Java 7 or later. In any event, the question was not about Java or PCRE at all, so this is all immaterial. – tchrist Sep 21 '14 at 05:16
  • @tchrist Thank you for pointing this out. I will update my answer. I disagree, though, that my answer is irrelevant. What is immaterial is the `perl` tag in the original question. – Aleksandr Dubinsky Sep 21 '14 at 09:58
  • 1
    @AleksandrDubinsky, \p{Blank} is not supported in JavaScript, so definitely not "standard to all regex flavors" -1 – Valentin V Apr 24 '15 at 08:52
  • Most informative. I find it disturbing to know that a general and complete "horizontal whitespace" shorthand character class does not exist, and that horrors like `[\p{Blank}\u200b\u180e]` are required. Admittedly, it makes sense that a vowel separator is *not* considered a whitespace character, but why zero-width space is not in classes like `\s` and `\p{Blank}`, beats me. – Timo Jul 13 '15 at 13:25
  • Follow-up: I read that both are considered 'boundary neutral', although that doesn't explain *why*. – Timo Jul 13 '15 at 13:33
4

Put the regex below in the find section and select Regular Expression from "Search Mode":

[^\S\r\n]+
Hasan Zafari
-4

Just put a literal space between the slashes, as in m/ /g, and it will match a space. Alternatively, use \s, which matches every whitespace character (tabs, newlines, spaces, and so on); note that this includes newlines.

Amal Murali