3

Introduction/Question:

I have been studying the use of Regular Expressions (using VBA/Excel), and so far I cannot understand how I would isolate a <space> (or " ") using regexp from other white space characters that are included in \s. I thought that I would be able to use \p{Zs}, but in my testing so far, it has not worked out. Could someone please correct my misunderstanding? I appreciate any helpful input.

To offer proper credit, I modified some code that started off as a very helpful post by @Portland Runner that is found here: How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops

This has been my approach/study so far:

Using the string "14z-16z Flavored Peanuts", I've been trying to write a RegExp which removes "14z-16z " and leaves only "Flavored Peanuts". I initially used ^[0-9](\S)+ as strPattern and a sub procedure with following snippets:

Sub REGEXP_TEST_SPACE()

Dim strPattern As String
Dim strReplace As String
Dim strInput As String
Dim regEx As New RegExp

strInput = "14z-16z Flavored Peanuts"
strPattern = "^[0-9](\S)+"
strReplace = ""

With regEx
    .Global = True
    .MultiLine = True
    .IgnoreCase = True
    .pattern = strPattern
End With

If regEx.Test(strInput) Then
    Range("A1").Value = regEx.Replace(strInput, strReplace)
End If

End Sub

This approach gave me an A1 value of " Flavored Peanuts" (note the leading <space> in that string).

I then changed strPattern = "^[0-9](\S)+(\s)" (added the (\s)), which gave me the desired A1 value of "Flavored Peanuts". Great!!! I got the desired output!

But as I understand it, \s represents all white-space characters, equal to [ \f\n\r\t\v]. In this case, I know that the character is just a normal, single space -- I don't need carriage return, horizontal tab, etc. So I tried to see if I could just isolate the <space> character in regex (unicode separator: space), which I believe is \p{Zs} (e.g., strPattern = "^[0-9](\S)+(\p{Zs})"). Using this pattern, however, doesn't return a match whatsoever, nevermind removing the leading space. I also tried the more general \p{Z} (all unicode separators), but that didn't work either.

Clearly I have missed something in my study. Help is both desired and appreciated. Thank you.

Community
  • 1
  • 1
ScottyJ
  • 945
  • 11
  • 16
  • A space is a space. Use `[ ]` (with or without brackets) to match a literal space. Hex `\x20` [ascii](http://www.asciitable.com/) and [utf-8](http://www.fileformat.info/info/unicode/char/0020/index.htm). The [negation](http://www.regular-expressions.info/charclass.html#negated) `[^ ]` matches a character, that is not a literal space :) – Jonny 5 Feb 19 '15 at 22:28
  • 1
    Thank you, Jonny 5. Your answer is helpful. See comment under @phrebh's answer below. – ScottyJ Feb 20 '15 at 14:06

3 Answers3

4

Since you are trying to find a correspondence with the \p{Zs} Unicode category class, you might want to also handle all hard spaces. This code will be helpful:

strPattern = "^[0-9](\S)+[ " & ChrW(160) & "]"

Or,

strPattern = "^[0-9](\S+)[ \x0A]"

The [ \x0A] character class will match either a regular space or a hard, non-breaking space.

If you need to match all kinds of spaces, you can use this regex pattern taken based on the information on https://www.cs.tut.fi/~jkorpela/chars/spaces.html:

strPattern = "^[0-9](\S)+[ \xA0\u1680\u180E\u2000-\u200B\u202F\u205F\u3000\uFEFF]"

This is the table with code point explanations:

U+0020  32  SPACE   foo bar Depends on font, typically 1/4 em, often adjusted
U+00A0  160 NO-BREAK SPACE  foo bar As a space, but often not adjusted
U+1680  5760    OGHAM SPACE MARK    foo bar Unspecified; usually not really a space but a dash
U+180E  6158    MONGOLIAN VOWEL SEPARATOR   foo᠎bar No width
U+2000  8192    EN QUAD foo bar 1 en (= 1/2 em)
U+2001  8193    EM QUAD foo bar 1 em (nominally, the height of the font)
U+2002  8194    EN SPACE    foo bar 1 en (= 1/2 em)
U+2003  8195    EM SPACE    foo bar 1 em
U+2004  8196    THREE-PER-EM SPACE  foo bar 1/3 em
U+2005  8197    FOUR-PER-EM SPACE   foo bar 1/4 em
U+2006  8198    SIX-PER-EM SPACE    foo bar 1/6 em
U+2007  8199    FIGURE SPACE    foo bar “Tabular width”, the width of digits
U+2008  8200    PUNCTUATION SPACE   foo bar The width of a period “.”
U+2009  8201    THIN SPACE  foo bar 1/5 em (or sometimes 1/6 em)
U+200A  8202    HAIR SPACE  foo bar Narrower than THIN SPACE
U+200B  8203    ZERO WIDTH SPACE    foo​bar Nominally no width, but may expand
U+202F  8239    NARROW NO-BREAK SPACE   foo bar Narrower than NO-BREAK SPACE (or SPACE)
U+205F  8287    MEDIUM MATHEMATICAL SPACE   foo bar 4/18 em
U+3000  12288   IDEOGRAPHIC SPACE   foo bar The width of ideographic (CJK) characters.
U+FEFF  65279   ZERO WIDTH NO-BREAK SPACE

Best regards.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Is the method I'm attempting (\p{Zs}) not valid in VBA? You mention C#. I got that method from a MS site referencing .NET and thought it was valid for VBA, too, but it sounds like I'm crossing languages. I just want to clearly understand why \p{Zs} does not work. I understand now about the literals that I can/should use. – ScottyJ Feb 20 '15 at 00:03
  • The named character class \p{Zs} is only valid for .NET regex flavor. VBA has nothing to do with MS .NET framework, it utilizes Microsoft VBScript Regular Expressions library that does not support namded character classes. There are other limitations, please see regular-expressions.info/vb.html, and if you have time http://www.experts-exchange.com/Programming/Languages/Visual_Basic/A_1336-Using-Regular-Expressions-in-Visual-Basic-for-Applications-and-Visual-Basic-6.html. Regex flavors really differ :( – Wiktor Stribiżew Feb 20 '15 at 07:35
  • This one worked for me in Word/Excel: [ " & ChrW(160) & "] – Andrew Cowenhoven Mar 17 '21 at 13:14
2

You can explicitly include a white space in your RegEx pattern. The following pattern works just fine

strPattern = "^[0-9](\S)+ "
Jeanno
  • 2,769
  • 4
  • 23
  • 31
1

Just use a literal space character: strPattern = "^[0-9](\S)+ ".

phrebh
  • 159
  • 2
  • 4
  • 13
  • Thank you, phrebh. I will end up using your (and @Jeanno's and @Jonny 5) suggestion about the literals, which hadn't occurred to me, but I accepted @stribizhev's solution because I wanted to also understand why the approach I was trying did not work, specifically with the `\p{Zs}` which I mistakenly lifted from .NET. I appreciate everyone trying to help me. Thank you. – ScottyJ Feb 20 '15 at 13:46