
Prelude

I am trying to perform an operation that requires me to parse every individual word in a particular file. The most straightforward way of doing this would be to load the text using:

$content = Get-Content -Path .\<filename>

Then I will break every individual word onto its own line (this allows me to do a word count AND a single-word search very quickly). The problem arises when I then use this line of code:

$content.split("\s+")

which should create a new line (split) on every (one or more) whitespace character. Unfortunately, my results look like this:

$content.split("\s+")
The SpeechSynthe
izer cla

provide
acce

 to the functionality of a 
peech 
ynthe
 i
  engine that i
  in
talled on the ho
t computer. In
talled 
peech 
ynthe
 i
 engine

But when I run

$content -split("\s+")

The results will come out correctly:

$content -split("\s+")
The
SpeechSynthesizer
class
provides
access
to
the
functionality
of
a
speech
synthesis

My question

Using PowerShell v4, I am having trouble understanding the difference between

$content.split("\s+")

and

$content -split("\s+")

and why they output different results.

Is that functionality just broken?

Is there some other difference that I am not aware of at play here?

1 Answer


See the Powershelladmin wiki:

The -split operator takes a regular expression, and to split on an arbitrary amount of whitespace, you can use the regexp "\s+".

And

To split on a single, or multiple, characters, you can also use the System.String object method Split().

PS C:\> 'a,b;c,d'.Split(',') -join ' | '
a | b;c | d
PS C:\> 'a,b;c,d'.Split(',;') -join ' | '
a | b | c | d

So, with $content.split("\s+") you passed the individual characters to split on (\, s and +), not a regex that matches whitespace.

In $content -split("\s+"), \s+ is a regex pattern matching one or more whitespace characters.
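The difference can be sketched side by side as follows. This is only an illustration: the sample string is hypothetical, and the separator characters are passed as an explicit [char[]] so that the method call behaves as it did in the question (as far as I know, newer PowerShell versions built on .NET Core may bind .Split("\s+") to a string-separator overload instead).

```powershell
# Hypothetical sample text containing a tab and a double space.
$text = "The SpeechSynthesizer class`tprovides  access"

# -split treats its right-hand side as a regex: \s+ = one or more whitespace characters.
$byRegex = $text -split '\s+'

# String.Split() with a character array: the separators are the literal
# characters '\', 's' and '+', which is what happened in the question.
$byChars = $text.Split([char[]]'\s+')

# String.Split() with real whitespace characters; RemoveEmptyEntries drops the
# empty strings that consecutive separators would otherwise produce.
$byWhitespace = $text.Split([char[]]" `t`r`n", [System.StringSplitOptions]::RemoveEmptyEntries)

$byRegex -join ' | '        # The | SpeechSynthesizer | class | provides | access
$byChars -join ' | '        # fragments broken at every lowercase 's'
$byWhitespace -join ' | '   # The | SpeechSynthesizer | class | provides | access
```

If the method syntax is preferred, the last form is one way to get whitespace splitting without a regex; otherwise the -split operator is the shorter option.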

Wiktor Stribiżew
  • So you're saying that the System.String method does not support regex, and therefore "\s+" does not mean *one or more whitespace* in the context of `$content.split("\s+")`. Correct? – Get-HomeByFiveOClock Oct 01 '15 at 14:17
  • Note that it is not just me saying it; your example speaks for itself: all `s` letters were split on (*Synthe**s**izer*, etc.). Your `$content.split("\s+")` splits on `s` and `+` (I guess ``\`` is treated as an invalid escape symbol and is ignored). Try it yourself. – Wiktor Stribiżew Oct 01 '15 at 14:20
  • Now that you point that out, I can see that is exactly what it is splitting on! I also tested with `$content.split(" ")` and it works exactly as expected. Thank you, sir! – Get-HomeByFiveOClock Oct 01 '15 at 14:25
  • I just came across [this post](http://stackoverflow.com/questions/29459813/is-there-a-way-to-escape-a-string-in-powershell-like-string-in-c-sharp); it says that all strings are verbatim string literals, so `$content.split("\s+")` should split on ``\`` too. – Wiktor Stribiżew Oct 01 '15 at 14:29
  • I just tested and confirmed your statement above, @stribizhev: the "\" character also splits the line. – Get-HomeByFiveOClock Oct 01 '15 at 14:58
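
The conclusion of the comments above can be checked with a short sketch. The test string is made up for illustration, and the separators are cast to [char[]] explicitly so the result does not depend on which Split() overload a given PowerShell version selects.

```powershell
# The backslash has no special meaning in PowerShell strings (the backtick is
# the escape character), so "\s+" is three literal characters.
$separators = [char[]]"\s+"
$separators.Length                          # 3

# Hypothetical test string containing each separator character exactly once.
$parts = 'a\b+csd'.Split($separators)
$parts -join ' | '                          # a | b | c | d
```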