15

I have seen a few similar questions, but this is what I am trying to achieve.

Given a string, str="The moon is our natural satellite, i.e. it rotates around the Earth!", I want to extract the words and store them in an array. The expected array elements would be:

the 
moon 
is 
our 
natural 
satellite 
i.e. 
it  
rotates 
around 
the 
earth

I tried using String.split( ','\t','\r') but this does not work correctly. I also tried removing periods, commas, and other punctuation marks, but I want a string like "i.e." to be kept intact as well. What is the best way to achieve this? I also tried Regex.Split, to no avail:

string[] words = Regex.Split(line, @"\W+");

Would surely appreciate some nudges in the right direction.

Richard N

4 Answers

39

A regex solution.

(\b[^\s]+\b)

And if you really want to keep that trailing . on i.e., you could use this.

((\b[^\s]+\b)((?<=\.\w).)?)

Here's the code I'm using.

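  // Requires: using System; using System.Text.RegularExpressions;
  var input = "The moon is our natural satellite, i.e. it rotates around the Earth!";
  // \b[^\s]+\b grabs each run of non-whitespace that starts and ends on a word boundary;
  // the optional group re-attaches one trailing character when the match ends in ".<letter>", as in "i.e."
  var matches = Regex.Matches(input, @"((\b[^\s]+\b)((?<=\.\w).)?)");

  foreach (Match match in matches)
  {
     Console.WriteLine(match.Value);
  }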

Results:

The
moon
is
our
natural
satellite
i.e.
it
rotates
around
the
Earth
TheCodeKing
9

I suspect the solution you're looking for is much more complex than you think. You're looking for some form of actual language analysis, or at a minimum a dictionary, so that you can determine whether a period is part of a word or ends a sentence. Have you considered the fact that it may do both?

Consider adding a dictionary of allowed "words that contain punctuation." This may be the simplest way to solve your problem.
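As a rough illustration of the dictionary idea, here is a minimal sketch. The `WordExtractor` class, its `Extract` method, and the contents of the allow-list are all hypothetical names and values chosen for this example, not anything from the question. It splits on whitespace, then peels punctuation off each token unless what remains is a known punctuated word.

```csharp
using System;
using System.Collections.Generic;

public static class WordExtractor
{
    // Hypothetical allow-list of "words that contain punctuation"; extend as needed.
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "i.e.", "e.g.", "etc." };

    public static List<string> Extract(string text)
    {
        var words = new List<string>();
        var separators = new[] { ' ', '\t', '\r', '\n' };

        foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Skip leading punctuation such as an opening quote or parenthesis.
            int start = 0;
            while (start < token.Length && char.IsPunctuation(token[start]))
            {
                start++;
            }
            var candidate = token.Substring(start);

            // Peel trailing punctuation one character at a time, stopping as soon
            // as the remainder is an allowed word (so "i.e.," becomes "i.e.").
            while (candidate.Length > 0
                   && !Allowed.Contains(candidate)
                   && char.IsPunctuation(candidate[candidate.Length - 1]))
            {
                candidate = candidate.Substring(0, candidate.Length - 1);
            }

            if (candidate.Length > 0)
            {
                words.Add(candidate);
            }
        }
        return words;
    }
}
```

Calling `WordExtractor.Extract(...)` on the sentence from the question should yield the twelve words with "i.e." intact. It does not resolve the ambiguity mentioned above: an allowed word such as "etc." at the end of a sentence keeps its period either way.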

Greg D
  • Regex does this with `\b` so you don't have to; admittedly there are some grey areas. For instance, `i.e.` will match as `i.e`. – TheCodeKing Sep 05 '11 at 19:23
1

This works for me.

var str="The moon is our natural satellite, i.e. it rotates around the Earth!";
var a = str.Split(new char[] {' ', '\t'});
for (int i=0; i < a.Length; i++)
{
    Console.WriteLine(" -{0}", a[i]);
}

Results:

 -The
 -moon
 -is
 -our
 -natural
 -satellite,
 -i.e.
 -it
 -rotates
 -around
 -the
 -Earth!

You could do some post-processing of the results, removing commas, semicolons, etc.
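One possible shape for that post-processing step, as a sketch only: the `SplitAndTrim` class and the choice of characters in `EdgePunctuation` are assumptions made for this example. The period is deliberately left out of the trim set so that "i.e." survives, which also means a sentence-ending period would stay attached to its word.

```csharp
using System;
using System.Linq;

public static class SplitAndTrim
{
    // Characters stripped from the edges of each token. The period is left out
    // on purpose (an assumption of this sketch) so that "i.e." is not damaged.
    private static readonly char[] EdgePunctuation =
        { ',', ';', ':', '!', '?', '"', '(', ')' };

    public static string[] Words(string text)
    {
        return text
            .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.Trim(EdgePunctuation))   // drop edge punctuation
            .Where(token => token.Length > 0)                // skip tokens that were only punctuation
            .ToArray();
    }
}
```

With the sentence from the question, `SplitAndTrim.Words(str)` should turn "satellite," into "satellite" and "Earth!" into "Earth" while leaving "i.e." as it is.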

Uwe Keim
Cheeso
1
Regex.Matches(input, @"\b\w+\b").OfType<Match>().Select(m => m.Value)
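For context, the one-liner above can be wrapped as follows; the `WordCharMatcher` class name is made up for this sketch. Note that `\w` does not match a period, so this pattern treats "i.e." as two separate words, "i" and "e", which may or may not be acceptable for the question as asked.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCharMatcher
{
    public static string[] Words(string input)
    {
        // \b\w+\b matches each maximal run of word characters (letters, digits, underscore).
        return Regex.Matches(input, @"\b\w+\b")
            .OfType<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```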
Kirill Polishchuk