How do I split a phrase into words using Regex in C#

Question

I am trying to split a sentence/phrase into words using Regex.

var phrase = "This isn't a test.";
var words = Regex.Split(phrase, @"\W+").ToList();

words contains "This", "isn", "t", "a", "test"

Obviously it's treating the apostrophe as a separator and splitting on it. Can I change this behavior? It also needs to be multilingual, supporting a variety of languages (Spanish, French, Russian, Korean, etc.).

I need to pass the words into a spellchecker, specifically NHunspell.

return (from word in words let correct = _engine[langId].Spell(word) where !correct select word).ToList();

Try splitting on spaces instead? Do you have a good sample of use-cases to demonstrate what this Regex needs to handle? – mellamokb Apr 20 '12 at 02:41
I'm passing the words into a spellchecker, so I need to lose the punctuation. – Dean Apr 20 '12 at 02:45
Since you want to split text in a number of different languages, you'll need a tokenizer that understands those languages. In your example, isn't is clearly a word, but in another language the ' might not normally be part of the word. Most spell-checking libraries therefore come with a tokenizer or parser that can do this job for you. – jessehouwing Apr 20 '12 at 09:28
Chinese, for example, doesn't have spaces at all: http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html – jessehouwing Apr 20 '12 at 09:35

score 11 · Accepted Answer · answered Apr 20 '12 at 04:07

If you want to split into words for spell checking purposes, this is a good solution:

new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*")

Basically, you can pass the regex above to Regex.Split. It uses Unicode category syntax, so it works for many languages (though not for most Asian languages). It also won't break words that contain apostrophes or hyphens.
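As a rough sketch of how this could be wired up: the separator pattern only consumes punctuation that sits next to a separator, so a token at the very start or end of the phrase can keep its punctuation ("test." in the question's example). The version below trims that off afterwards. The class name WordSplitter and the second pattern are my own assumptions for illustration, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordSplitter
{
    // Separator pattern from the answer: a Unicode separator (\p{Z})
    // together with any non-letter characters directly around it.
    private static readonly Regex Separator =
        new Regex(@"[^\p{L}]*\p{Z}[^\p{L}]*");

    // Assumption: strip non-letters left at the start or end of a token,
    // e.g. the final "." or a leading "¿".
    private static readonly Regex EdgeNoise =
        new Regex(@"^[^\p{L}]+|[^\p{L}]+$");

    public static List<string> Split(string phrase)
    {
        return Separator.Split(phrase)
            .Select(token => EdgeNoise.Replace(token, ""))
            .Where(token => token.Length > 0) // drop tokens with no letters
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Split("This isn't a test.")));
        Console.WriteLine(string.Join("|", Split("¿Cómo estás?")));
    }
}
```

Note that tokens made up only of digits or punctuation are dropped entirely by this sketch, which seems acceptable for spell checking but may not suit other uses.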

jessehouwing · Answer 2 · 2022-01-31T12:19:19.837

Because many languages use very complex rules to string words together into phrases and sentences, you can't rely on a simple regular expression to get all the words from a piece of text. Even for a language as 'simple' as English you'll run into a number of corner cases, such as:

Words like you're and isn't, where two words are combined and some characters are replaced with an apostrophe.
Abbreviations such as Mr., Mrs. and i.e.
Words combined using '-'.
Hyphenated words at the end of a sentence.
Names like O'Brian and O'Connel.

Chinese and Japanese (among others) are notoriously hard to parse this way, as these languages do not use spaces between words, only between sentences.

You might want to read up on text segmentation. If the segmentation is important to you, invest in a spell checker that can parse a whole text, or in a text segmentation engine that can split your sentences into words according to the rules of the language.

I couldn't find a .NET based multi-lingual segmentation engine with a quick google search though. Sorry.

Jack · Answer 3 · 2012-04-20T03:16:55.367

3

Use Split().

words = phrase.Split(' ');

Without punctuation.

words = phrase.Split(new char[] {' ', ',', '.', ':', ';', '!', '?', '\t'});
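One thing to watch with this approach: when two separators are adjacent, or the phrase ends with one, Split returns empty strings between them. A minimal sketch that filters those out with StringSplitOptions.RemoveEmptyEntries (the class name PunctuationSplit is just for illustration):

```csharp
using System;

public static class PunctuationSplit
{
    // Same separator set as above; extend it for other languages as needed.
    private static readonly char[] Separators =
        { ' ', ',', '.', ':', ';', '!', '?', '\t' };

    public static string[] Split(string phrase)
    {
        // RemoveEmptyEntries drops the empty strings produced by adjacent
        // separators (", ") or by a separator at the end of the phrase.
        return phrase.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Split("This isn't a test.")));
    }
}
```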

edited Apr 20 '12 at 03:16

answered Apr 20 '12 at 02:41

Jack


score 1 · Answer 4 · answered Apr 20 '12 at 02:42

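What do you want to split on? Spaces? Punctuation? You have to decide what the stop characters are. A simple regex that treats whitespace and a few punctuation characters as stops would be "[^.?!\s]+". It matches runs of characters that are not a period, question mark, exclamation mark, or whitespace, so the words are the matches (for example via Regex.Matches) and the text is effectively split on those stop characters.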
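Since this pattern describes the words rather than the separators, one way to apply it is Regex.Matches instead of Regex.Split. A small sketch, with StopCharTokenizer as a made-up name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopCharTokenizer
{
    // Each match is a run of characters that are NOT stop characters.
    private const string WordPattern = @"[^.?!\s]+";

    public static List<string> Tokenize(string phrase)
    {
        return Regex.Matches(phrase, WordPattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Tokenize("Is this a test? Yes!")));
    }
}
```

Any punctuation that is not in the stop list, such as a comma, stays attached to the word, so the list needs to grow to cover whatever you want stripped.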

Jim Mischel


I also need to consider Spanish which will have exclamations and questions that are upside down. – Dean Apr 20 '12 at 02:47
Then add those characters to the list of characters inside the `[]` and after the `^`. So, for example `"[^.?!¿¡\s]"`. You'll probably want to add parentheses, comma, semicolon, and many other punctuation characters. That list is the characters that you *don't want* in your words. The `^` at the start means "not these characters." So you'll need to add the caret (^) character to the list, too. – Jim Mischel Apr 20 '12 at 02:50
Ok, I'm going to see what I can do about getting a list of punctuation. I like this approach. – Dean Apr 20 '12 at 02:56

David Z. · Answer 5 · 2012-04-20T03:36:10.847

If you only want to split on spaces, you can try:

var words = Regex.Split(phrase, @" +").ToList();

The other approach is to keep the apostrophe as part of a word by excluding it from the set of characters you split on:

var words = Regex.Split(phrase, @"[^\w']+").ToList();

Otherwise, is there a specific reason that you cannot use string.Split()? That would seem much more straightforward. You would also be able to pass in other punctuation characters (e.g. split on . as well as spaces).

var words = phrase.Split(' ');
var words = phrase.Split(new char[] {' ', '.'});

score 0 · Answer 6 · answered Apr 20 '12 at 02:42

It doesn't really seem like you need a regex. You could just do:

phrase.Split(' ');

Michael Frederick


Only if you want punctuation in your words. – Jim Mischel Apr 20 '12 at 02:54

score 0 · Answer 7 · answered Apr 20 '12 at 03:51

I'm not a java person but you could try to exclude punctuation while splitting on
spaces at the same time. Something like this maybe.

These are raw and expanded regexes, the words are in capture group 1.
Do a global search.

Unicode (doesn't account for graphemes)

[\s\pP]* ([\pL\pN_-] (?: [\pL\pN_-] | \pP(?=[\pL\pN\pP_-]) )* )

Ascii

[\s[:punct:]]* (\w (?: \w | [[:punct:]](?=[\w[:punct:]]) )* )
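For C#, the Unicode version could be adapted roughly as follows. As far as I know, .NET wants braces on the category escapes (\p{L} rather than \pL) and has no POSIX classes like [:punct:], so this sketch only covers the Unicode variant, written without the expanded whitespace. The class name CaptureGroupTokenizer is an assumption for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CaptureGroupTokenizer
{
    // Skip leading whitespace/punctuation, then capture a word in group 1.
    // Punctuation inside a word is kept only when it is followed by
    // another word character or more punctuation.
    private static readonly Regex Word = new Regex(
        @"[\s\p{P}]*([\p{L}\p{N}_-](?:[\p{L}\p{N}_-]|\p{P}(?=[\p{L}\p{N}\p{P}_-]))*)");

    public static List<string> Tokenize(string phrase)
    {
        // Matches() performs the "global search"; the word is in group 1.
        return Word.Matches(phrase)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("|", Tokenize("This isn't a test.")));
    }
}
```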

score 0 · Answer 8 · answered Jul 31 '13 at 16:47

This worked for me: [^(\d|\s|\W)]*

maiconmm
