31

How to split text into words?

Example text:

'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'

The words in that line are:

  1. Oh
  2. you
  3. can't
  4. help
  5. that
  6. said
  7. the
  8. Cat
  9. we're
  10. all
  11. mad
  12. here
  13. I'm
  14. mad
  15. You're
  16. mad
Colonel Panic
  • My advice: begin by defining an unambiguous lexical grammar and then write a lexer for that grammar that produces a sequence of tokens. Then reject the tokens that are not lexed into the "word" production. This isn't a job for regular expressions. – Eric Lippert May 24 '13 at 02:20
  • I really like Eric's response. I know I'm a little late to the party, but it's the best way to go. – Maurice Reeves May 24 '13 at 12:45
  • I've collected all the delimiters above and came up with something like this: result.Split({ " '" , " " , ",'" , ": '" , "." , ".'" }, StringSplitOptions.RemoveEmptyEntries); – Ramgy Borja Aug 02 '17 at 05:57

7 Answers

59

Split text on whitespace, then trim punctuation.

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));

Agrees exactly with example.

Colonel Panic
26

First, remove all special characters:

var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex doesn't support apostrophe so the extension method is better

Then split it:

var split = fixedInput.Split(' ');

For a simpler C# solution for removing special characters (one that you can easily change), add this extension method (I added support for the apostrophe):

public static string RemoveSpecialCharacters(this string str) {
   var sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

Then use it like so:

var words = input.RemoveSpecialCharacters().Split(' ');

You may be surprised to learn that this extension method is very efficient (surely much more efficient than the Regex), so I suggest you use it ;)

Update

I agree that this is an English-only approach, but to make it Unicode-compatible all you have to do is replace:

(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')

With:

char.IsLetter(c)

char.IsLetter supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for other cases.
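A minimal sketch of that Unicode-aware variant, assuming digits should still be kept, so char.IsLetterOrDigit stands in for the three range checks. The names UnicodeWordExtensions and RemoveSpecialCharactersUnicode are hypothetical and not part of the original answer.

```csharp
using System.Text;

public static class UnicodeWordExtensions
{
    // Hypothetical variant of RemoveSpecialCharacters: keeps any Unicode
    // letter or digit, plus apostrophes and spaces, and drops everything else.
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (char.IsLetterOrDigit(c) || c == '\'' || c == ' ')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

It is used the same way as before: input.RemoveSpecialCharactersUnicode().Split(' '). Because it still works one char at a time, it keeps every apostrophe (including quote marks) and does not treat combining marks specially.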

Adam Tal
  • I don't think digits are a part of a word --but I guess that is up to the OP – Hogan May 24 '13 at 00:07
  • I guess it is up to him; he can change the regex as he wishes. – Adam Tal May 24 '13 at 00:08
  • The only issue I see is your solution will trim the apostrophe off of contractions. Ex. changing "isn't" to "isnt" – Michael La Voie May 24 '13 at 00:08
  • Yap, I saw it too, and while you wrote your comment, I improved my solution. – Adam Tal May 24 '13 at 00:11
  • Seems like he's just looking for a quick word count, s.Split(' ').Length – Chris Moschini May 24 '13 at 00:12
  • Whereas that works for English text, it's not a good solution for Unicode text in general. When you step into general Unicode text, the character-by-character technique you use in the extension method can fail because it doesn't take into account combining characters and such. – Jim Mischel May 24 '13 at 01:40
  • By the way, if you're going to the trouble of parsing character-by-character, you might as well build the array while you're at it. It'd be just a few more lines of code. – Jim Mischel May 24 '13 at 01:51
  • @JimMischel, you could replace the `foreach` loop with a `StringInfo.GetTextElementEnumerator` to make it more Unicode-friendly, though I imagine considering the Unicode case also requires a much wider set of "special" characters to remove. – nicholas May 24 '13 at 01:54
  • @JimMischel, since I agree with the Unicode problem you described, I updated my answer to show how you can support those cases. – Adam Tal May 24 '13 at 05:33
  • Wouldn't your extension method remove the spaces also so there would be nothing to split? You would just get a long concatenated string of words. – Vladimir Apr 06 '16 at 18:47
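The two suggestions in the comments above (walk text elements rather than chars, and build the word list in the same pass) could be combined roughly as follows. This is only a sketch under stated assumptions: the class and method names are hypothetical, a text element is classified by its first code point, and apostrophes are kept inside words but trimmed from their ends.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementWordSplitter
{
    // Hypothetical helper: single pass over text elements, collecting words directly.
    public static List<string> SplitWords(string input)
    {
        var words = new List<string>();
        var current = new StringBuilder();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(input);

        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            // Classify by the element's first code point; any combining
            // marks attached to it stay with the element.
            if (char.IsLetterOrDigit(element, 0) || element == "'")
            {
                current.Append(element);
            }
            else
            {
                Flush(current, words);
            }
        }
        Flush(current, words);
        return words;
    }

    // Adds the pending word, minus leading/trailing apostrophes (quote marks).
    private static void Flush(StringBuilder current, List<string> words)
    {
        string word = current.ToString().Trim('\'');
        if (word.Length > 0)
        {
            words.Add(word);
        }
        current.Clear();
    }
}
```

Trimming apostrophes at the ends also drops a genuine trailing apostrophe (as in the possessive "dogs'"), so this is a trade-off rather than a complete answer to the apostrophe problem.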
8

Just to add a variation on @Adam Fridental's answer which is very good, you could try this Regex:

var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";

var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");

foreach (Match match in matches) {
    var word = match.Value;
}

I believe this is the shortest RegEx that will get all the words

\w+[^\s]*\w+|\w
wake-0
Michael La Voie
  • Nice. But as I stated in my answer, there is one thing that's problematic when solving this with regex: the time it takes. I've checked, and the extension method I wrote in my answer is ~7x faster than the regular expression parsing. – Adam Tal May 24 '13 at 00:17
  • Thanks for profiling them, I learned something new today :) You have my upvote. I'd keep arguing (as is my nature) for Regex to reduce code complexity, but your method is pretty short too and most people don't find regex as friendly as I do. Oh well. – Michael La Voie May 24 '13 at 00:28
  • I agree that Regex is great. When you have a second to wait :) – Adam Tal May 24 '13 at 00:30
  • This is the better solution, in general, because it will handle any Unicode word character. You can modify it to handle apostrophes and digits, as well, if that's required. Although the apostrophe can be decidedly difficult to handle correctly. The faster method is nice if you can guarantee English text, but it fails horribly otherwise. – Jim Mischel May 24 '13 at 01:47
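One way the regex might be modified to handle apostrophes, as the last comment suggests, is sketched below. It assumes a word is a run of Unicode letters, optionally joined by straight apostrophes (U+0027), and that digits are not part of a word. The pattern and the names RegexWordSplitter and SplitWords are illustrative, not taken from the answer.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexWordSplitter
{
    // \p{L}+ matches a run of Unicode letters; (?:'\p{L}+)* allows apostrophes
    // only when letters follow, so surrounding quote marks are never captured.
    private static readonly Regex WordPattern = new Regex(@"\p{L}+(?:'\p{L}+)*");

    public static List<string> SplitWords(string text)
    {
        var words = new List<string>();
        foreach (Match match in WordPattern.Matches(text))
        {
            words.Add(match.Value);
        }
        return words;
    }
}
```

Typographic apostrophes (U+2019) and words that end in an apostrophe are not covered by this pattern, which illustrates why the apostrophe is hard to get right.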
1

If you don't want to use a Regex object, you could do something like...

string mystring = "Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
List<string> words = mystring.Replace(",", "").Replace(":", "").Replace(".", "").Split(' ').ToList();

You'll still have to handle the trailing apostrophe at the end of "that,'"

mason
1

This is one possible solution; I don't use any helper classes or methods.

        public static List<string> ExtractChars(string inputString) {
            var result = new List<string>();
            int startIndex = -1;
            for (int i = 0; i < inputString.Length; i++) {
                var character = inputString[i];
                if ((character >= 'a' && character <= 'z') ||
                    (character >= 'A' && character <= 'Z')) {
                    if (startIndex == -1) {
                        startIndex = i;
                    }
                    if (i == inputString.Length - 1) {
                        result.Add(GetString(inputString, startIndex, i));
                    }
                    continue;
                }
                if (startIndex != -1) {
                    result.Add(GetString(inputString, startIndex, i - 1));
                    startIndex = -1;
                }
            }
            return result;
        }

        public static string GetString(string inputString, int startIndex, int endIndex) {
            string result = "";
            for (int i = startIndex; i <= endIndex; i++) {
                result += inputString[i];
            }
            return result;
        }
toannm
1

If you want to use a for loop to check each char and keep all the punctuation from the input string, I've created this class. The method GetSplitSentence() returns a list of SentenceSplitResult containing all the words as well as all the punctuation and numbers. Each punctuation mark or digit is stored as its own item in the list. The field sentenceSplitResult.isAWord tells you whether an item is a word or not.

public class SentenceSplitResult
{
    public string word;
    public bool isAWord;
}

public class StringsHelper
{

    private readonly List<SentenceSplitResult> outputList = new List<SentenceSplitResult>();

    private readonly string input;

    public StringsHelper(string input)
    {
        this.input = input;
    }

    public List<SentenceSplitResult> GetSplitSentence()
    {
        StringBuilder sb = new StringBuilder();

        try
        {
            if (String.IsNullOrEmpty(input)) {
                Logger.Log(new ArgumentNullException(), "GetSplitSentence - input is null or empty");
                return outputList;                    
            }

            bool isAletter = IsAValidLetter(input[0]);

            // Each char is checked to see whether it is part of a word.
            // If YES > store the char for later.
            // If NO > save the pending word (if any), then save the punctuation.
            foreach (var _char in input)
            {
                isAletter = IsAValidLetter(_char);

                if (isAletter == true)
                {
                    sb.Append(_char);
                }
                else
                {
                    SaveWord(sb.ToString());
                    sb.Clear();
                    SaveANotWord(_char);                        
                }
            }

            SaveWord(sb.ToString());

        }
        catch (Exception ex)
        {
            Logger.Log(ex);
        }

        return outputList;

    }

    private static bool IsAValidLetter(char _char)
    {
        if ((Char.IsPunctuation(_char) == true) || (_char == ' ') || (Char.IsNumber(_char) == true))
        {
            return false;
        }
        return true;
    }

    private void SaveWord(string word)
    {
        if (String.IsNullOrEmpty(word) == false)
        {
            outputList.Add(new SentenceSplitResult()
            {
                isAWord = true,
                word = word
            });                
        }
    }

    private void SaveANotWord(char _char)
    {
        outputList.Add(new SentenceSplitResult()
        {
            isAWord = false,
            word = _char.ToString()
        });
    }
}
Francesco
0

You could try using a regex to remove the apostrophes that aren't surrounded by letters (i.e. single quotes) and then using the Char static methods to strip all the other characters. By calling the regex first you can keep the contraction apostrophes (e.g. can't) but remove the single quotes like in 'Oh.

string myText = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";

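// Match quotes that lack a word character on at least one side; contraction apostrophes are left alone.
Regex reg = new Regex(@"(?<!\w)[""']|[""'](?!\w)");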
myText = reg.Replace(myText, "");

string[] listOfWords = RemoveCharacters(myText);

public string[] RemoveCharacters(string input)
{
    StringBuilder sb = new StringBuilder();
    foreach (char c in input)
    {
        if (Char.IsLetter(c) || Char.IsWhiteSpace(c) || c == '\'')
           sb.Append(c);
    }

    return sb.ToString().Split(' ');
}
keyboardP