8

What would be the best regular expression for tokenizing an English text?

By an English token, I mean an atom consisting of the maximum number of characters that can be meaningfully used for NLP purposes. An analogy is a "token" in any programming language (e.g. in C, '{', '[', 'hello', '&', etc. can be tokens). There is one restriction: though English punctuation characters can be "meaningful", let's ignore them for the sake of simplicity when they do not appear in the middle of \w+. So, "Hello, world." yields 'hello' and 'world'; similarly, "You are good-looking." may yield either [you, are, good-looking] or [you, are, good, looking].
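For concreteness, here's roughly the behavior I'm after (a hypothetical sketch with one naive candidate pattern, not a proposed answer):

```python
import re

# Naive candidate: runs of word characters, optionally joined by hyphens.
pattern = re.compile(r"\w+(?:-\w+)*")
print(pattern.findall("Hello, world."))          # ['Hello', 'world']
print(pattern.findall("You are good-looking."))  # ['You', 'are', 'good-looking']
```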

Costique
OTZ
  • See [this question](http://stackoverflow.com/questions/992176/c-tokenize-a-string-using-a-regular-expression) about tokenizing a string in C++ using Boost.Regex. – Lazer Sep 13 '10 at 20:00
  • possible duplicate of [True definition of an English word?](http://stackoverflow.com/questions/3690195/true-definition-of-an-english-word) – Daniel Vandersluis Sep 13 '10 at 20:08
  • @OTZ in English what is a token if not a word? – Daniel Vandersluis Sep 13 '10 at 20:13
  • @Paul that's a technicality ;) According to regex's `\w`, numbers are words anyways! :) – Daniel Vandersluis Sep 13 '10 at 20:17
  • @Vandersluis token != word. 'jiojfe909j94398aija' is a token, even though it is not a word. – OTZ Sep 13 '10 at 20:18
  • @OTZ no need to be rude. I never said a "word" (in this context) had to appear in a dictionary. – Daniel Vandersluis Sep 13 '10 at 20:18
  • @Daniel: I think that's a bit of a non-statement. \w actually doesn't cover the set of English words in the dictionary or grammatically correct English words. So something a bit more fundamental is needed. – Paul Nathan Sep 13 '10 at 20:19
  • @OTZ: C has a formal specification. English has no such specification. *You* have to provide the specification of what you want. We can't guess what you are thinking. – Mark Byers Sep 13 '10 at 20:19
  • @Vandersluis But you know the difference, right? An English word is not some base64 string, but an English token can be any \w+ and more. – OTZ Sep 13 '10 at 20:21
  • http://books.google.com/books?id=fZmj5UNK8AQC&lpg=PA70&ots=LqWc__MGMD&dq=3.22%20tokenization%20speech%20and%20language%20processing&pg=PA71#v=onepage&q&f=false – anno Sep 13 '10 at 20:23
  • @Byers Added the definition of an English token. Let me know if it does not make sense to you. – OTZ Sep 13 '10 at 20:27
  • @OTZ: Perhaps you should try to explain in more detail what you need this regular expression for. What is the context? How will it be used? Do you really need the "best" solution or are you just looking for a quick hack that will work on a small set of data that you are studying? – Mark Byers Sep 13 '10 at 20:30
  • You need to be more specific about what you want to consider a token. Should spaces be tokens? Punctuation marks? There are limitations to what you can do with a regular expression (e.g., distinguishing between `'` used as an apostrophe versus a single quotation mark). – Adrian McCarthy Sep 13 '10 at 20:35
  • Tokenization in Perl. http://www.essex.ac.uk/linguistics/research/resgroups/clgroup/Resources/Nugues/CountingWords/tokenize.perl.html – anno Sep 13 '10 at 20:44
  • @Byers, @McCarthy Right. Edited again with some restriction to make it simpler. Whether 'good-looking' should be a single token or two tokens is an interesting question. – OTZ Sep 13 '10 at 20:47

4 Answers

5

Treebank Tokenization

Penn Treebank (PTB) tokenization is a reasonably common tokenization scheme used for natural language processing (NLP) work.

You can find a sed script with the appropriate regular expressions for this tokenization [here](http://www.cis.upenn.edu/~treebank/tokenization.html).

Software Packages

However, most NLP packages provide ready-to-use tokenizers, so you don't really need to write your own. For example, if you're using Python, you can just use the TreebankWordTokenizer provided with NLTK. If you're using the Java-based Stanford Parser, it will by default tokenize any sentence you give it using its edu.stanford.nlp.process.PTBTokenizer.
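For instance, a minimal sketch with NLTK (assuming the nltk package is installed; the output shown is typical PTB-style tokenization):

```python
# Minimal sketch: PTB-style tokenization via NLTK's TreebankWordTokenizer.
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("You are good-looking."))
# -> ['You', 'are', 'good-looking', '.']
# Hyphenated words stay intact; the sentence-final period becomes its own token.
```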

dmcer
  • Thanks for giving us a pointer to the PTB tokenization method. While they don't enumerate what those "subtleties" are on hyphens vs dashes, and I'm not sure if "won't --> wo n't" or "gonna --> gon na" is appropriate, it can be a starter. +1 – OTZ Sep 14 '10 at 00:43
  • [This link](http://www.cis.upenn.edu/~treebank/tokenization.html) seems to be broken now. – Anderson Green Aug 13 '19 at 20:32
2

You probably shouldn't try to use a regular expression for tokenizing English text. In English, some tokens have several different meanings, and you can only tell which one is right from the context in which they appear; that requires understanding the meaning of the text to some extent. Examples:

  • The character ' could be an apostrophe or it could be used as a single-quote to quote some text.
  • The period could be the end of a sentence or it could signify an abbreviation. Or in some cases it could fulfil both roles simultaneously.
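To see why this is hard for a plain regex, here's a small illustration (a hypothetical snippet; the sample text is made up) of how a character-class pattern conflates apostrophes with single quotation marks:

```python
import re

text = "'Tis a pity that Bob's dog said 'hello.'"
# A character class can't tell apostrophes from single quotation marks:
print(re.findall(r"[\w']+", text))
# -> ["'Tis", 'a', 'pity', 'that', "Bob's", 'dog', 'said', "'hello", "'"]
```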

Try a natural language parser instead. For example you could use the Stanford Parser. It is free to use and will do a much better job than any regular expression at tokenizing English text. That's just one example though - there are also many other NLP libraries you could use.

Mark Byers
  • tokenizing != parsing. He's talking about lexing (unless I miss my guess). – Paul Nathan Sep 13 '10 at 20:00
  • @Nathan you got that right. Byers is referring to a tagger, which is not my focus. – OTZ Sep 13 '10 at 20:05
  • @Paul Nathan: You can't *accurately* tokenize English text using a regular expression. If you only want it to work some of the time and don't care about errors then you can probably get away with using a simple regular expression. If you want it to work most of the time then you need something more powerful. You could keep extending the regex to cover more and more special cases, but seeing as the more powerful solutions already exist and are free, why not just use them from the start? – Mark Byers Sep 13 '10 at 20:08
  • Pain of integration, for one thing. :-) OP hasn't discussed his target corpus. If it's a basic analysis, a regex will work. If it's for a more precise problem, of course you want a more developed system. At a guess, OP wants a basic hack, since an expert would frame the question much more precisely. Also Perl regexes are not true regexes, they are context-sensitive somethings. – Paul Nathan Sep 13 '10 at 20:13
1

You can split on `[^\p{L}]+`. It splits on every run of characters that contains no letters.
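In Python, a rough equivalent needs the third-party regex module, since the built-in re does not support `\p{L}` (a sketch under that assumption):

```python
import regex  # third-party package; supports Unicode property classes like \p{L}

text = "Hello, world. You are good-looking."
tokens = [t for t in regex.split(r"[^\p{L}]+", text) if t]
print(tokens)  # ['Hello', 'world', 'You', 'are', 'good', 'looking']
```

Note that the hyphen is not a letter, so 'good-looking' comes out as two tokens, which the question explicitly allows.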


Colin Hebert
0

There are some complexities.

A word will have `[A-Za-z0-9\-]`. But you may have other delimiters besides whitespace around the word: a token can be preceded by something like `[(\s]` and followed by something like `[),.\-\s?:;!]` (the hyphen should be escaped or placed last so it isn't read as a range).
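A quick sketch of this idea in Python (hypothetical and deliberately simple; it ignores apostrophes, Unicode, and most punctuation subtleties):

```python
import re

text = "(You are good-looking, world!)"
# Grab runs of word characters and hyphens; everything else acts as a delimiter.
print(re.findall(r"[A-Za-z0-9-]+", text))
# -> ['You', 'are', 'good-looking', 'world']
```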

Paul Nathan
  • Noooo. Don't do this. Use \b instead. It matches a word boundary. So this would match a word: \b.+?\b – Rohan Singh Sep 13 '10 at 20:07
  • `\b` won't work properly if the word contains non-ASCII characters! – Daniel Vandersluis Sep 13 '10 at 20:09
  • @Rohan: That won't work for hyphenated words or apostrophe'd words. Also, this is *not* a full Perl regex. This is a *sample regex* meant to demonstrate in a non-Perl syntax a subset of possibility. – Paul Nathan Sep 13 '10 at 20:10