
I'm using scikit-learn's CountVectorizer in Python to tokenize sentences and, at the same time, filter out non-existent words like "1s2".

Which re pattern should I use to select only English words and numbers? The following regex pattern gets me pretty close:

pattern = '(?u)(?:\\b[a-zA-Z]+\\b)*(?:\\b[\d]+\\b)*'

vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words=None,
                             token_pattern=pattern)
tokenize = vectorizer.build_tokenizer()

tokenize('this is a test test1 and 12.')

['this', '', 'is', '', 'a', '', 'test', '', '', '', '',
 '', '', '', '', 'and', '', '12', '', '']

but I can't understand why it gives me so many empty list items ('').

Also, how can I keep the punctuation? In the end I would like a result like this:

tokenize('this is a test test1 and 12.')

['this','is','a','test','and','12','.']
divibisan
Miguel

1 Answer


I do not know whether sklearn's CountVectorizer can do it in one step (I believe token_pattern is ignored when a tokenizer is supplied), but you can tokenize first and then filter the tokens (based on this answer):

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
import re

vectorizer = CountVectorizer(ngram_range=(1,1), stop_words=None,
                             tokenizer=TreebankWordTokenizer().tokenize)
tokenize = vectorizer.build_tokenizer()

tokenList = tokenize('this is a test test1 and 12.')
# ['this', 'is', 'a', 'test', 'test1', 'and', '12', '.']

# Keep only tokens that are (i) all letters, (ii) all digits, or (iii) a single non-word character such as a punctuation mark
tokenList = [token for token in tokenList if re.match(r'^([a-zA-Z]+|\d+|\W)$', token)]
# ['this', 'is', 'a', 'test', 'and', '12', '.']
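
The code above filters the list only after the tokenizer has been called by hand. If the same filter should also run while the vectorizer builds its vocabulary, one option is to wrap both steps in a single callable. The following is a minimal sketch under stated assumptions: `tokenize_and_filter` is a hypothetical helper name, and the simple regex split stands in for TreebankWordTokenizer so that the example runs without NLTK.

```python
import re

# Stand-in splitter (assumption, not NLTK): runs of word characters,
# or single punctuation marks.
SPLIT_RE = re.compile(r'\w+|[^\w\s]')
# Same filter as above: all letters, all digits, or one non-word character.
KEEP_RE = re.compile(r'[a-zA-Z]+|\d+|\W')

def tokenize_and_filter(text):
    """Split text into rough tokens, then drop mixed tokens such as 'test1'."""
    return [tok for tok in SPLIT_RE.findall(text) if KEEP_RE.fullmatch(tok)]

print(tokenize_and_filter('this is a test test1 and 12.'))
# ['this', 'is', 'a', 'test', 'and', '12', '.']
```

Passing this function as CountVectorizer(tokenizer=tokenize_and_filter) should then apply the same filtering inside fit_transform.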

EDIT: I forgot to tell you why your pattern doesn't work.

  • "The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)." (How sklearn's token_pattern works). So with the default pattern, punctuation marks are never returned as tokens.
  • Your pattern (?u)(?:\\b[a-zA-Z]+\\b)*(?:\\b[\d]+\\b)* is actually saying: 'Interpret as unicode, then match zero or more runs of letters between word boundaries, followed by zero or more runs of digits between word boundaries'. Because both groups are optional (the *), the empty string '' also matches the pattern, so you get an empty token at every position where neither group can match any text.
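
Both points can be checked with plain `re`, since the tokenizer built from token_pattern essentially calls `findall` with the compiled pattern. The sketch below first confirms that the original pattern matches the empty string, then tries one possible replacement that uses alternation (`|`) instead of two optional groups and adds a branch for punctuation. The names `original` and `fixed` are illustrative only, and the replacement pattern is an assumption about what you want, not the only way to write it.

```python
import re

text = 'this is a test test1 and 12.'

# The pattern from the question: both groups may repeat zero times.
original = r'(?u)(?:\b[a-zA-Z]+\b)*(?:\b[\d]+\b)*'
print(re.fullmatch(original, '') is not None)  # True: '' is a valid match

# Possible replacement: a whole word of letters, OR a whole number,
# OR a single punctuation character. No branch can match the empty string.
fixed = r'(?u)\b[a-zA-Z]+\b|\b\d+\b|[^\w\s]'
tokens = re.findall(fixed, text)
print(tokens)
# ['this', 'is', 'a', 'test', 'and', '12', '.']
```

Since `fixed` contains no capturing groups, it should also be usable directly as token_pattern=fixed in CountVectorizer.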
divibisan
Nander Speerstra