I'm using scikit-learn's CountVectorizer in Python to tokenize sentences and, at the same time, filter out non-existent words like "1s2".
Which regex pattern should I use to select only English words and numbers? The following pattern gets me pretty close:
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'(?u)(?:\b[a-zA-Z]+\b)*(?:\b[\d]+\b)*'
vectorizer = CountVectorizer(ngram_range=(1, 1),
                             stop_words=None,
                             token_pattern=pattern)
tokenize = vectorizer.build_tokenizer()
tokenize('this is a test test1 and 12.')
['this', '', 'is', '', 'a', '', 'test', '', '', '', '',
'', '', '', '', 'and', '', '12', '', '']
but I can't understand why it gives me so many empty list items ('').
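To narrow it down, here is a small sketch that takes CountVectorizer out of the picture and calls the re module directly (the names matches_empty, raw and non_empty are only for illustration). It suggests the empty strings may come from the pattern itself, since both groups are quantified with `*` and the pattern as a whole can therefore match zero characters:

```python
import re

# Same pattern as above, applied with plain re instead of CountVectorizer.
pattern = re.compile(r'(?u)(?:\b[a-zA-Z]+\b)*(?:\b[\d]+\b)*')

# Both groups are quantified with `*`, so the pattern as a whole can match
# zero characters; check that directly against the empty string.
matches_empty = pattern.fullmatch('') is not None

# findall on a tiny input: keep the raw result and the non-empty tokens.
raw = pattern.findall('a 1')
non_empty = [tok for tok in raw if tok]
print(matches_empty, raw, non_empty)
```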
Also, how can I keep the punctuation? In the end, I would like a result like this:
tokenize('this is a test test1 and 12.')
['this','is','a','test','and','12','.']
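For what it's worth, here is a rough sketch of a pattern that seems to give this result with plain re.findall. It rests on two assumptions: that the tokenizer simply applies findall with token_pattern, and that "punctuation" means any single character that is neither a word character nor whitespace. TOKEN_PATTERN and the variable names are only illustrative, and this is a sketch rather than a verified CountVectorizer configuration:

```python
import re

# Illustrative pattern (an assumption, not taken from the scikit-learn docs):
#   \b(?:[a-zA-Z]+|\d+)\b  a whole word made only of letters or only of digits
#   [^\w\s]                one character that is neither word char nor whitespace
# The alternation requires at least one character per match, and the group is
# non-capturing so findall returns the full match.
TOKEN_PATTERN = r'(?u)\b(?:[a-zA-Z]+|\d+)\b|[^\w\s]'

text = 'this is a test test1 and 12.'
tokens = re.findall(TOKEN_PATTERN, text)
print(tokens)
```

Note that `\w` includes the underscore, so under this definition `_` is not treated as punctuation, and a mixed token such as "test1" is dropped entirely rather than split.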