New to Python - I need some help figuring out how to write a tokenizer method in Python without using any libraries like NLTK. How would I start? Thank you!
3 Answers
Depending on the complexity, you can simply use the string split function.
# Words independent of sentences
words = raw_text.split(' ')
# Sentences and words
sentences = raw_text.split('. ')
words_in_sentences = [sentence.split(' ') for sentence in sentences]
If you want to do something more sophisticated, you can use the re module, which provides support for regular expressions.
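For example, a minimal regex-based tokenizer might pull out runs of word characters and individual punctuation marks (the pattern and function name below are my own illustration, not a standard recipe):

import re

def simple_tokenize(raw_text):
    # \w+ matches a run of word characters; [^\w\s] matches any single
    # character that is neither a word character nor whitespace (punctuation)
    return re.findall(r"\w+|[^\w\s]", raw_text)

print(simple_tokenize("Hello, world! This is a test."))
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']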

I assume you are talking about a tokenizer for a compiler. Such tokens are usually definable by a regular language, for which regular expressions/finite state automata are the natural solution. An example:
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

def lexer(text):
    IDENTIFIER = r'(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z_0-9]*)'
    ASSIGNMENT = r'(?P<ASSIGNMENT>=)'
    NUMBER = r'(?P<NUMBER>\d+)'
    MULTIPLIER_OPERATOR = r'(?P<MULTIPLIER_OPERATOR>[*/])'
    ADDING_OPERATOR = r'(?P<ADDING_OPERATOR>[+-])'
    WHITESPACE = r'(?P<WHITESPACE>\s+)'
    EOF = r'(?P<EOF>\Z)'
    ERROR = r'(?P<ERROR>.)'  # catch everything else, which is an error

    tokenizer = re.compile('|'.join([IDENTIFIER, ASSIGNMENT, NUMBER, MULTIPLIER_OPERATOR,
                                     ADDING_OPERATOR, WHITESPACE, EOF, ERROR]))
    seen_error = False
    for m in tokenizer.finditer(text):
        if m.lastgroup != 'WHITESPACE':  # ignore whitespace
            if m.lastgroup == 'ERROR':
                if not seen_error:
                    yield Token(m.lastgroup, m.group())
                    seen_error = True  # scan until we find a non-error input
            else:
                yield Token(m.lastgroup, m.group())
                seen_error = False
        else:
            seen_error = False

for token in lexer('foo = x12 * y / z - 3'):
    print(token)
Prints:
Token(type='IDENTIFIER', value='foo')
Token(type='ASSIGNMENT', value='=')
Token(type='IDENTIFIER', value='x12')
Token(type='MULTIPLIER_OPERATOR', value='*')
Token(type='IDENTIFIER', value='y')
Token(type='MULTIPLIER_OPERATOR', value='/')
Token(type='IDENTIFIER', value='z')
Token(type='ADDING_OPERATOR', value='-')
Token(type='NUMBER', value='3')
Token(type='EOF', value='')
The above code defines each token, such as IDENTIFIER, ASSIGNMENT, etc., as a simple regular expression, combines them into a single pattern using the | operator, and compiles that expression as the variable tokenizer. It then uses the regular expression finditer method with the input text as its argument to create a "scanner" that tries to match successive input tokens against the tokenizer regular expression. As long as there are matches, Token instances consisting of a type and a value are yielded by the lexer generator function. In this example, WHITESPACE tokens are not yielded, on the assumption that whitespace is to be ignored by the parser and only serves to separate other tokens.

There is a catchall ERROR token, defined last, that matches a single character if none of the other token regular expressions match (a . is used for this, which will not match a newline character unless the re.S flag is used; but there is no need to match a newline, since the newline character is already matched by the WHITESPACE token regular expression and is therefore a "legal" match). Special code is added to prevent successive ERROR tokens from being generated. In effect, the lexer generates an ERROR token and then throws away input until it can once again match a legal token.
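As a quick sanity check of that error handling (this input is my own illustration, not from the original answer), feeding the lexer a string containing illegal characters such as $ yields a single ERROR token and then resumes at the next legal token:

for token in lexer('foo = $$$ 42'):
    print(token)

# Expected output (the run of three '$' characters collapses into one ERROR token):
# Token(type='IDENTIFIER', value='foo')
# Token(type='ASSIGNMENT', value='=')
# Token(type='ERROR', value='$')
# Token(type='NUMBER', value='42')
# Token(type='EOF', value='')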

You could use gensim instead:
import gensim

tokenized_word = gensim.utils.simple_preprocess(str(sentences), deacc=True)  # deacc=True removes punctuation
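As a rough sketch of what this returns (the example text is mine), simple_preprocess lowercases the text, strips punctuation, and by default drops very short tokens:

import gensim

sentences = "Hello, World! This is a short example sentence."
print(gensim.utils.simple_preprocess(sentences, deacc=True))
# roughly: ['hello', 'world', 'this', 'is', 'short', 'example', 'sentence']
# ('a' is dropped by the default minimum token length of two characters)

Note, though, that this pulls in the gensim library, which the question explicitly wanted to avoid.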
