
New to Python - I need some help figuring out how to write a tokenizer method in Python without using any libraries like NLTK. How would I start? Thank you!

    What exactly does your input look like? In general, you can use `example_string.split()` to retrieve a list of words (without an argument, this splits on whitespace). – Ohley Oct 13 '20 at 10:05

3 Answers


Depending on the complexity, you can simply use the string split method.

# Words independent of sentences
words = raw_text.split(' ')

# Sentences and words
sentences = raw_text.split('. ')
words_in_sentences = [sentence.split(' ') for sentence in sentences]

If you want to do something more sophisticated, you can use modules like re, which provides support for regular expressions. [Related question]
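
For instance, a minimal regex-based word tokenizer could look like the sketch below (the pattern, the simple_tokenize name, and the sample text are only illustrative assumptions, not something from the question):

import re

# One token per word (\w+) or per single punctuation character ([^\w\s]);
# whitespace is skipped implicitly because it matches neither alternative.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("Hello, world! It's 2020."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2020', '.']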

mpSchrader

I assume you are talking about a tokenizer for a compiler. Such tokens are usually definable by a regular language, for which regular expressions/finite state automata are the natural solution. An example:

import re
from collections import namedtuple


Token = namedtuple('Token', ['type','value'])

def lexer(text):

    IDENTIFIER = r'(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z_0-9]*)'
    ASSIGNMENT = r'(?P<ASSIGNMENT>=)'
    NUMBER = r'(?P<NUMBER>\d+)'
    MULTIPLIER_OPERATOR = r'(?P<MULTIPLIER_OPERATOR>[*/])'
    ADDING_OPERATOR = r'(?P<ADDING_OPERATOR>[+-])'
    WHITESPACE = r'(?P<WHITESPACE>\s+)'
    EOF = r'(?P<EOF>\Z)'
    ERROR = r'(?P<ERROR>.)' # catch everything else, which is an error

    tokenizer = re.compile('|'.join([IDENTIFIER, ASSIGNMENT, NUMBER, MULTIPLIER_OPERATOR, ADDING_OPERATOR, WHITESPACE, EOF, ERROR]))
    seen_error = False
    for m in tokenizer.finditer(text):
        if m.lastgroup != 'WHITESPACE': #ignore whitespace
            if m.lastgroup == 'ERROR':
                if not seen_error:
                    yield Token(m.lastgroup, m.group())
                    seen_error = True # scan until we find a non-error input
            else:
                yield Token(m.lastgroup, m.group())
                seen_error = False
        else:
            seen_error = False
                

for token in lexer('foo = x12 * y / z - 3'):
    print(token)

Prints:

Token(type='IDENTIFIER', value='foo')
Token(type='ASSIGNMENT', value='=')
Token(type='IDENTIFIER', value='x12')
Token(type='MULTIPLIER_OPERATOR', value='*')
Token(type='IDENTIFIER', value='y')
Token(type='MULTIPLIER_OPERATOR', value='/')
Token(type='IDENTIFIER', value='z')
Token(type='ADDING_OPERATOR', value='-')
Token(type='NUMBER', value='3')
Token(type='EOF', value='')

The above code defines each token type, such as IDENTIFIER, ASSIGNMENT, etc., as a simple regular expression with a named group, combines them into a single pattern using the | operator, and compiles that pattern as tokenizer. It then uses the pattern's finditer method with the input text as its argument to create a "scanner" that matches successive input tokens against the combined regular expression. For each match, the lexer generator function yields a Token instance whose type is the name of the group that matched (available as m.lastgroup) and whose value is the matched text. In this example, WHITESPACE tokens are not yielded, on the assumption that whitespace only serves to separate other tokens and is to be ignored by the parser.
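
As a quick illustration of the named-group mechanism (this toy pattern is not part of the answer's lexer), finditer reports which alternative matched through lastgroup:

import re

# Toy pattern with two named alternatives, for illustration only
pattern = re.compile(r'(?P<WORD>[a-zA-Z]+)|(?P<NUMBER>\d+)')
for m in pattern.finditer('abc 123 def'):
    print(m.lastgroup, m.group())  # WORD abc, NUMBER 123, WORD def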

There is a catch-all ERROR token, defined last, that matches a single character when none of the other token regular expressions match (a . is used for this; it will not match a newline unless the re.S flag is given, but there is no need to match a newline here, since newlines are already matched by the WHITESPACE regular expression and are therefore "legal" matches). Special code prevents successive ERROR tokens from being generated: in effect, the lexer yields one ERROR token and then throws away input until it can once again match a legal token.
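
For example, feeding the lexer above an input containing characters that none of the other rules match (the $ characters here are just a made-up illustration) should yield a single ERROR token for the whole illegal run:

for token in lexer('foo = $$$ 3'):
    print(token)

# Token(type='IDENTIFIER', value='foo')
# Token(type='ASSIGNMENT', value='=')
# Token(type='ERROR', value='$')
# Token(type='NUMBER', value='3')
# Token(type='EOF', value='')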

Booboo

Use gensim instead.

import gensim

# deacc=True removes accent marks; simple_preprocess itself lowercases the text and strips punctuation
tokenized_word = gensim.utils.simple_preprocess(str(sentences), deacc=True)
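
A minimal, self-contained sketch of what this produces, assuming gensim is installed (the sample sentence is made up, and the exact output depends on gensim's defaults such as min_len=2):

import gensim

sample = "Héllo, World! This is a tokenizer example."
print(gensim.utils.simple_preprocess(sample, deacc=True))
# expected: ['hello', 'world', 'this', 'is', 'tokenizer', 'example']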

DSBLR