NLTK - nltk.tokenize.RegexpTokenizer - regex not working as expected

Question

I am trying to tokenize text using RegexpTokenizer.

Code:

from nltk.tokenize import RegexpTokenizer
#from nltk.tokenize import word_tokenize

line = "U.S.A Count U.S.A. Sec.of U.S. Name:Dr.John Doe J.Doe 1.11 1,000 10--20 10-20"
pattern = '[\d|\.|\,]+|[A-Z][\.|A-Z]+\b[\.]*|[\w]+|\S'
tokenizer = RegexpTokenizer(pattern)

print(tokenizer.tokenize(line))
#print(word_tokenize(line))

Output:

['U', '.', 'S', '.', 'A', 'Count', 'U', '.', 'S', '.', 'A', '.', 'Sec', '.', 'of', 'U', '.', 'S', '.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J', '.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']

Expected Output:

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']

Why is the tokenizer also splitting my expected tokens "U.S.A" and "U.S."? How can I resolve this issue?

My regex : https://regex101.com/r/dS1jW9/1

score 8 · Accepted Answer · answered Aug 25 '16 at 12:23

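The main problem is that your \b was interpreted as a backspace character, because in a regular (non-raw) Python string literal '\b' is the backspace escape. You need to use a raw string literal so that the regex engine receives \b as a word boundary. Also, the pipes inside your character classes are literal characters there, not alternation operators, so they would match stray | characters in the input and mess up your output.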

This works as expected:

>>> pattern = r'[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S'
>>> tokenizer = RegexpTokenizer(pattern)
>>> print(tokenizer.tokenize(line))

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J.', 'Doe', '1.11', '1,000', '10', '-', '-', '20', '10', '-', '20']
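To see why the raw string matters, here is a small sketch using only the standard `re` module. It assumes that RegexpTokenizer in its default mode returns the same matches as `re.findall` for a pattern without capturing groups; the names `non_raw`, `raw`, `acronym_non_raw` and `acronym_raw` are just illustrative.

```python
import re

non_raw = '\b'   # Python processes the escape: a single backspace character (0x08)
raw = r'\b'      # raw string: backslash + 'b', which the regex engine reads as a word boundary

print(len(non_raw), len(raw))
print(non_raw == '\x08')

line = "U.S.A Count U.S."

# Without the r prefix the pattern requires a literal backspace in the text
acronym_non_raw = re.findall('[A-Z][.A-Z]+\b[.]*', line)

# With the r prefix, \b is a word boundary and the acronyms are matched whole
acronym_raw = re.findall(r'[A-Z][.A-Z]+\b\.*', line)

print(acronym_non_raw)
print(acronym_raw)
```

Since the acronym branch can never match without a backspace in the input, the original pattern falls through to the `[\w]+` and `\S` branches, which is what splits "U.S.A" into single letters and dots.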

Note that putting a single \w into a character class is pointless. Also, you do not need to escape every non-word char (like a dot) in the character class as they are mostly treated as literal chars there (only ^, ], - and \ require special attention).
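A quick way to check both points about character classes, again with plain `re` (the `sample` string is made up for illustration and is not from the question):

```python
import re

sample = "a|b 1,000 1.11 x|y"

# Inside [...] the pipe is a literal character, so it is matched as part of the "number"
with_pipes = re.findall(r'[\d|\.|\,]+', sample)

# Dropping the pipes (and the unneeded escapes) matches only digits, dots and commas
without_pipes = re.findall(r'[\d.,]+', sample)

print(with_pipes)
print(without_pipes)

# An unescaped dot inside a class is a literal dot, not "any character"
dot_in_class = re.findall(r'[.]', 'a.b')
print(dot_in_class)
```

The first pattern picks up the stray `|` characters as tokens, while the second returns only the numeric tokens.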

score 0 · Answer 2 · answered Aug 25 '16 at 12:22

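If you modify your regex so that its first alternative matches runs of the characters U, S, A and dot:

pattern = r'[USA\.]{4,}|[\w]+|[\S]'
tokenizer = RegexpTokenizer(pattern)
print(tokenizer.tokenize(line))

you get output close to what you wanted. The "U.S.A", "U.S.A." and "U.S." tokens stay intact, but "J.", "1.11" and "1,000" are still split: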

['U.S.A', 'Count', 'U.S.A.', 'Sec', '.', 'of', 'U.S.', 'Name', ':', 'Dr', '.', 'John', 'Doe', 'J', '.', 'Doe', '1', '.', '11', '1', ',', '000', '10', '-', '-', '20', '10', '-', '20']

Tim Seed

`'[USA\.]{4,}|[\w]+'` will also match `............`. There is no need to put a single `\w` into a character class or to escape a dot inside a character class. – Wiktor Stribiżew Aug 25 '16 at 12:24
Agreed - but as the test data was given (and I was too lazy to think of a better solution) this is what I gave :) – Tim Seed Aug 25 '16 at 12:27
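The over-matching mentioned in the comments can be reproduced with a short sketch. The `text` string is a made-up input, and `narrow` and `general` are illustrative names for the two acronym patterns discussed in this thread:

```python
import re

text = "............ N.A.S.A."

narrow = r'[USA.]{4,}'            # acronym branch from the second answer
general = r'[A-Z][.A-Z]+\b\.*'    # acronym branch from the accepted answer

narrow_matches = re.findall(narrow, text)
general_matches = re.findall(general, text)

print(narrow_matches)
print(general_matches)
```

The narrow class accepts a run of dots and clips the leading "N" from an acronym it was not written for, while the general pattern requires an uppercase letter at the start and keeps the acronym whole.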
