Regular expression tokenization with numbers?

Question

I expect the following code to tokenize

this is an example 123

into

['this', 'is', 'an', 'example 123']

but it doesn't treat the number as part of the preceding word. Any suggestions?

import re
from nltk.tokenize import RegexpTokenizer
pattern=re.compile(r"[\w\s\d]+")
tokenizer_number=RegexpTokenizer(pattern)
tokenizer_number.tokenize("this is an example 123")

See https://stackoverflow.com/questions/55619297/how-to-prevent-splitting-specific-words-or-phrases-and-numbers-in-nltk — alvas, Apr 12 '19 at 05:02

score 1 · Accepted Answer · answered Apr 09 '19 at 14:09


A fairly well-formed regex:

[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S

This topic was solved before in : Here!

You can test regexes interactively at https://regex101.com
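
To see what this pattern does with the sentence from the question, here is a small check. It is only a sketch: it uses re.findall as a stand-in for the tokenizer, on the assumption that RegexpTokenizer in its default mode simply returns the pattern's matches. On this input the number comes out as a separate token, so the pattern on its own does not keep 'example 123' together.

```python
import re

# Alternatives are tried left to right: runs of digits with '.' or ',',
# then upper-case abbreviations, then plain words, then any other
# single non-space character.
pattern = re.compile(r"[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S")

tokens = pattern.findall("this is an example 123")
print(tokens)  # ['this', 'is', 'an', 'example', '123']
```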

answered Apr 09 '19 at 14:09

A.HEDDAR


This pattern works: pattern=r'[\w]+[\s]+[\d?]+[\w]|\w+\S' – Rebecca Apr 10 '19 at 09:00

score 0 · Answer 2 · answered Apr 09 '19 at 13:47


Using str.split():

s = "this is an example 123"    
print(s.split(" ", 3))

OUTPUT:

['this', 'is', 'an', 'example 123']

answered Apr 09 '19 at 13:47

DirtyBit


This is hardcoded to solve only that example; it won't work in general cases at all – Proyag Apr 10 '19 at 08:54
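
The limitation raised in the comment can be illustrated with a second sentence. The second input below is a made-up example, not taken from the question.

```python
# maxsplit=3 just stops splitting after the third space, so only the
# tail of this particular sentence stays together.
first = "this is an example 123".split(" ", 3)
print(first)   # ['this', 'is', 'an', 'example 123']

# With the number elsewhere in the sentence, it is split off as usual.
second = "see example 123 here".split(" ", 3)
print(second)  # ['see', 'example', '123', 'here']
```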

alexis · Answer 3 · 2019-04-10T09:20:33.203


Your regex is wrong. You are matching any sequence of letters, digits or spaces. You meant this instead:

pattern=re.compile(r"\w+\s\d+|\w+")

Or equivalently, you could write that as r"\w+(?:\s\d+)?".
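
A quick way to compare the question's pattern with the corrected one is sketched here. It uses re.findall rather than RegexpTokenizer, on the assumption that the tokenizer in its default (non-gaps) mode returns the pattern's matches; the variable names are illustrative.

```python
import re

text = "this is an example 123"

# The pattern from the question: a single character class that also
# allows whitespace, so the whole sentence is one match.
original = re.compile(r"[\w\s\d]+")
original_tokens = original.findall(text)
print(original_tokens)  # ['this is an example 123']

# A word, optionally followed by one whitespace character and digits.
# The group is non-capturing so findall returns the whole match.
fixed = re.compile(r"\w+(?:\s\d+)?")
fixed_tokens = fixed.findall(text)
print(fixed_tokens)  # ['this', 'is', 'an', 'example 123']
```

Since \w already covers digits, the \d in the original character class was redundant; the real problem was the \s inside the class.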

edited Apr 10 '19 at 09:20

answered Apr 10 '19 at 08:32

alexis

