0

I expect the following code to tokenize

this is an example 123

into

['this', 'is', 'an', 'example 123'] 

but it doesn't treat the number as part of the preceding word. Any suggestions?

import re
from nltk.tokenize import RegexpTokenizer
pattern=re.compile(r"[\w\s\d]+")
tokenizer_number=RegexpTokenizer(pattern)
tokenizer_number.tokenize("this is an example 123")
DirtyBit
Rebecca

3 Answers

1

A fairly general-purpose regex for tokenizing:

[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S

This topic has been solved before, here.

You can test regexes interactively at https://regex101.com
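
As a quick check, here is a minimal sketch of what that pattern produces on the question's input. It uses the standard library's `re.findall` rather than NLTK, on the assumption that this is close enough to how the tokenizer applies a pattern. Note that it splits the number off as its own token, so it does not by itself give the `'example 123'` grouping asked for.

```python
import re

# The pattern quoted in this answer, applied to the question's sample string.
pattern = re.compile(r"[\d.,]+|[A-Z][.A-Z]+\b\.*|\w+|\S")

tokens = pattern.findall("this is an example 123")
print(tokens)  # words and the number come out as separate tokens
```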

A.HEDDAR
0

Using str.split():

s = "this is an example 123"    
print(s.split(" ", 3))

OUTPUT:

['this', 'is', 'an', 'example 123']
DirtyBit
0

Your regex is wrong. You are matching any sequence of letters, digits or spaces. You meant this instead:

pattern=re.compile(r"\w+\s\d+|\w+")

Or equivalently, you could write that as r"\w+(?:\s\d+)?".
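
To see the difference, here is a small sketch comparing the question's pattern with the two patterns above. It uses plain `re.findall` instead of `RegexpTokenizer`, assuming the tokenizer (in its default, non-gaps mode) returns the pattern's matches in the same way. The variable names are only for illustration.

```python
import re

text = "this is an example 123"

# Question's pattern: the character class includes whitespace,
# so a single match swallows the whole string.
original = re.compile(r"[\w\s\d]+")

# A word followed by whitespace and digits, or else just a word.
alternation = re.compile(r"\w+\s\d+|\w+")

# Same idea written with an optional non-capturing group.
optional_group = re.compile(r"\w+(?:\s\d+)?")

original_tokens = original.findall(text)
alternation_tokens = alternation.findall(text)
optional_tokens = optional_group.findall(text)

print(original_tokens)
print(alternation_tokens)
print(optional_tokens)
```

The non-capturing group `(?:...)` matters here: with a capturing group, `findall` would return the group contents instead of the full match.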

alexis