I'm trying to tokenize something like hello world123 into hello, world and 123.
I have the two parts of the code that are needed, I think, but cannot combine them to properly tokenize.
(?u)\b\w+\b
(?<=\D)(?=\d)|(?<=\d)(?=\D)
You may use
import re
s = "hello world123"
print(re.findall(r'[^\W\d_]+|\d+', s))
# => ['hello', 'world', '123']
See the Python demo
Pattern details
[^\W\d_]+ - 1 or more letters
| - or
\d+ - 1+ digits.
See the regex demo.
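To make the pattern details above concrete, here is a small sketch that wraps the same regex in a helper. The names TOKEN_RE and tokenize are made up for illustration, and the second input is an assumed test string, not one from the question. It shows that [^\W\d_] excludes the underscore and, with Python 3 str patterns, also matches non-ASCII letters.

```python
import re

# Same pattern as in the answer: runs of letters OR runs of digits.
# [^\W\d_] = "a word character that is neither a digit nor an underscore".
TOKEN_RE = re.compile(r'[^\W\d_]+|\d+')

def tokenize(text):
    """Return letter-only and digit-only chunks, in order of appearance."""
    return TOKEN_RE.findall(text)

print(tokenize("hello world123"))
# => ['hello', 'world', '123']

# Underscores are skipped, and accented letters count as letters.
print(tokenize("foo_bar42 héllo7"))
# => ['foo', 'bar', '42', 'héllo', '7']
```

Compiling the pattern once is only a convenience when the helper is called many times; re.findall with the raw pattern string behaves the same.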
BONUS: To match any letter substrings and numbers of various kinds use
[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?
See this regex demo.
See Parsing scientific notation sensibly? for the regex details.
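As a rough sketch of how the BONUS pattern behaves, the snippet below runs it on a couple of assumed inputs (the sample strings and the name BONUS_RE are illustrative only, not from the question).

```python
import re

# BONUS pattern from the answer: letter runs, or numbers with an optional
# sign, optional fractional part, and optional exponent.
BONUS_RE = re.compile(r'[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?')

print(BONUS_RE.findall("temp -3.5e10 and x42"))
# => ['temp', '-3.5e10', 'and', 'x', '42']

# Caveat: a sign directly before digits is treated as part of the number,
# so a subtraction-like string yields a negative token.
print(BONUS_RE.findall("10-3"))
# => ['10', '-3']
```

If hyphens in the input are separators rather than signs, the simpler [^\W\d_]+|\d+ pattern is probably the safer choice.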