How to tokenize, splitting the adjacent digit-letter?

Question

I'm trying to tokenize something like hello world123 into hello, world and 123. I think I have the two regexes that are needed, but I cannot combine them to tokenize properly.

(?u)\b\w+\b
(?<=\D)(?=\d)|(?<=\d)(?=\D)

The first regex extracts whole words consisting of letters, digits and underscores (some other Unicode categories are included too, but that is basically what `\w` matches), and the second one splits digits from non-digits. I think all you want is to match streaks of 1 or more letters or of 1 or more digits. See my solution below. — Wiktor Stribiżew, Feb 07 '19 at 09:05
Thanks, this is really helpful. To add a little complexity, how can we separate something like 1.5mL into "1.5" and "mL"? — David Kong, Feb 07 '19 at 11:12
[`[^\W\d_]+|\d+(?:\.\d+)?`](https://regex101.com/r/8qAxP4/1/) — Wiktor Stribiżew, Feb 07 '19 at 11:14

Wiktor Stribiżew · Accepted Answer · 2019-02-07T11:17:01.263


You may use

import re
s = "hello world123"
print(re.findall(r'[^\W\d_]+|\d+', s))
# => ['hello', 'world', '123']

See the Python demo

Pattern details

[^\W\d_]+ - 1 or more letters (any word character that is not a digit or an underscore)
| - or
\d+ - 1 or more digits.

See the regex demo.

BONUS: To match any letter substrings and numbers of various kinds use

[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?

See this regex demo.

See "Parsing scientific notation sensibly?" for the regex details.
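As a rough illustration of the BONUS pattern, here is a minimal sketch that wraps it in a helper and runs it on the inputs mentioned in this thread. The names `TOKEN_RE` and `tokenize` are made up for this example and are not part of the original answer.

```python
import re

# BONUS pattern from above: a run of letters, or a number with an
# optional sign, optional fraction and optional exponent.
# TOKEN_RE and tokenize are hypothetical names used only for this sketch.
TOKEN_RE = re.compile(r'[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?')

def tokenize(text):
    """Return all letter runs and numbers found in text, in order."""
    return TOKEN_RE.findall(text)

print(tokenize("hello world123"))     # => ['hello', 'world', '123']
print(tokenize("1.5mL"))              # => ['1.5', 'mL']
print(tokenize("take -2.5e3 units"))  # => ['take', '-2.5e3', 'units']
```

Note that the optional sign is part of the number, so an input such as `3-4` comes back as `['3', '-4']`. If hyphens should not be treated as signs, dropping `[-+]?` from the pattern is one option.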

edited Feb 07 '19 at 11:17

answered Feb 07 '19 at 07:46

Wiktor Stribiżew



There may also be a subtle difference between `[^\W\d_]` and `[A-Za-z]`, I suppose, for localized accented letters – Jean-François Fabre Feb 07 '19 at 07:49
@Jean-FrançoisFabre Yes, also, here is a [relevant thread about `[^\W\d_]`](https://stackoverflow.com/questions/6314614/match-any-unicode-letter). For Python 2 users, the `re.U` flag (or its inline `(?u)` embedded flag) is necessary to make the pattern fully Unicode-aware (this behavior is the default in Python 3). And, in those scenarios, to match only ASCII digits, `[0-9]+` would be more appropriate than `\d+`, which matches all Unicode digits. – Wiktor Stribiżew Feb 07 '19 at 07:52
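
To make the difference discussed in these comments concrete, here is a small Python 3 sketch comparing the Unicode-aware pattern with ASCII-only variants. The sample string and variable names are invented for this example, and the use of `re.ASCII` is an addition that the comments above do not mention.

```python
import re

# Hypothetical sample: "caf\u00e9" is "café"; "\u0664\u0665" are the
# Arabic-Indic digits 4 and 5.
text = "caf\u00e9123 x\u0664\u0665"

# Python 3 default: \W and \d are Unicode-aware for str patterns.
unicode_tokens = re.findall(r'[^\W\d_]+|\d+', text)

# Explicit ASCII ranges: the accented letter and non-ASCII digits are skipped.
ascii_tokens = re.findall(r'[A-Za-z]+|[0-9]+', text)

# Same pattern as the answer, restricted to ASCII with the re.ASCII flag.
ascii_flag_tokens = re.findall(r'[^\W\d_]+|\d+', text, flags=re.ASCII)

print(unicode_tokens)     # four tokens: "café", "123", "x", and the two Arabic-Indic digits
print(ascii_tokens)       # => ['caf', '123', 'x']
print(ascii_flag_tokens)  # => ['caf', '123', 'x']
```

With the ASCII-only variants, `é` is simply dropped from the output rather than splitting into its own token, so which form is appropriate depends on the text being tokenized.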
