I'm trying to tokenize a string such as hello world123 into hello, world and 123. I think I have the two regexes that are needed, but I cannot combine them to tokenize it properly.

(?u)\b\w+\b
(?<=\D)(?=\d)|(?<=\d)(?=\D)
PrakashG
David Kong
  • The first regex extracts whole words consisting of letters, digits and underscores (some other Unicode categories are included too, but that is basically what `\w` matches), and the second one splits digits from non-digits. I think all you want is to match streaks of 1 or more letters or of 1 or more digits. See my solution below. – Wiktor Stribiżew Feb 07 '19 at 09:05
  • Thanks, this is really helpful. Maybe to add to the complexity a little, how can we separate something like 1.5mL into "1.5" and "mL" – David Kong Feb 07 '19 at 11:12
  • [`[^\W\d_]+|\d+(?:\.\d+)?`](https://regex101.com/r/8qAxP4/1/) – Wiktor Stribiżew Feb 07 '19 at 11:14
  • Thank you for your help – David Kong Feb 08 '19 at 14:07

1 Answer

You may use

import re
s = "hello world123"
print(re.findall(r'[^\W\d_]+|\d+', s))
# => ['hello', 'world', '123']

See the Python demo

Pattern details

  • [^\W\d_]+ - 1 or more letters (any word character that is neither a digit nor an underscore)
  • | - or
  • \d+ - 1 or more digits.

See the regex demo.
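For comparison, the two patterns from the question can also be combined literally: extract the words first, then split each word at every digit/non-digit boundary. The sketch below assumes Python 3.7 or later, where `re.split` accepts patterns that match an empty string; the function names `tokenize_two_step` and `tokenize_single` are made up for illustration. The `(?u)` flag is left out because that behavior is the default in Python 3.

```python
import re

# The two patterns from the question (without the Python 2 (?u) flag).
WORD = re.compile(r'\b\w+\b')
DIGIT_BOUNDARY = re.compile(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)')


def tokenize_two_step(text):
    """Hypothetical helper: find words, then split each at digit boundaries."""
    tokens = []
    for word in WORD.findall(text):
        # Zero-width split; needs Python 3.7+.
        tokens.extend(DIGIT_BOUNDARY.split(word))
    return tokens


def tokenize_single(text):
    """Hypothetical helper: the single-pattern approach from this answer."""
    return re.findall(r'[^\W\d_]+|\d+', text)


print(tokenize_two_step("hello world123"))  # ['hello', 'world', '123']
print(tokenize_single("hello world123"))    # ['hello', 'world', '123']

# The two approaches differ on underscores, which \w matches but
# [^\W\d_] does not.
print(tokenize_two_step("foo_42"))  # ['foo_', '42']
print(tokenize_single("foo_42"))    # ['foo', '42']
```

Both give the same result for the example input. The single pattern is shorter and drops underscores, so it is probably the better fit unless underscores need to be kept.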

BONUS: To match letter substrings and numbers in various formats (optional sign, decimal part, scientific notation), use

[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?

See this regex demo.

See Parsing scientific notation sensibly? for the regex details.
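As a rough check, the bonus pattern can be run on a few inputs, including the 1.5mL case raised in the comments on the question. The sample strings and the name `TOKEN` are chosen only for illustration.

```python
import re

# Letters, or a number with optional sign, decimal part and exponent.
TOKEN = re.compile(r'[^\W\d_]+|[-+]?\d*\.?\d+(?:[eE][+-]?\d+)?')

print(TOKEN.findall("hello world123"))  # ['hello', 'world', '123']
print(TOKEN.findall("1.5mL"))           # ['1.5', 'mL']
print(TOKEN.findall("x = -3.2e5 kg"))   # ['x', '-3.2e5', 'kg']
```

Note that the optional sign is taken as part of the number wherever it directly precedes a digit, so an input such as 3-1 yields 3 and -1. If that is not wanted, the `[-+]?` part can be dropped.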

Wiktor Stribiżew
  • and there may be a subtle difference between `[^\W\d_]` and `[A-Za-z]` I suppose, for localized accented letters – Jean-François Fabre Feb 07 '19 at 07:49
  • @Jean-FrançoisFabre Yes, also, here is a [relevant thread about `[^\W\d_]`](https://stackoverflow.com/questions/6314614/match-any-unicode-letter). For Python 2 users, to make it fully Unicode-aware, `re.U` flag (or its inline `(?u)` embedded flag) is necessary (the behavior is default in Python 3). And, in those scenarios, to match only ASCII digits, `[0-9]+` would be more appropriate than `\d+` that will match all Unicode digits. – Wiktor Stribiżew Feb 07 '19 at 07:52
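The differences discussed in these comments can be checked directly. The sketch below assumes Python 3 with `str` patterns, where Unicode matching is the default. The sample strings are made up and written with escapes so that they do not depend on the source file encoding.

```python
import re

text = 'caf\u00e9 na\u00efve'    # "café naïve"
digits = 'abc\u0661\u0662\u0663'  # "abc" followed by Arabic-Indic digits 1, 2, 3

# [^\W\d_] matches any Unicode letter; [A-Za-z] stops at accented letters.
unicode_letters = re.findall(r'[^\W\d_]+', text)
ascii_letters = re.findall(r'[A-Za-z]+', text)
print(unicode_letters)  # ['café', 'naïve']
print(ascii_letters)    # ['caf', 'na', 've']

# With re.ASCII, \W and \d are ASCII-only, so the same pattern behaves
# like [A-Za-z].
ascii_flagged = re.findall(r'[^\W\d_]+', text, re.ASCII)
print(ascii_flagged)    # ['caf', 'na', 've']

# \d matches any Unicode decimal digit; [0-9] matches ASCII digits only.
unicode_digits = re.findall(r'\d+', digits)
ascii_digits = re.findall(r'[0-9]+', digits)
print(unicode_digits)   # ['١٢٣']
print(ascii_digits)     # []
```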