How do I split a string on different delimiters while keeping some of those delimiters in the output? (Tokenize a string)

Question

More specifically, I want to split a string on any non-alphanumeric character, but when the delimiter is not whitespace I want to keep it. That is, for the input:

my_string = 'Hey, I\'m 9/11 7-11'

I want to get:

['Hey' , ',' , 'I' , "'" , 'm', '9' , '/' , '11', '7' , '-' , '11']

With no whitespace as a list element.

I have tried the following:

re.split('([/\'\-_,.;])|\s', my_string)

But it outputs:

['Hey', ',', '', None, 'I', "'", 'm', None, '9', '/', '11', None, '7', '-', '11']

How do I solve this without 'unnecessary' iterations?

Also, I am having trouble escaping the backslash character, since '\\\\' does not seem to work. Any ideas on how to solve this as well?

Thanks a lot.

score 3 · Accepted Answer · answered Apr 25 '17 at 20:57

You may use

import re
my_string = "Hey, I'm 9/11 7-11"
print(re.findall(r'\w+|[^\w\s]', my_string))
# => ['Hey', ',', 'I', "'", 'm', '9', '/', '11', '7', '-', '11']

See the Python demo

The \w+|[^\w\s] regex matches either one or more word characters (letters, digits, underscores) or a single character that is neither a word character nor whitespace.
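To make the two alternatives concrete, here is a small sketch that wraps the pattern in a helper. The name `tokenize` and the sample inputs are illustrative assumptions, not part of the question. It shows that an underscore stays inside a word token and that consecutive punctuation characters come out one per token.

```python
import re

# The pattern is the one from the answer; TOKEN_RE and tokenize are hypothetical names.
TOKEN_RE = re.compile(r'\w+|[^\w\s]')

def tokenize(text):
    """Return runs of word characters and single non-word, non-whitespace characters."""
    return TOKEN_RE.findall(text)

print(tokenize("foo_bar -- baz!"))  # ['foo_bar', '-', '-', 'baz', '!']
print(tokenize("   "))              # [] (whitespace never becomes a token)
```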

BTW, to match a backslash with a regex, you need to use \\ in a raw string literal (r'\\') or 4 backslashes in a regular one ('\\\\'). It is recommended to use raw string literals to define a regex pattern in Python.
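A quick way to check the backslash point is to compare the two literals directly. This is a minimal sketch; the sample strings are made up for illustration.

```python
import re

# Both literals produce the same two-character pattern: backslash, backslash.
raw_pattern = r'\\'
plain_pattern = '\\\\'
print(raw_pattern == plain_pattern)  # True

path = r'C:\temp\file'  # hypothetical input containing two backslashes
print(re.findall(raw_pattern, path))  # ['\\', '\\']

# The tokenizing pattern above needs no change for backslashes: a backslash is
# neither a word character nor whitespace, so it becomes its own token.
print(re.findall(r'\w+|[^\w\s]', r'a\b'))  # ['a', '\\', 'b']
```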

answered Apr 25 '17 at 20:57

Wiktor Stribiżew


Since I want to split the string, not find elements, is findall the 'correct' method? I have found some discussion of this topic at http://stackoverflow.com/questions/1059559/split-strings-with-multiple-delimiters – h3h325 Apr 25 '17 at 21:06

You seem to want to *tokenize* the string, right? Into "words" (groups of word symbols) and non-word symbols (separately)? Otherwise, use [`res = filter(None, re.split('([/\'\-_,.;])|\s', my_string))`](http://ideone.com/hhAVxN) (with `filter` to get rid of empty values). You are bound to get empty elements when using split with a regex containing capturing groups; this often happens depending on the input string. – Wiktor Stribiżew Apr 25 '17 at 21:07
Yes, exactly. The code you provided works just fine (Thanks!). I was just asking because I have seen someone say this was not the 'correct' way of doing it. – h3h325 Apr 25 '17 at 21:10
Well, look: we match `\w+` (all word char streaks) or the opposite, `\W` (=`[^\w]`, non-word chars), globally, all occurrences (due to `re.findall`), but with the exception of whitespace (adding `\s` to the negated character class - `[^\w\s]`). We can't miss anything that way. All we omit is whitespace. Also, `re.findall` lets you control tokenization better.
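
As a rough comparison of the two approaches discussed in these comments, the sketch below runs both on the question's input and on one extra made-up input. It assumes Python 3, where `filter` returns an iterator, hence the `list(...)` call. The split pattern is written as a raw string, which denotes the same regex as the original literal. The function names are hypothetical.

```python
import re

SPLIT_RE = r"([/'\-_,.;])|\s"  # the question's pattern, as a raw string
TOKEN_RE = r"\w+|[^\w\s]"      # the answer's pattern

def split_tokens(text):
    # re.split yields '' and None entries here; filter(None, ...) drops both.
    return list(filter(None, re.split(SPLIT_RE, text)))

def findall_tokens(text):
    return re.findall(TOKEN_RE, text)

my_string = "Hey, I'm 9/11 7-11"
print(split_tokens(my_string) == findall_tokens(my_string))  # True

# Made-up input: '!' is not in the split pattern's delimiter list,
# and '_' is a delimiter there but a word character for \w.
print(split_tokens("a_b!"))    # ['a', '_', 'b!']
print(findall_tokens("a_b!"))  # ['a_b', '!']
```

The two agree on the question's string, but the split version only recognizes the delimiters it lists, while the findall version treats every non-word, non-whitespace character as a token.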
