Split string with regex by new lines, symbols and whitespaces in Python

Question

I'm new to the regex library, and I'm trying to tokenize a text like this:

"""constructor SquareGame new(){
let square=square;
}"""

The expected output is a list:

['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '}']

I need to create a list of tokens separated by whitespace, newlines and these symbols: {}()[].;,+-*/&|<>=~.

I used re.findall('[,;.()={}]+|\S+|\n', text), but it seems to separate tokens by whitespace and newlines only.

score 2 · Answer 1 · answered Jun 25 '20 at 09:52

You may use

re.findall(r'\w+|[^\w \t]', text)

To avoid matching any Unicode horizontal whitespace, use

re.findall(r'\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]', text)

See the regex demo. Details:

\w+ - 1 or more word chars
| - or
[^\w \t] - a single non-word char that is neither a space nor a tab char (so all vertical whitespace, such as newlines, is matched).

You may add more horizontal whitespace chars to the [^\w \t] character class to exclude them; see their list at Match whitespace but not newlines. The regex will then look like \w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].
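As a rough illustration of the difference between the two patterns, here is a minimal sketch. The sample string with a no-break space (U+00A0) is a made-up input, and the variable names are only for illustration.

```python
import re

# Pattern from this answer: only ASCII space and tab are excluded.
basic = r"\w+|[^\w \t]"
# Extended pattern from this answer: Unicode horizontal whitespace is excluded too.
extended = r"\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

# Hypothetical sample: a no-break space (U+00A0) between two words.
sample = "let\u00A0square"

basic_tokens = re.findall(basic, sample)
extended_tokens = re.findall(extended, sample)

print(basic_tokens)     # ['let', '\xa0', 'square']
print(extended_tokens)  # ['let', 'square']
```

With the basic pattern the no-break space comes back as a token of its own; the extended pattern skips it like a regular space.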

See the Python demo:

import re
pattern = r"\w+|[^\w \t]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
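
For comparison, here is a small sketch of why the pattern from the question keeps symbols attached to words: \S+ matches any run of non-whitespace chars, so it consumes new(){ as one token before the symbol class gets a chance to split it. The variable names are only illustrative.

```python
import re

text = "constructor SquareGame new(){\nlet square=square;\n}"

# Pattern from the question: \S+ grabs whole runs of non-whitespace,
# so symbols that follow a word stay glued to it.
question_tokens = re.findall(r"[,;.()={}]+|\S+|\n", text)
print(question_tokens)
# ['constructor', 'SquareGame', 'new(){', '\n', 'let', 'square=square;', '\n', '}']

# Pattern from this answer: \w+ stops at the first non-word char,
# so each symbol becomes its own token.
answer_tokens = re.findall(r"\w+|[^\w \t]", text)
print(answer_tokens)
```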

Thanks, this works, even if I don't understand your code very well. - TheSaxo, Jun 28 '20 at 07:37

score 0 · Accepted Answer · answered Jun 25 '20 at 10:01

This regex matches only word characters, newlines and the symbols that you listed, and I think this is a safer method.

>>> re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
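
One consequence of the explicit list is worth seeing in a small sketch: characters that are neither word chars nor in the list are skipped rather than returned. The tokenize helper and the "a @ b" input below are hypothetical, added only for illustration.

```python
import re

# Symbols from the question, plus newline, as in the pattern above.
TOKEN_RE = re.compile(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]")

def tokenize(text):
    """Return words, listed symbols and newlines; anything else is skipped."""
    return TOKEN_RE.findall(text)

text = "constructor SquareGame new(){\nlet square=square;\n}"
tokens = tokenize(text)
print(tokens)

# Hypothetical input: '@' is not in the symbol list, so it is dropped.
unlisted = tokenize("a @ b")
print(unlisted)  # ['a', 'b']
```

Whether dropping unlisted characters is desirable depends on the input; if they should be reported instead, the [^\w \t] approach from the other answer returns them as tokens.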
