I'm new to the regex library, and I'm trying to turn a text like this

"""constructor SquareGame new(){
let square=square;
}"""

into a list of tokens like this:

['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=',  'square', ';', '}']

I need to create a list of tokens separated by whitespace, newlines, and these symbols: {}()[].;,+-*/&|<>=~.

I used re.findall('[,;.()={}]+|\S+|\n', text), but it seems to separate tokens by whitespace and newlines only.

Wiktor Stribiżew
TheSaxo

2 Answers

You may use

re.findall(r'\w+|[^\w \t]', text)

To avoid matching any Unicode horizontal whitespace use

re.findall(r'\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]', text)

Details:

  • \w+ - 1 or more word chars
  • | - or
  • [^\w \t] - a single non-word char that is neither a space nor a tab (so all vertical whitespace is matched).

You may add more horizontal whitespace chars to the [^\w \t] character class to exclude them; see the list in "Match whitespace but not newlines". The regex will then look like \w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000].
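
As a rough illustration of the difference between the two patterns, the sketch below runs both on a string that contains a no-break space (U+00A0). The sample string and the variable names are made up for this example.

```python
import re

# Hypothetical sample: a no-break space (U+00A0) sits between "let" and "x".
sample = "let\u00A0x = 1;"

basic = r"\w+|[^\w \t]"
extended = r"\w+|[^\w \t\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]"

basic_tokens = re.findall(basic, sample)
extended_tokens = re.findall(extended, sample)

print(basic_tokens)     # the no-break space shows up as its own token
print(extended_tokens)  # the no-break space is skipped like a regular space
```

With the short pattern, any horizontal whitespace other than a plain space or tab ends up in the token list, which is usually only a concern if the input may contain such characters.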

See the Python demo:

import re
pattern = r"\w+|[^\w \t]"
text = "constructor SquareGame new(){\nlet square=square;\n}"
print(re.findall(pattern, text))
# => ['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
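
For comparison, here is a small sketch of what the pattern from the question does on the same text. It seems to fail because, wherever a token starts with a letter, the symbol class cannot match, so \S+ takes over and consumes the whole non-whitespace run, symbols included. The variable name original_tokens is just for illustration.

```python
import re

text = "constructor SquareGame new(){\nlet square=square;\n}"

# Pattern from the question: at "new(){" the symbol class cannot match at "n",
# so \S+ is tried next and greedily swallows "new(){" as a single token.
original_tokens = re.findall(r"[,;.()={}]+|\S+|\n", text)
print(original_tokens)
# ['constructor', 'SquareGame', 'new(){', '\n', 'let', 'square=square;', '\n', '}']
```

Replacing \S+ with \w+ stops each word at the first symbol, which is what lets the symbols be matched separately.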
Wiktor Stribiżew

This regex matches only words and the symbols you listed (plus newlines), which I think is a safer approach.

>>> re.findall(r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]", text)
['constructor', 'SquareGame', 'new', '(', ')', '{', '\n', 'let', 'square', '=', 'square', ';', '\n', '}']
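
To make the trade-off concrete, here is a minimal sketch comparing the explicit symbol class with a catch-all class on an input that contains a character outside the list. The sample string and variable names are hypothetical.

```python
import re

# Hypothetical input containing "@", which is not in the allowed symbol list.
sample = "let x = a @ b;"

explicit = r"\w+|[{}()\[\].;,+\-*/&|<>=~\n]"
catch_all = r"\w+|[^\w \t]"

explicit_tokens = re.findall(explicit, sample)
catch_all_tokens = re.findall(catch_all, sample)

print(explicit_tokens)   # "@" is silently dropped
print(catch_all_tokens)  # "@" is kept as a token
```

Note that the explicit class skips unlisted characters without raising an error, so if unexpected input should be reported rather than ignored, a separate check would be needed.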
Ronie Martinez