I am trying to learn more about regular expressions in Python, and I am finding it difficult to match any characters (including newlines, tabs, spaces, etc.) between two symbols, including the symbols themselves.
For example:
foobar89\n\nfoo\tbar; '''blah blah blah'8&^"'''
need to match '''blah blah blah'8&^"'''
fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^"'''
need to match '''blah\n blah\n\t\t blah\n'8&^"'''
(Note: the \n and \t symbols above stand for the actual newlines and tabs in the text file.)
Following this question, I have tried both ^.*\'''(.*)\'''.*$ and *?\'''(.*)\'''.* with no success.
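To make the failure concrete, here is a minimal sketch of how the two attempts behave on the second example. The `text` string is my own assumption of what the file contains (with real newlines and tabs), and the last pattern, using `re.DOTALL`, is only a guess at what might be needed, not something I know to be the right approach:

```python
import re

# Assumed sample: the second example, with real newlines and tabs.
text = "fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^\"'''"

# First attempt: by default '.' does not match a newline,
# so the pattern cannot span the lines between the delimiters.
first = re.search(r"^.*\'''(.*)\'''.*$", text)
print(first)  # no match is found

# Second attempt: a pattern that starts with a quantifier does not compile.
try:
    re.compile(r"*?\'''(.*)\'''.*")
    second_compiles = True
except re.error as exc:
    second_compiles = False
    print("second attempt does not compile:", exc)

# Possible direction (assumption): re.DOTALL lets '.' match newlines too,
# and the non-greedy '.*?' stops at the first closing delimiter.
candidate = re.search(r"'''.*?'''", text, re.DOTALL)
print(repr(candidate.group(0)))
```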
Could someone guide me as to what I am doing wrong? I would appreciate any brief explanation as well.
Also, to understand how escaping of special characters works: if I replace the two symbols in the regular expression (e.g. ''' with """ or ***), would it still work for a correspondingly delimited string?
e.g. for
fjfdaslfdj; """blah\n blah\n\t\t blah\n'8&^"""
need to match """blah\n blah\n\t\t blah\n'8&^"""
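To illustrate what I mean by swapping the symbols, here is a small sketch. `between` is a hypothetical helper of my own (not from any library), and it assumes that `re.escape` is an acceptable way to make a delimiter such as *** literal, and that `re.DOTALL` is wanted so the match can span lines:

```python
import re

def between(delim, text):
    """Hypothetical helper: first span from delim to the next delim, inclusive."""
    d = re.escape(delim)  # '***' becomes a literal pattern; '*' is otherwise a quantifier
    m = re.search(d + r".*?" + d, text, re.DOTALL)
    return m.group(0) if m else None

# Assumed sample strings with real newlines and tabs.
sample = 'fjfdaslfdj; """blah\n blah\n\t\t blah\n\'8&^"""'
starred = "x; ***a\nb***"

print(repr(between('"""', sample)))
print(repr(between('***', starred)))
```

Without the `re.escape` call, the `***` case would not even compile, since `*` is a quantifier, whereas the quote characters have no special meaning in a pattern.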
UPDATE
Code I am trying to test the regexes on (taken & modified from here):
import collections
import re
Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])
def tokenize(code):
    token_specification = [
        # regexes suggested by Thomas Ayoub
        ('BOTH', r'([\'"]{3}).*?\2'),  # for both triple-single quotes and triple-double quotes
        ('SINGLE', r"('''.*?''')"),    # triple-single quotes
        ('DOUBLE', r'(""".*?""")'),    # triple-double quotes
        # regexes which match OK
        ('COM', r'#.*'),
        ('NUMBER', r'\d+(\.\d*)?'),    # Integer or decimal number
        ('ASSIGN', r':='),             # Assignment operator
        ('END', r';'),                 # Statement terminator
        ('ID', r'[A-Za-z]+'),          # Identifiers
        ('OP', r'[+\-*/]'),            # Arithmetic operators
        ('NEWLINE', r'\n'),            # Line endings
        ('SKIP', r'[ \t]+'),           # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)
f = r'C:\path_to_python_file_with_above_examples'
with open(f) as sfile:
    content = sfile.read()

for t in tokenize(content):
    pass  # print(t)