[Follow up from my old question with better description and links]
I am trying to match any character (including newlines, tabs, spaces, etc.) between two delimiters, including the delimiters themselves.
For example:
foobar89\n\nfoo\tbar; '''blah blah blah'8&^"'''
need to match
'''blah blah blah'8&^"'''
and
fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^"'''
need to match
'''blah\n blah\n\t\t blah\n'8&^"'''
My Python code (taken and adapted from here), which I am using to test the regexes:
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('BOTH', r'([\'"]{3}).*?\2'), # for both triple-single quotes and triple-double quotes
        ('SINGLE', r"('''.*?''')"), # triple-single quotes
        ('DOUBLE', r'(""".*?""")'), # triple-double quotes
        # regexes which match OK
        ('COM', r'#.*'),
        ('NEWLINE', r'\n'), # Line endings
        ('SKIP', r'[ \t]+'), # Skip over spaces and tabs
        ('MISMATCH', r'.'), # Any other character
    ]
    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

f = r'C:\path_to_python_file_with_examples_to_match'
with open(f) as sfile:
    content = sfile.read()
for t in tokenize(content):
    pass #print(t)
where the file with the examples to match contains:
import csv, urllib

class Q():
    """
    This class holds lhghdhdf hgh dhghd hdfh ghd fh.
    """
    def __init__(self, l, lo, d, m):
        self.l= l
        self.lo= longitude
        self.depth = d
        self.m= m

    def __str__(self):
        # sdasda fad fhs ghf dfh
        d= self.d
        if d== -1:
            d= 'unknown'
        m= self.m
        if m== -1:
            d= 'unknown'
        return (m, d, self.l, self.lo)

foobar89foobar; '''blah qsdkfjqsv,;sv
vqùlvnqùv
dqvnq
vq
v
blah blah'8&^"'''
fjfdaslfdj; '''blah blah
blah
'8&^"'''
From this answer, I tried r"('''.*?''')|"r'(""".*?""")' to match both triple-single-quoted and triple-double-quoted strings, without success. The same happens with r'([\'"]{3}).*?\2'.
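A note on the `\2` backreference, since it may look odd outside the tokenizer: once each pattern is wrapped in `(?P<NAME>...)`, the wrapper itself becomes a capturing group, so the quote-capturing group inside BOTH is group 2 rather than group 1. The sketch below uses a shortened, hypothetical copy of the specification (the name `spec` is not from the code above) to inspect the numbering, and shows that the same pattern pasted on its own, for example into a regex tester, does not even compile.

```python
import re

# Shortened, hypothetical copy of the token specification above.
spec = [
    ('BOTH',   r'([\'"]{3}).*?\2'),
    ('SINGLE', r"('''.*?''')"),
    ('COM',    r'#.*'),
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in spec)
compiled = re.compile(tok_regex)

# groupindex maps each *named* group to its number; the unnamed inner
# groups take the numbers in between (2 and 4 here).
print(dict(compiled.groupindex))
# Total number of capturing groups, named and unnamed.
print(compiled.groups)

# On its own, the BOTH pattern has only one group, so \2 is invalid.
try:
    re.compile(r'([\'"]{3}).*?\2')
    standalone_ok = True
except re.error:
    standalone_ok = False
print(standalone_ok)
```

Because the numbers shift whenever an entry with groups is added or reordered before BOTH, a named backreference such as `(?P=name)` may be less fragile than a numeric one.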
I have set up an online regex tester in which some of the regexes match as expected, but they fail when used in the code above.
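The discrepancy can be reproduced outside the tokenizer. The sketch below assumes (this is not confirmed) that the difference comes from how `.` treats newlines: by default it does not match `\n`, while `re.DOTALL` or a character class such as `[\s\S]` does. The variable names and the sample string are illustrative only; the sample is modelled on the second example at the top.

```python
import re

# Illustrative sample, modelled on the second example in the question.
sample = "fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^\"'''"

# Same idea as the BOTH pattern, but with \1 because there is no outer
# named group wrapping it here.
pattern = r"""(['"]{3}).*?\1"""

# Default flags: '.' stops at '\n', so a multi-line string cannot match.
no_flag = re.search(pattern, sample)

# With re.DOTALL, '.' also matches '\n'.
with_dotall = re.search(pattern, sample, re.DOTALL)

# Flag-free variant: [\s\S] matches any character, newline included.
char_class = re.search(r"""(['"]{3})[\s\S]*?\1""", sample)

print(no_flag)
print(repr(with_dotall.group(0)))
print(repr(char_class.group(0)))
```

If this is indeed the cause, note that passing `re.DOTALL` to the whole joined tokenizer regex would also change `#.*` and the MISMATCH `.`, so the character-class form, which only affects the one pattern, might be the safer choice there.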
I would like to understand Python's regular expressions better, so I would appreciate both a solution (ideally a regex that does the desired matching in my code) and a brief explanation of where I went wrong.