
[Follow up from my old question with better description and links]

I am trying to match any character (including newlines, tabs, and other whitespace) between two delimiters, including the delimiters themselves.

For example:

foobar89\n\nfoo\tbar; '''blah blah blah'8&^"'''

need to match

'''blah blah blah'8&^"'''

and

fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^"'''

need to match

'''blah\n blah\n\t\t blah\n'8&^"'''

My Python code (taken and adapted from here), on which I am testing the regexes:

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        ('BOTH',      r'([\'"]{3}).*?\2'), # for both triple-single quotes and triple-double quotes
        ('SINGLE',    r"('''.*?''')"),     # triple-single quotes 
        ('DOUBLE',    r'(""".*?""")'),     # triple-double quotes 
        # regexes which match OK
        ('COM',       r'#.*'),
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH',r'.'),            # Any other character
    ]

    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']

    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

f = r'C:\path_to_python_file_with_examples_to_match'

with open(f) as sfile:
    content = sfile.read()

for t in tokenize(content):
    pass #print(t)

where the file_with_examples_to_match is:

import csv, urllib

class Q():
    """
    This class holds lhghdhdf hgh dhghd hdfh ghd fh.
    """

    def __init__(self, l, lo, d, m):
        self.l= l
        self.lo= longitude
        self.depth = d
        self.m= m

    def __str__(self):
        # sdasda fad fhs ghf dfh
        d= self.d
        if d== -1:
            d= 'unknown'
        m= self.m
        if m== -1:
            d= 'unknown'

        return (m, d, self.l, self.lo)

foobar89foobar; '''blah qsdkfjqsv,;sv
                   vqùlvnqùv 
                   dqvnq
                   vq
                   v

blah blah'8&^"'''
fjfdaslfdj; '''blah blah
     blah
    '8&^"'''

From this answer, I tried r"('''.*?''')|"r'(""".*?""")' to match both triple single quotes and triple double quotes, without success. The same happens with r'([\'"]{3}).*?\2'.

I have set up an online regex tester in which some of these regexes match as expected, but they fail when used in the code above.

I am interested in understanding Python's regular expressions better, so I would appreciate both a solution (perhaps a valid regex that does the desired matching in my code) and a brief explanation so I can see where I went wrong.

  • I think I am failing to understand what you are looking for. Due to the greedy nature of Python regexes, '.*' should capture anything between two apostrophes, including any apostrophes. What exactly is the issue? – Jason Bray Jun 30 '16 at 19:23
  • @JasonBray The issue is that I am trying to match anything between 3 consecutive double quotes or 3 consecutive single quotes. When I use the regexes `r"('''.*?''')"`, `r'(""".*?""")'`, `r'([\'"]{3}).*?\2'`, the online regex testers show that they match as desired, but when they are used in the code in my description they do not match. I am looking to understand why. –  Jun 30 '16 at 21:16

1 Answer


You're probably missing the flag that makes . match newlines as well:

re.finditer(tok_regex, code, flags = re.DOTALL)
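
As a minimal sketch of what the flag changes (the sample string and variable names here are illustrative, not taken from the question's file), the same lazy pattern fails without re.DOTALL and succeeds with it:

```python
import re

# Illustrative input: a triple-quoted block that spans two lines.
text = "'''blah\n blah'''"
pattern = r"'''.*?'''"

# Without flags, "." refuses to match "\n", so the closing quotes are never reached.
print(re.search(pattern, text))  # None

# With re.DOTALL, "." matches "\n" too and the whole block is found.
print(repr(re.search(pattern, text, flags=re.DOTALL).group()))
```

The pattern itself is unchanged; only the meaning of . differs between the two calls.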

With this flag, the tokenizer from the question prints

('BOTH', '"""\n    This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n    """')
('COM', '# sdasda fad fhs ghf dfh\n        d= self.d\n        if d== -1:\n            d= \'unknown\'\n        m= self.m\n        if m== -1:\n            d= \'unknown\'\n\n        return (m, d, self.l, self.lo)\n\nfoobar89foobar; \'\'\'blah qsdkfjqsv,;sv\n                   vq\xc3\xb9lvnq\xc3\xb9v \n                   dqvnq\n                   vq\n                   v\n\nblah blah\'8&^"\'\'\'\nfjfdaslfdj; \'\'\'blah blah\n     blah\n    \'8&^"\'\'\'')

COM now matches far too much, because . also consumes newlines and the greedy .* runs to the end of the file. Make the quantifier lazy and anchor the pattern with $:

('COM',       r'#.*?$')

Then add re.MULTILINE, so that $ matches at the end of every line rather than only at the end of the input:

re.finditer(tok_regex, code, flags = re.DOTALL | re.MULTILINE)

The output now is

('BOTH', '"""\n    This class holds lhghdhdf hgh dhghd hdfh ghd fh.\n    """')
('COM', '# sdasda fad fhs ghf dfh')
('BOTH', '\'\'\'blah qsdkfjqsv,;sv\n                   vq\xc3\xb9lvnq\xc3\xb9v \n                   dqvnq\n                   vq\n                   v\n\nblah blah\'8&^"\'\'\'')
('BOTH', '\'\'\'blah blah\n     blah\n    \'8&^"\'\'\'')
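
The interaction between the lazy quantifier and the two flags can be checked on a small input. This is only a sketch; the sample string and variable names are made up for illustration:

```python
import re

# Illustrative input with two comments on different lines.
code = "x = 1  # first comment\ny = 2\n# second comment\nz = 3"

# DOTALL only, greedy: "." crosses newlines, so the match runs to the end of the input.
greedy = re.findall(r'#.*', code, flags=re.DOTALL)

# DOTALL only, lazy with "$": "$" still matches only at the end of the input,
# so the lazy quantifier is forced to expand all the way there.
lazy_only = re.findall(r'#.*?$', code, flags=re.DOTALL)

# DOTALL | MULTILINE: "$" matches at every line end and ".*?" stops at the first one.
lazy = re.findall(r'#.*?$', code, flags=re.DOTALL | re.MULTILINE)

print(greedy)
print(lazy_only)
print(lazy)
```

Only the last call yields one match per comment, which is why both the lazy quantifier and re.MULTILINE are needed once re.DOTALL is set.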

If you definitely don't want to use flags, there is a kind of 'hack' that avoids . altogether, since that metacharacter matches everything except newlines by default. Use a negated character class that excludes a single character which is highly unlikely to appear in the files you parse, for example the character with ASCII code 0. That character is written \x00 in a regex, so [^\x00] matches every character, newlines included, except the one with ASCII code 0. (That is why it's a hack: it still cannot match every character without flags.) Keep the initial regex for COM, and for BOTH use

('BOTH',      r'([\'"]{3})[^\x00]*?\2')
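
A small sketch of the flag-free variant follows, using a made-up sample string. Note one assumption: the pattern is tested standalone here, so the quote group is group 1 and the backreference is written \1; inside the tokenizer, where the pattern is wrapped in (?P<BOTH>...), the same group is number 2.

```python
import re

# Illustrative input containing one triple-single-quoted and one
# triple-double-quoted block, both spanning several lines.
text = "foo; '''blah\n\tblah'8&^\"'''\nbar = \"\"\"doc\nstring\"\"\""

# [^\x00] matches any character except NUL, so newlines are included
# without re.DOTALL. \1 must repeat the exact opening quotes.
pattern = r'([\'"]{3})[^\x00]*?\1'

matches = [m.group() for m in re.finditer(pattern, text)]
for match in matches:
    print(repr(match))
```

No flags are passed to re.finditer, yet both multi-line blocks are matched.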

Online tools that explain a regex step by step, such as regex101, are highly recommended when working with regexes.

For more complex cases of quote matching you'll need to write a parser. See, for example, "Can the csv format be defined by a regex?" and "When you should NOT use Regular Expressions?".

buld0zzr
  • I understand why regexes are not suitable for this kind of problem and that indeed a parser is more suitable (excellent links by the way). However, this is more an approach/exercise for me to get an exact feel of the limitations of regexes. For this case, a good-enough regex which will match anything between 3 single or double quotes is my goal. –  Jun 30 '16 at 21:09
  • That is perfect. I could have never thought that the `flags` were the solution. I wonder if besides modifying `COM` to be less greedy, other regexes require similar modification as well. e.g. if I also had `('NUMBER', r'\d+(\.\d*)?')` or `('ID', r'[A-Za-z]+')` would they also require modification to be less greedy? –  Jul 01 '16 at 16:45
  • Also, interestingly the regex `('NOFLAGS', r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')` matches everything starting with `/*` and ending with `*/` without the use of the `flags = re.DOTALL | re.MULTILINE`. As an understanding exercise I would appreciate if I could see how your regexes could be modified to match without the use of flags like `NOFLAGS` does (if that is possible of course). –  Jul 01 '16 at 17:07
  • @nk-fford that depends on the situation. If you need a greedy match, then you need to use `+` or `*` by themselves. If not, you add a `?` mark to make them lazy. See for example http://www.rexegg.com/regex-quantifiers.html#lazy_solution. However `('ID', r'[A-Za-z]+?')` would match only a single symbol. Also, if my answer was useful to you please accept it – buld0zzr Jul 01 '16 at 17:09
  • @nk-fford `[^*]` and `.` are different in that the first matches everything (including newlines) except `*`, while `.` matches almost every symbol, but not newline by default. In theory you can replace `.` with a negated character class that excludes a single symbol which would never appear in the text; however, it would make your regex less readable – buld0zzr Jul 01 '16 at 17:11
  • I think I understand the difference and looked up at the docs to see a few examples. If it is not too much of an effort, could you please provide the regexes of your solution with a 'no flags' version so that I can compare and see in practice the differences? –  Jul 01 '16 at 17:31
  • @nk-fford well, that's more of a hack than a proper solution, but assuming we never get \x00 in the text, the regex for `BOTH` would be `r'([\'"]{3})[^\x00]*?\2'` and for `COM` `r'#.*'`. This way it works without any flags set and provides the last output in the answer – buld0zzr Jul 01 '16 at 20:17
  • Your 'hack' works perfectly. Could you please briefly explain the difference between yours and the version with the flags? I do not understand the `[^\x00]`. Also, if you could edit your answer to include the 'hack' it would be an even more complete answer. –  Jul 02 '16 at 07:37
  • @nk-fford included hack description in the answer with more explanation of what `[^\x00]` is – buld0zzr Jul 02 '16 at 13:03
  • On a (probably) related issue, I get a weird `sre_constants.error: cannot refer to an open group at position 142`, which only appears when I use `r'([\'"]{3})[^\x00]*?\2'` or the version with the flags. (Unfortunately I cannot yet post the whole code because it is bigger than the one I have in this question; if the error does not indicate an obvious problem anyone can see, I will post a new SO question.) –  Jul 02 '16 at 16:48
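
One possible cause of that error, offered as a guess since the larger code was not posted: the numbered backreference \2 only points at the quote group while BOTH is the first entry in token_specification. If any entry precedes it, group 2 becomes the still-open (?P<BOTH>...) group itself. The sketch below reproduces this with a hypothetical reduced specification and shows a named backreference as one way around it; the group name `quote` is illustrative.

```python
import re

# Hypothetical reduced specification in which BOTH is no longer first.
spec = [
    ('COM',  r'#.*'),
    ('BOTH', r'([\'"]{3})[^\x00]*?\2'),
]
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in spec)

# Group 1 is COM and group 2 is BOTH itself, so \2 refers to a group
# that is still open at that point and compilation fails.
try:
    re.compile(tok_regex)
    error = None
except re.error as exc:
    error = str(exc)
print(error)

# A named backreference does not depend on the entry's position in the list.
spec[1] = ('BOTH', r'(?P<quote>[\'"]{3})[^\x00]*?(?P=quote)')
tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in spec)

m = re.search(tok_regex, "x '''a\nb'''")
print(m.lastgroup, repr(m.group('BOTH')))
```

With the named group, `mo.lastgroup` still reports `BOTH`, because the outer group is the last one to close.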