
I am trying to learn more about regular expressions in Python, and I am finding it difficult to match any run of characters (including newlines, tabs, spaces, etc.) between two delimiters, with the delimiters themselves included in the match.

For example:

  • foobar89\n\nfoo\tbar; '''blah blah blah'8&^"''' need to match '''blah blah blah'8&^"'''

  • fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^"''' need to match '''blah\n blah\n\t\t blah\n'8&^"'''

(Note: \n and \t stand for the actual newlines and tab characters in a text file.)

Following this question, I have tried this ^.*\'''(.*)\'''.*$ and this *?\'''(.*)\'''.* with no success.

Could someone guide me as to what I am doing wrong? I would appreciate any brief explanation as well.

Also, to understand how escaping of special characters works: if I replace the delimiter in the regular expression (e.g. from ''' to """ or to ***), would it still work for a correspondingly delimited string?

e.g. for

  • fjfdaslfdj; """blah\n blah\n\t\t blah\n'8&^""" need to match """blah\n blah\n\t\t blah\n'8&^"""

UPDATE

Code I am trying to test the regexes on (taken & modified from here):

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    token_specification = [
        # regexes suggested by Thomas Ayoub
        ('BOTH',      r'([\'"]{3}).*?\2'), # for both triple-single quotes and triple-double quotes
        ('SINGLE',    r"('''.*?''')"),     # triple-single quotes 
        ('DOUBLE',    r'(""".*?""")'),     # triple-double quotes 
        # regexes which match OK
        ('COM',       r'#.*'),
        ('NUMBER',  r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',  r':='),           # Assignment operator
        ('END',     r';'),            # Statement terminator
        ('ID',      r'[A-Za-z]+'),    # Identifiers
        ('OP',      r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE', r'\n'),           # Line endings
        ('SKIP',    r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH',r'.'),            # Any other character
    ]

    test_regexes = ['COM', 'BOTH', 'SINGLE', 'DOUBLE']

    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            pass
        else:
            if kind in test_regexes:
                print(kind, value)
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

f = r'C:\path_to_python_file_with_above_examples'

with open(f) as sfile:
    content = sfile.read()

for t in tokenize(content):
    pass #print(t)

1 Answer


You can go with:

((['"]{3}).*?\2)

See live running python or live running regex


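  • ^.*\'''(.*)\'''.*$ => the ^ and $ anchors tie the match to the start and end of the string (or of a single line with re.MULTILINE), and . does not match a newline by default, so a block that spans several lines cannot match
  • *?\'''(.*)\'''.* => syntax error: the leading *? has nothing to repeat
  • re.compile(r'(([\'"]{3}).*?\2)', re.MULTILINE | re.DOTALL) => re.DOTALL makes . match newlines as well. re.MULTILINE only affects ^ and $, so it is optional here. In Python 3 use the r'' prefix; ur'' is Python 2 only.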
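As a minimal sketch of the last bullet, the snippet below runs the same pattern with re.DOTALL against a made-up sample string. The sample text and the helper name `quoted_blocks` are illustrative assumptions, not part of the original question.

```python
import re

# Hypothetical sample: real newlines and tabs sit between the delimiters.
text = "fjfdaslfdj; '''blah\n blah\n\t\t blah\n'8&^\"''' and \"\"\"x\ny\"\"\" end"

# Group 1 is the whole quoted block, group 2 is the opening delimiter.
# \2 forces the block to close with the same three characters it opened with.
# re.DOTALL lets '.' cross line boundaries.
pattern = re.compile(r'(([\'"]{3}).*?\2)', re.DOTALL)


def quoted_blocks(s):
    """Return every triple-quoted block in s, delimiters included."""
    return [m.group(1) for m in pattern.finditer(s)]


blocks = quoted_blocks(text)
for block in blocks:
    print(repr(block))
```

Dropping the re.DOTALL flag should make both blocks in this sample disappear from the result, because each of them contains a newline that . then refuses to match.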
Thomas Ayoub
  • It does work for `'''` but not for `"""` and I do not understand why. Could you perhaps add a brief explanation of why the 2 regular expressions I tried did not work (especially since I got them from another answer where they were working)? Also, why do I not need to 'escape' the symbols (i.e. `\'''`)? –  Jun 21 '16 at 07:50
  • Thanks for the edit and explanation. Could you also briefly address the other 2 things (matching other symbols like `"""` and the escaping of characters)? ..I just edited both my comment above and my question to add a similar example input-output for `"""` –  Jun 21 '16 at 07:57
  • @nk-fford you *need* to escape the special char that's your string delimiter: `"string with \" as delimiter don't need to escape ' char"` and `'string with \' as delimiter don\'t need to escape " char"`. Thus, see the escape in the regex. If you need to capture other symbol you can use [character class](http://www.regular-expressions.info/charclass.html) – Thomas Ayoub Jun 21 '16 at 08:01
  • The character class is very interesting. Just to be sure I get it right, say I want to match `"""bl"ah"\n 'blah'\n\t\t'8&^"""` from `"dasd a'\n\t """bl"ah"\n 'blah'\n\t\t'8&^""" dasd a`, could you please show a regex for that? –  Jun 21 '16 at 08:07
  • @nk-fford try [here](https://regex101.com/r/pF3uX1/1). If you succeed, go to *code generator* and see python. If you fail, come back to me ;) – Thomas Ayoub Jun 21 '16 at 08:09
  • @nk-fford any feedback? – Thomas Ayoub Jun 25 '16 at 10:11
  • I tried `r"('''.*?''')|"r'(""".*?""")` to match both triple single quotes and triple double quotes, but the latter fails in Python. Interestingly, both work if I try them on [regex101](https://regex101.com/r/pF3uX1/1). In Python, when I switch to `r"('''.*?''')|"r"(""".*?""")"` I get a `Statement expected, found BAD_CHARACTER` for the second `.*?` part. Any suggestions please? –  Jun 25 '16 at 13:40
  • @nk-fford I've added the char class & backreference so it won't match if the comment opening & closing use different characters. If you use `ur'(([\'"]{3}).*?\2)'` within your code it should work – Thomas Ayoub Jun 26 '16 at 08:34
  • Again, your new regex works on the online tools, but errors when put in my python code. When I put `r'(([\'"]{3}).*?\2)'` in [this code](https://docs.python.org/3/library/re.html#writing-a-tokenizer) I get a `sre_constants.error: cannot refer to an open group at position 26`. :( –  Jun 26 '16 at 09:45
  • Update: I had to very slightly modify the regex to `r'([\'"]{3}).*?\2'` to avoid the error in my [code](https://docs.python.org/3/library/re.html#writing-a-tokenizer). Unfortunately, it still only matches the triple single-quotes but not the triple double-quotes. Any suggestions please? –  Jun 26 '16 at 10:45
  • Could you edit your question with your Python code please @nk-fford ? – Thomas Ayoub Jun 26 '16 at 13:51
  • I added the code I use with some comments in my question. Please see and help me gain understanding. Hopefully you could suggest the regexes to match triple single and triple double quotes successfully. –  Jun 29 '16 at 15:24
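One possible way to make the tokenizer from the question handle both kinds of triple quotes is sketched below, under a few assumptions. The tokenizer wraps every pattern in `(?P<NAME>...)` and joins them with `|`, so numbered backreferences such as `\2` change meaning depending on where the pattern lands; a named backreference does not. The names `STRING` and `q`, the trimmed token list, and the sample input are hypothetical. `[\s\S]` is used in place of a global re.DOTALL so that `#.*` still stops at the end of the line. The pattern is written in a single-quoted raw string because `"""` cannot appear inside a double-quoted literal.

```python
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

TOKEN_SPECIFICATION = [
    # (?P<q>...) remembers the opening delimiter and (?P=q) must repeat it,
    # so the pattern does not depend on group numbers in the joined regex.
    ('STRING',   r'(?P<q>\'\'\'|""")[\s\S]*?(?P=q)'),
    ('COM',      r'#.*'),      # comment up to the end of the line
    ('NEWLINE',  r'\n'),       # line endings
    ('SKIP',     r'[ \t]+'),   # spaces and tabs
    ('MISMATCH', r'.'),        # any other character
]
TOK_REGEX = '|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPECIFICATION)


def tokenize(code):
    line_num = 1
    line_start = 0
    for mo in re.finditer(TOK_REGEX, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind in ('SKIP', 'MISMATCH'):
            continue
        else:
            yield Token(kind, value, line_num, mo.start() - line_start)
            # A multi-line string also advances the line bookkeeping.
            newlines = value.count('\n')
            if newlines:
                line_num += newlines
                line_start = mo.start() + value.rfind('\n') + 1


# Hypothetical input with one block of each kind and a trailing comment.
sample = (
    "a '''one\n\ttwo\"''' # note\n"
    'b """three\'\'\'\nfour"""\n'
)

for token in tokenize(sample):
    print(token)
```

The `cannot refer to an open group` error mentioned above is consistent with this numbering shift: once `(([\'"]{3}).*?\2)` is wrapped in a named group, group 2 becomes the outer parenthesis, which is still open when `\2` is reached.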