
My regexp is:

TMP_REGEXP = r'_\(\s*(.*)\s*\)\s*$'
TMP_PATTERN = re.compile(TMP_REGEXP, re.MULTILINE)

File input_data.txt:

print _(
    'Test #A'
    )              

print _(
    '''Test #B'''
    '''Test #C'''
)

I run it like this:

with codecs.open('input_data.txt', encoding='utf-8') as flp:
    content = flp.read()

extracted = re.findall(TMP_PATTERN, content)

What I want to achieve: take all characters that follow '_(' and stop reading at a ')' that is followed by zero or more whitespace characters and the end of the line.

Interestingly, 'Test #A' works like a charm, but 'Test #B' is skipped.

Drachenfels

1 Answer

This worked for me:

m = re.findall(r'(?s)_\((.*?)\)', content)

(?s) is the inline form of re.DOTALL: it makes . match any character, including newlines.

_\( matches your desired start.

(.*?) lazily captures as few characters as possible, so the match stops at the first closing parenthesis.

\) matches your end.
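To see what each piece contributes, here is a small sketch on a made-up string (the `text` value and variable names are illustrative assumptions, not part of the question's data). It compares the pattern without `(?s)`, with `(?s)` and a lazy group, and with `(?s)` and a greedy group:

```python
import re

# Illustrative input: one call spanning several lines, one on a single line.
text = "_(\n  'x'\n) _('y')"

# Without DOTALL, '.' cannot cross a newline, so the multi-line call is skipped.
no_dotall = re.findall(r'_\((.*?)\)', text)

# With (?s), '.' also matches newlines; the lazy group stops at the first ')'.
lazy = re.findall(r'(?s)_\((.*?)\)', text)

# A greedy group runs to the last ')' and merges both calls into one match.
greedy = re.findall(r'(?s)_\((.*)\)', text)

print(no_dotall)  # ["'y'"]
print(lazy)       # ["\n  'x'\n", "'y'"]
print(greedy)     # ["\n  'x'\n) _('y'"]
```

The first result mirrors the behaviour in the question: a call whose body contains a newline that `.` has to cross never matches.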

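You might want to anchor the pattern with $ at the end (which also needs the multiline flag, e.g. (?sm), so that $ matches at the end of each line) and strip the surrounding whitespace from each match.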
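One possible way to add the end-of-line anchor and the stripping is sketched below. The name `PATTERN` and the extra `(a, b)` text in the sample are assumptions made for illustration; `[ \t]*` is used in place of `\s*` so the trailing whitespace check stays on one line.

```python
import re

# (?s): '.' matches newlines; (?m): '$' matches at the end of every line.
# The lazy group keeps growing until a ')' is followed only by spaces/tabs
# and the end of a line.
PATTERN = re.compile(r'(?sm)_\((.*?)\)[ \t]*$')

content = """
print _(
    'Test #A'
    )              

print _(
    '''Test #B (a, b) and more'''
    '''Test #C'''
)
"""

# Strip the leading/trailing newlines and indentation from each capture.
extracted = [match.strip() for match in PATTERN.findall(content)]

for i, match in enumerate(extracted, 1):
    print("Match {0}: {1}".format(i, match))
```

Here the `)` after `(a, b)` does not end the match because more text follows it on the same line. A `)` inside the argument that happens to be the last character on its line would still end the match early, so this is a heuristic rather than a full solution.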

>>> content = """
... print _(
...     'Test #A'
...     )              
... 
... print _(
...     '''Test #B'''
...     '''Test #C'''
... )
... """
>>> import re
>>> m = re.findall(r'(?s)_\((.*?)\)', content)
>>> for i, match in enumerate(m, 1):
...     print("Match {0}: {1}".format(i, match))
... 
Match 1: 
    'Test #A'

Match 2: 
    '''Test #B'''
    '''Test #C'''

>>>
erip
  • There is one twist to figure out: what if the text is Test #B ('a', 'b')\n Test #C? That is why I want to stop reading only at a ) that is followed by nothing. But overall I am a step closer, thank you. – Drachenfels Jun 08 '16 at 17:00
  • @Drachenfels Then you can't do it with a regex. That is not a regular language, so a regex cannot match that language. – erip Jun 08 '16 at 17:01
  • That is __not__ a lookbehind. – Kenneth K. Jun 08 '16 at 17:02
  • Personally, I would say exactly what [the documentation](https://docs.python.org/2/library/re.html#re.S) says: `Make the '.' special character match any character at all, including a newline`. – Kenneth K. Jun 08 '16 at 17:06
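
For the nested-parentheses twist raised in the comments, one alternative to a regex is a small scanner that counts parenthesis depth. The following is a minimal sketch, not a complete parser: `extract_calls` is a hypothetical helper name, and it assumes parentheses are balanced and does not treat parentheses inside string literals or comments specially.

```python
def extract_calls(text, opener="_("):
    """Return the stripped text inside each `_(...)`, honouring nested parentheses."""
    results = []
    pos = text.find(opener)
    while pos != -1:
        start = pos + len(opener)
        depth = 1  # the opener's own '(' is already open
        i = start
        while i < len(text) and depth > 0:
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced input: stop rather than guess
        # i now points just past the matching ')'; exclude it from the slice.
        results.append(text[start:i - 1].strip())
        pos = text.find(opener, i)
    return results


sample = "print _(\n    'Test #B' ('a', 'b')\n    'Test #C'\n)\n"
calls = extract_calls(sample)
print(calls)
```

Because the scanner tracks depth, the inner `('a', 'b')` no longer ends the capture, even though its `)` is the last character on the line.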