I have a bunch of files in a directory which have hashes as below:

%some_hash = 
(
...
)

%some_other_hash = 
(
...
)

along with a bunch of other stuff in the file. I am listing the files in that directory and reading them in a loop. I want to extract only the data above: everything in the parentheses along with the %word before it. Of course, there can be nested parentheses inside as well. The basic regexes I tried do not work: they cut the text short at the first closing parenthesis they find.

I am using re.findall so that I get everything for a file in a list.

Wiktor Stribiżew
user775093

1 Answer

Here is a regex that should work for you:

%(?P<hash_string>[a-zA-Z_]+)\s?=(?:\s+)?(?P<hash_value>\(.*?\))

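You also need to pass the re.DOTALL flag when compiling the regex. By default . matches any character except a newline, so without the flag .*? cannot span the lines between the parentheses. (\s already matches \r and \n regardless of the flag.)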
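To see what the flag changes, here is a small sketch. The sample string and the variable names are made up for illustration; only `re.findall` and `re.DOTALL` come from the standard library.

```python
import re

# Hypothetical sample: a parenthesised block that spans several lines.
text = "(\na\n)"

# Without DOTALL, "." does not match "\n", so the pattern cannot span lines.
no_flag = re.findall(r"\(.*?\)", text)
print(no_flag)    # []

# With DOTALL, "." also matches "\n".
with_flag = re.findall(r"\(.*?\)", text, re.DOTALL)
print(with_flag)  # ['(\na\n)']

# "\s" matches newlines with or without the flag.
ws = re.findall(r"=\s+\(", "=\n(")
print(ws)         # ['=\n(']
```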

You can find an explanation of the regex here: https://regex101.com/r/wB5eH9/4

Here is an example:

>>> import re
>>> pattern = re.compile(r'%(?P<hash_string>[a-zA-Z_]+)\s?=(?:\s+)?(?P<hash_value>\(.*?\))', re.DOTALL)
>>> data = """
... %some_hash = 
... (
... ...
... )
... 
... %some_other_hash = 
... (
... ...
... )"""
>>> pattern.findall(data)
[('some_hash', '(\n...\n)'), ('some_other_hash', '(\n...\n)')]
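Note that the non-greedy `\(.*?\)` ends at the first `)`, so a hash containing nested parentheses would be cut short. Python's built-in `re` module has no recursive patterns, so one possible workaround is to find the `%name =` header with a regex and then count parentheses by hand. The following is only a sketch: `HEADER` and `extract_hashes` are hypothetical names, and it assumes that parentheses never appear unbalanced inside quoted strings or comments in the files.

```python
import re

# Matches the "%name =" header up to, but not including, the opening "(".
HEADER = re.compile(r"%(?P<name>[a-zA-Z_]+)\s*=\s*(?=\()")

def extract_hashes(text):
    """Return (name, body) pairs, where body is the balanced "(...)" block."""
    results = []
    pos = 0
    while True:
        m = HEADER.search(text, pos)
        if m is None:
            break
        start = m.end()  # index of the opening "("
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "(":
                depth += 1
            elif text[i] == ")":
                depth -= 1
                if depth == 0:
                    # Found the ")" that closes the opening "(".
                    results.append((m.group("name"), text[start:i + 1]))
                    pos = i + 1
                    break
        else:
            # Parentheses never balanced: skip this header and keep searching.
            pos = m.end()
    return results

# Hypothetical input with nested parentheses.
sample = "%some_hash =\n(\n  a => (1, 2),\n  b => (3, (4)),\n)\nother stuff\n%other = (x)"
for name, body in extract_hashes(sample):
    print(name, len(body))
```

Like `findall` with two groups, this returns a list of `(name, value)` tuples, so it could be called once per file inside the loop over the directory.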
ashwinjv
  • This thing stops when it finds the first closing parenthesis. I should have mentioned this in the example as well... – user775093 Nov 19 '15 at 03:58