Regex to extract multiline hash comments

Question

Currently suffering from writer's block trying to come up with an elegant solution to this problem.

Take the following example:

{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}

From the above, I'd like to extract the code comments together as a group, rather than individually. This grouping would happen if a line was commented right after another line. Comments will always start with whitespace followed by a #.

Example result:

Capture group 1: Some information about field 1\n on multiple lines
Capture group 2: Some more info on a single line

I could step over the lines and evaluate in code, but it would be nice to use a regex if possible. If you feel like a regex is not the correct solution for this problem, please explain your reasons why.

SUMMARY:

Thank you to everyone for submitting various solutions to this problem; this is a prime example of just how helpful the SO community can be. I will be spending an hour of my own time answering other questions to make up for the collective time spent on this.

Hopefully this thread will help others in future too.

It can be done, but the regexp itself would depend on how you want to gather the data. Are you skilled with regexps? In any case, are you asking if regexps are a good way to do it, or HOW to do it? — SebasSBM, May 06 '15 at 20:24
Although I can construct and understand some regex, I wouldn't say it is one of my strong points, for example I don't understand how negation matches work. I'm asking whether or not regex would be a good solution, and if so, what the regex would be. — SleepyCal, May 06 '15 at 20:26
Perhaps you want to fragment the regexp match and store each fragment in its own variable. This can be done in just a few lines. Do you want an example? — SebasSBM, May 06 '15 at 20:29
Sure, any pointers in the right direction would be appreciated. — SleepyCal, May 06 '15 at 20:31

Mazdak · Accepted Answer · 2015-05-06T21:29:42.147

You can use re.findall with the following regex:

>>> import re
>>> m = re.findall(r'\s*#(.*)\s*#(.*)|#(.*)[^#]*', s, re.MULTILINE)
>>> m
[(' Some information about field 1', ' on multiple lines', ''), ('', '', ' Some more info on a single line')]

To print the groups:

>>> for i, groups in enumerate(m):
...     print('group {}:{}'.format(i, " & ".join(g for g in groups if g)))
... 
group 0: Some information about field 1 &  on multiple lines
group 1: Some more info on a single line

For a more general approach that handles more than two consecutive comment lines, you can use itertools.groupby:

s="""{
  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""
from itertools import groupby

# Split the lines into runs of comment / non-comment lines and keep only the comment runs.
comments = [list(run) for is_comment, run in groupby(s.split('\n'), lambda x: x.strip().startswith('#')) if is_comment]

for i, j in enumerate(comments, 1):
    l = [t.strip(' #') for t in j]
    print('group {} :{}'.format(i, ' & '.join(l)))

Result:

group 1 :Some information about field 1 & on multiple lines & threeeeeeeeecomment
group 2 :Some more info on a single line
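
If a regex is still preferred for runs of any length, one possible sketch is to match each whole run of comment lines first and strip the markers in a second pass. The names extract_comment_blocks, BLOCK_RE and MARKER_RE are made up for this illustration, and the sketch assumes "\n" line endings and that a comment line has only spaces or tabs before the #.

```python
import re

# Hypothetical names, not part of the original answer.
# One match = one run of consecutive lines whose first non-blank character is '#'.
BLOCK_RE = re.compile(r'(?:^[ \t]*#.*\n?)+', re.MULTILINE)
# Removes the indentation, the '#', and one optional space or tab from each line.
MARKER_RE = re.compile(r'^[ \t]*#[ \t]?', re.MULTILINE)


def extract_comment_blocks(text):
    """Return one string per run of consecutive '#' comment lines."""
    blocks = []
    for match in BLOCK_RE.finditer(text):
        cleaned = MARKER_RE.sub('', match.group(0))
        blocks.append(cleaned.rstrip('\n'))
    return blocks


sample = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    # threeeeeeeeecomment
    "field1": "XXXXXXXXXX"

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",

    "field3": "#this would be ignored"
  }
}"""

for number, block in enumerate(extract_comment_blocks(sample), 1):
    print('group {}: {!r}'.format(number, block))
```

Using [ \t]* rather than \s* before the # keeps the pattern from reaching across blank lines, so comment runs separated by an empty line stay in separate groups.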

I have absolutely no idea how this works, but it works perfectly. Nice work, thank you for spending time on this, very appreciated. — SleepyCal, May 06 '15 at 20:50
But that would not work if there are three lines of comments in a row, would it? — SanD, May 06 '15 at 20:50
Ah yeah, it breaks if there is more than 2 lines. Is that a fairly easy thing to fix? — SleepyCal, May 06 '15 at 20:56

score 1 · Answer 2 · edited May 23 '17 at 11:52

Let's say, for example, you want to take some specific data from a multiline string on each line with a single regexp (for example, hashtags):

#!/usr/bin/env python
# coding: utf-8

import re

# the regexp isn't 100% accurate, but you'll get the point
# groups followed by '?' match if repeated 0 or 1 times.
regexp = re.compile('^.*(#[a-z]*).*(#[a-z]*)?$')

multiline_string = '''
                     The awesomeness of #MotoGP is legendary. #Bikes rock!
                     Awesome racing car #HeroComesHome epic
'''

iterable_list = multiline_string.splitlines()

for line in iterable_list:
    fragments = regexp.match(line)
    # match() returns None for lines without a '#' (such as the
    # leading blank line), so skip those before calling group().
    if fragments is None:
        continue

    frag_in_str = fragments.group(1)

    # group(2) is optional in the pattern: when it takes no part in
    # the match it returns None rather than raising. IndexError is
    # raised only for a group number the pattern does not define.
    some_other_subpattern = fragments.group(2) or ''

    entire_match = fragments.group(0)

Every group inside parentheses may be extracted this way.
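
As a variation on the group-based approach, when the number of hashtags per line is not known in advance, re.findall can collect every occurrence without optional groups. This is only a sketch: the function name hashtags_per_line is made up for illustration, and treating a hashtag as # followed by \w+ (letters, digits, underscore) is an assumption.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r'#\w+')


def hashtags_per_line(text):
    """Return a list of hashtag lists, one per line that contains any."""
    result = []
    for line in text.splitlines():
        tags = HASHTAG_RE.findall(line)
        if tags:  # skip lines with no hashtag, including blank lines
            result.append(tags)
    return result


multiline_string = '''
The awesomeness of #MotoGP is legendary. #Bikes rock!
Awesome racing car #HeroComesHome epic
'''

for tags in hashtags_per_line(multiline_string):
    print(tags)
```

With no capturing group in the pattern, findall returns the full matched strings, so there is no group index to guard against.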

A good example of negating patterns has been posted here: How to negate specific word in regex?

Thank you for having a shot at this, voted up for the effort. Though it seems that the answer from @kasra does exactly what I was looking for — SleepyCal, May 06 '15 at 20:49
I parsed the regexp rushing, so it was wrong before. I've corrected it — SebasSBM, May 06 '15 at 20:56

dawg · Answer 3 · 2015-05-06T21:23:33.983

You can use a deque to keep the previous and current lines, and add some logic to partition the comments into blocks:

src='''\
{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX",


    # multiple line comments
    # supported
    # as well 
    "field3": "#this would be ignored"

  }
}
'''

from collections import deque
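d = deque([], 2)   # holds the previous and the current (stripped) line
blocks = []
block = []
for line in src.splitlines():
    d.append(line.strip())
    if d[-1].startswith('#'):
        comment = line.partition('#')[2]
        if len(d) == 2 and d[0].startswith('#'):
            block.append(comment)
        else:
            block = [comment]
    elif d[0].startswith('#'):
        blocks.append(block)

# A comment block that runs to the end of the input has no following line to close it.
if d and d[-1].startswith('#'):
    blocks.append(block)

for i, b in enumerate(blocks):
    print('block {}: \n{}'.format(i, '\n'.join(b)))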

Prints:

block 0: 
 Some information about field 1
 on multiple lines
block 1: 
 Some more info on a single line
block 2: 
 multiple line comments
 supported
 as well

Thank you for spending time on this, I've already accepted another answer above but this is a good solution too. Upvoting. — SleepyCal, May 07 '15 at 21:26

score 1 · Answer 4 · answered May 06 '15 at 21:51

It's impossible to do purely with regexes, but you can get away with a one-liner:

import re

str = """{
  "data": {
    # Some information about field 1
    # on multiple lines
    "field1": "XXXXXXXXXX",

    # Some more info on a single line
    "field2": "XXXXXXXXXXX"
    # Some information about field 1
    # on multiple lines
    # Some information about field 1
    # on multiple lines
    "field3": "#this would be ignored"
  }
}"""

rex = re.compile(r"(^(?!\s*#.*?[\r\n]+)(.*?)([\r\n]+|$)|[\r\n]*^\s*#\s*)+", re.MULTILINE)
print(rex.sub("\n", str).strip().split('\n\n'))

Outputs:

['Some information about field 1\non multiple lines', 'Some more info on a single line', 'Some information about field 1\non multiple lines\nSome information about field 1\non multiple lines']

Nice solution, I've already accepted a different answer but I appreciate you spending time on this, upvoting. — SleepyCal, May 07 '15 at 21:26
