Regular Expression help

Question

So I have a string like this:

A!
B!
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>

Is it possible to parse this with a regex so I get this output (a list):

[A, B, C, [D, E], F, [G]]

Basically I'm looking for a way to split the string by ! and by the tag...and the tag part can happen anywhere...and multiple times (but not recursively...meaning a tag within a tag...this doesn't happen). The whole thing seems regular...is this even possible to do with regex?

EDIT: I am using Python

EDIT2: I am only using A, B, C... as a representation...those can be any string made out of letters and numbers

It should be possible. But in what language are you using the regex? Java, Perl, Javascript, etc? — Lukas Eder, Apr 07 '11 at 13:35
It just occurred to me - are you looking for an array output? I thought your sample output was a literal string, but now I think that was meant to represent a nested array. — Justin Morgan - On strike, Apr 07 '11 at 13:51
Obviously, you are looking for a [hidden Scanner functionality of re module](http://code.activestate.com/recipes/457664-hidden-scanner-functionality-in-re-module/) — janislaw, Apr 07 '11 at 14:24
Do you have any particular reason for favouring a regular expression over any different solution? It seems to me that you're deliberately excluding answers which may be more appropriate. — Andrew Aylett, Apr 07 '11 at 15:03
I am not "deliberately excluding" anything...I am open to any "elegant" solution to the problem, since the input can be very long...regex was just the first thing that came to mind... — Veles, Apr 07 '11 at 16:07
Sorry to disappoint...it's not...it's part of a project I'm working on and needed a nice way to parse this kind of input... — Veles, Apr 08 '11 at 10:06
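
For reference, the undocumented re.Scanner class mentioned in the comments could be used roughly as below. This is only a sketch: the token names OPEN, CLOSE and ITEM and the helper scan_to_nested are made up for illustration, and since re.Scanner is not part of the documented API its behaviour may change between Python versions.

```python
import re

# Lexicon: (pattern, action) pairs; an action of None drops the match.
scanner = re.Scanner([
    (r"<tag>", lambda sc, tok: ("OPEN", tok)),
    (r"</tag>", lambda sc, tok: ("CLOSE", tok)),
    (r"\w+!", lambda sc, tok: ("ITEM", tok[:-1])),  # strip the trailing "!"
    (r"\s+", None),                                  # skip newlines and spaces
])

def scan_to_nested(text):
    """Hypothetical helper: turn the token stream into a nested list."""
    tokens, remainder = scanner.scan(text)
    if remainder:
        raise ValueError("unexpected input: %r" % remainder[:20])
    result = []
    current = result          # list that currently receives items
    for kind, value in tokens:
        if kind == "OPEN":
            current = []      # start a sub-list (tags are never nested)
            result.append(current)
        elif kind == "CLOSE":
            current = result  # back to the top-level list
        else:
            current.append(value)
    return result
```

Because tags are never nested, a single "current list" variable is enough and no stack is needed.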

Justin Morgan - On strike · Answer 1 · 2011-04-07T13:53:26.483

1

I don't know Python, but you can do this with three simple regex-replaces (possibly doable as a single regex, but the following should work fine).

Javascript version:

// The g flag makes each replace apply to every match, not only the first one.
str = '[' + str.replace(/!\n/g, ', ').replace(/<[^\/>]*>/g, '[').replace(/<\/[^>]*>/g, ']') + ']';

Hopefully that will be understandable enough to translate to Python.
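
A rough Python translation of the same idea might look like the sketch below. The function name to_bracket_string is made up, and the patterns are adjusted slightly (they also swallow the newlines around the tags) so that the result comes out tidy. Note that this builds a string, not a list.

```python
import re

def to_bracket_string(text):
    """Hypothetical helper: rewrite the input as a bracketed string."""
    out = text.strip()
    out = re.sub(r"<[^/>]*>\n", "[", out)    # opening tag plus newline -> [
    out = re.sub(r"!\n</[^>]*>", "]", out)   # last item's "!" plus closing tag -> ]
    out = re.sub(r"!?\n", ", ", out)         # remaining line breaks -> ", "
    out = re.sub(r"!\Z", "", out)            # drop a trailing "!" at the very end
    return "[" + out + "]"

s = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"
print(to_bracket_string(s))  # [A, B, C, [D, E], F, [G]]
```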

Edit: Are you looking for array output? I thought your sample output was a literal string, but now I think that was meant to represent a nested array.

edited Apr 07 '11 at 13:53

answered Apr 07 '11 at 13:45


Hmm, the brackets of the output were supposed to represent a list (as a data type)...so I want the output to be a nested list! – Veles Apr 07 '11 at 13:48
@Icoo - Sorry, my mistake. In that case I'll have to leave it to someone else who knows Python. – Justin Morgan - On strike Apr 07 '11 at 13:54
I wish he was looking for JSON output :P – Bunny Rabbit Apr 08 '11 at 07:47

MPękalski · Answer 2 · 2011-04-07T17:20:51.227

1

Wouldn't it be easier to just replace <tag> with [ and </tag> with ], and !\n with a comma, and at the end enclose everything in one more pair of []?
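
One way to get an actual list out of this replace approach is to produce JSON text and let the json module parse it. This is just a sketch, assuming the items contain only letters and digits as stated in the question; the name parse_via_json is made up.

```python
import json
import re

def parse_via_json(text):
    """Hypothetical helper: rewrite the input as JSON text, then parse it."""
    # Quote every item line:  A!  ->  "A",
    out = re.sub(r"^(\w+)!$", r'"\1",', text, flags=re.M)
    # Tags become brackets; a closing tag also needs a separating comma.
    out = out.replace("<tag>", "[").replace("</tag>", "],")
    out = "[" + out + "]"
    # JSON forbids trailing commas, so drop any comma directly before a "]".
    out = re.sub(r",\s*\]", "]", out)
    return json.loads(out)

s = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"
print(parse_via_json(s))  # ['A', 'B', 'C', ['D', 'E'], 'F', ['G']]
```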

edited Apr 07 '11 at 17:20

answered Apr 07 '11 at 14:09


That would be a more "hackish" way to do this...I would rather like to do this with regular expressions for clarity... – Veles Apr 07 '11 at 14:31
@Icoo While I do love regular expressions, saying that I want to use them for clarity would be a bit of a stretch for me. Most times I find using split and replace gives more readable code. – Lauritz V. Thaulow Apr 07 '11 at 14:42
But we'd get a string this way, right? What he wants is a list. – Bunny Rabbit Apr 08 '11 at 07:37

score 0 · Answer 3 · answered Apr 07 '11 at 13:48

Yes it is possible.

To generate a flat array, your regex would be quite hairy, involving backtracking. It would be very similar to a regex for splitting a CSV file while allowing quoted strings, where the <tag> / </tag> markers take the place of the quote marks, and the ! takes the place of the comma.

But you asked for a nested array structure, and in fact that makes things easier.

To get the nested array structure you need two separate split operations, which means two separate regex operations. You could do the first one as described above, but the two-step approach is actually simpler: the first pass splits out the sections enclosed in <tag>s, and since you say tags are never nested, no complex regex backtracking is needed.
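
The two-pass idea could be sketched in Python as follows. The function name two_pass_split is made up, and the second pass uses re.findall rather than a literal split, which is an assumption about what "split" should mean here.

```python
import re

def two_pass_split(text):
    """Hypothetical helper: first split on tag blocks, then pick out items."""
    result = []
    # First pass: with one capture group, re.split returns the text outside
    # the tags at even indices and the tag bodies at odd indices.
    pieces = re.split(r"<tag>(.*?)</tag>", text, flags=re.S)
    for index, piece in enumerate(pieces):
        # Second pass: extract each item, dropping the trailing "!".
        items = re.findall(r"(\w+)!", piece)
        if index % 2:
            result.append(items)   # tag body -> nested list
        else:
            result.extend(items)   # top-level text -> flat items
    return result

s = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"
print(two_pass_split(s))  # ['A', 'B', 'C', ['D', 'E'], 'F', ['G']]
```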

Hope that helps.

Lauritz V. Thaulow · Accepted Answer · 2011-04-08T07:28:08.063

from collections import deque

s = "A!\nB!\nC!\n<tag>\nD!\nE!\n</tag>\nF!\n<tag>\nG!\n</tag>"

def parse(parts):
    # Accept either the raw string or a deque of lines (used when recursing).
    if isinstance(parts, str):
        parts = deque(parts.split("\n"))
    ret = []
    while parts:
        part = parts.popleft()
        if part.endswith("!"):
            # Ordinary item: strip the trailing "!". Blank lines are skipped.
            ret.append(part[:-1])
        elif part == "<tag>":
            # Recurse; the nested call consumes lines up to the matching </tag>.
            ret.append(parse(parts))
        elif part == "</tag>":
            return ret
    return ret

print(parse(s))

I use a deque for speed because pop(0) would be very slow, and reversing the list and using pop() would make the function harder to read and understand.
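
To illustrate the point about deque: popping from the left of a deque and calling pop(0) on a list give the same items in the same order, but the list version has to shift every remaining element on each call. A tiny sketch (the sample input is made up):

```python
from collections import deque

lines = "A!\n<tag>\nB!\n</tag>".split("\n")

queue = deque(lines)
as_list = list(lines)  # copy, so the original list stays intact

# deque.popleft() is O(1) per call.
from_deque = [queue.popleft() for _ in range(len(lines))]
# list.pop(0) is O(n) per call because the remaining items are shifted left.
from_list = [as_list.pop(0) for _ in range(len(lines))]

print(from_deque == from_list)  # True
```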

I dare anyone to create a regexp doing the same, while also improving clarity!

(BTW, I think you could also use the pyparsing module to solve this problem, since it supports recursion.)

EDIT: Changed function to expect either string or deque as argument, simplifying invocation.

OK I decided to use this approach! — Veles, Apr 08 '11 at 10:28

score 0 · Answer 5 · edited May 23 '17 at 12:11

0

Here is my solution to the problem. It uses regexp and some operations on the list.

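import re

my_str = "A!\nB!\n<tag>\nC!\n</tag>\nD!\nE!\n<tag>\nF!\nG!\n</tag>\nH!\n"

# Collect the text outside the tags: before the first <tag>, between a </tag>
# and the next <tag>, and after the last </tag>.
x = (re.findall("^(?:.|\n)+?(?=\n<tag>)", my_str)
     + re.findall("(?<=</tag>\n)(?:.|\n)+?(?=\n<tag>\n)", my_str)
     + re.findall("(?<=>\n)(?:[^>]|\n)+(?=\n)$", my_str))

y = []
for elem in x:
    y += elem.split('\n')

# Collect the contents of each <tag>...</tag> block as a nested list.
x = re.findall("((?<=<tag>\n)(?:.|\n)+?(?=\n</tag>\n))", my_str)
for elem in x:
    y.append(elem.split('\n'))

print(y)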

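It produces the output (note that the nested lists end up after the top-level items, so the original order is not preserved):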

['A!', 'B!', 'D!', 'E!', 'H!', ['C!'], ['F!', 'G!']]

I didn't have much time to test it, though.

I do not think there is an easier way of doing this, as there is no recursive regexp in Python, see SO thread.

Have a good night (my time-zone). ;)

Note: it could probably be made nicer by combining everything into one regexp using xor (see XOR in Regexp), but I think it would lose readability.
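
For what it's worth, a single pattern with plain alternation (|) rather than xor might look like the sketch below. It is not the code from this answer; the names TOKEN and parse_single_pass are made up, and it assumes items consist of letters and digits only. Unlike the version above it keeps the items in their original order and drops the "!".

```python
import re

# Either a whole <tag>...</tag> block (group 1) or a single item (group 2).
TOKEN = re.compile(r"<tag>(.*?)</tag>|(\w+)!", re.S)

def parse_single_pass(text):
    """Hypothetical helper: walk the matches left to right."""
    result = []
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            # Tag block: extract its items into a nested list.
            result.append(re.findall(r"(\w+)!", match.group(1)))
        else:
            result.append(match.group(2))
    return result

my_str = "A!\nB!\n<tag>\nC!\n</tag>\nD!\nE!\n<tag>\nF!\nG!\n</tag>\nH!\n"
print(parse_single_pass(my_str))
# ['A', 'B', ['C'], 'D', 'E', ['F', 'G'], 'H']
```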


answered Apr 07 '11 at 23:04

MPękalski


I have just noticed that the solution in your example did not include `!`. But that is a minor flaw. – MPękalski Apr 07 '11 at 23:06
Covering the built-in `str` type with your own global variable might not be such a good idea. Also, the [Python Style Guide](http://www.python.org/dev/peps/pep-0008/) says "_Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), booleans (and, or, not)_", and "_Use spaces around arithmetic operators_". – Lauritz V. Thaulow Apr 08 '11 at 07:42
@lazyr I have changed the name of the variable according to your comment. – MPękalski Apr 08 '11 at 08:57

score 0 · Answer 6 · answered Apr 08 '11 at 18:46

If all the conditions as I understood them hold (for example: there is no character on a line before '<tag>' or before '</tag>', right?), the following code does the job, I think:

import re

RE = ('(\A\n*<tag>\n+)',
      '(\A\n*)',
      '(!\n*</tag>(?!\n*\Z)\n*)',
      '(!\n*</tag>\n*\Z)',
      '(!\n*<tag>\n+)',
      '(!\n*\Z)',
      '(!\n+)')

pat = re.compile('|'.join(RE))

def repl(mat, d = {1:"[['", 2:"['", 3:"'],'", 4:"']]", 5:"',['", 6:"']", 7:"','"}):
    return d[mat.lastindex]

ch =  .... # a string to parse
dh = eval(pat.sub(repl,ch))
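
Since eval will execute whatever ends up in the string, the last step could instead use ast.literal_eval, which only accepts Python literals. The sketch below reuses the same patterns written as raw strings; the names REPL and parse_literal are made up, and it still assumes the items contain no quote characters.

```python
import ast
import re

RE = (r'(\A\n*<tag>\n+)',
      r'(\A\n*)',
      r'(!\n*</tag>(?!\n*\Z)\n*)',
      r'(!\n*</tag>\n*\Z)',
      r'(!\n*<tag>\n+)',
      r'(!\n*\Z)',
      r'(!\n+)')

pat = re.compile('|'.join(RE))

# Replacement text for each alternative, keyed by its group number.
REPL = {1: "[['", 2: "['", 3: "'],'", 4: "']]", 5: "',['", 6: "']", 7: "','"}

def parse_literal(ch):
    """Hypothetical helper: same substitution, parsed without eval."""
    literal = pat.sub(lambda mat: REPL[mat.lastindex], ch)
    return ast.literal_eval(literal)

print(parse_literal("A!\nB!\n<tag>\nC!\n</tag>\nD!\n"))
# ['A', 'B', ['C'], 'D']
```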

applying:

ch1 = '''

A!
B!
C!
<tag>
D!


E!
</tag>
F!
<tag>
G!
</tag>


'''

ch2 = '''A!
B!
C!



<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>

H!

'''

ch3 = '''


A!
B!
C!
<tag>
D!
E!
</tag>
Fududu!gutuyu!!
<tag>
G!

</tag>

H!'''

ch4 = '''<tag>
A!
B!

</tag>
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!

</tag>

H!'''

import re

RE = ('(\A\n*<tag>\n+)',
      '(\A\n*)',
      '(!\n*</tag>(?!\n*\Z)\n*)',
      '(!\n*</tag>\n*\Z)',
      '(!\n*<tag>\n+)',
      '(!\n*\Z)',
      '(!\n+)')

pat = re.compile('|'.join(RE))

def repl(mat, d = {1:"[['", 2:"['", 3:"'],'", 4:"']]", 5:"',['", 6:"']", 7:"','"}):
    return d[mat.lastindex]


for ch in (ch1,ch2,ch3,ch4):
    print ch
    dh = eval(pat.sub(repl,ch))
    print dh,'\n',type(dh)
    print '\n\n============================='

result

>>> 


A!
B!
C!
<tag>
D!


E!
</tag>
F!
<tag>
G!
</tag>



['A', 'B', 'C', ['D', 'E'], 'F', ['G']] 
<type 'list'>


=============================
A!
B!
C!



<tag>
D!
E!
</tag>
F!
<tag>
G!
</tag>

H!


['A', 'B', 'C', ['D', 'E'], 'F', ['G'], 'H'] 
<type 'list'>


=============================



A!
B!
C!
<tag>
D!
E!
</tag>
Fududu!gutuyu!!
<tag>
G!

</tag>

H!
['A', 'B', 'C', ['D', 'E'], 'Fududu!gutuyu!', ['G'], 'H'] 
<type 'list'>


=============================
<tag>
A!
B!

</tag>
C!
<tag>
D!
E!
</tag>
F!
<tag>
G!

</tag>

H!
[['A', 'B'], 'C', ['D', 'E'], 'F', ['G'], 'H'] 
<type 'list'>


=============================
>>>
