Advanced string parsing in Python

Question

I've encountered a problem while trying to parse a complicated string. The string is really long and full of patterns, but let's focus on what I need to extract (and only that).

A substring from the huge string is:

... [span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" ...

Now I want to extract the review text, that is, the part between [/span] and [div. The pattern is [span class=...]...[/span] desired text [div ...], and it repeats throughout the whole string.

How exactly do I take this specific text from the whole string and write it line after line?

Do you really want to parse this with regex? It looks like it's just HTML with the angle brackets changed into square brackets and the quotes escaped, and [the same reasons that make regex bad for HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) will almost certainly make regex bad for this language. — abarnert, May 06 '15 at 23:31
Actually, from a comment, it sounds like what you have really _is_ just HTML. — abarnert, May 06 '15 at 23:32

Wiktor Stribiżew · Accepted Answer · 2015-05-06T23:01:42.520

This pattern should fetch the string; just grab the Group 1 value:

r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Or a more generic one that does not check the span's class attribute:

r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Sample code at IDEONE:

import re
p = re.compile(r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b')
test_str = "[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" "
print(p.search(test_str).group(1))

Output:

I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server.

EDIT: Since the [s and ]s are in fact <s and >s, here is an updated regex and code:

import re
p = re.compile(r'<span\b[^>]*>[^<]*</span>\s*([^<]*)<div\b')
test_str = "<span class=\"review-title\">Wont open</span> I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. <div class=\"review-link\" "
print([x.group(1) for x in p.finditer(test_str)])

A more specific regex to account for the class attribute:

p = re.compile(r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b')
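
To address the follow-up about collecting every match rather than only the first, here is a minimal sketch that runs the class-specific pattern over a string with two reviews and prints the results one per line. It assumes Python 3 (so the pattern is a plain raw string); the sample input `html` and the helper name `extract_reviews` are made up for illustration.

```python
import re

# Class-specific pattern from the answer, as a Python 3 raw string.
REVIEW_RE = re.compile(
    r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>'
    r'\s*([^<]*)<div\b'
)


def extract_reviews(text):
    """Return the text that follows each review-title span, stripped."""
    return [m.group(1).strip() for m in REVIEW_RE.finditer(text)]


# Hypothetical sample input containing two reviews.
html = (
    '<span class="review-title">Wont open</span> The game wont open. '
    '<div class="review-link"></div>'
    '<span class="review-title">Great</span> Works fine on my phone. '
    '<div class="review-link"></div>'
)

reviews = extract_reviews(html)
print("\n".join(reviews))  # one review per line
```

Because `[^<]*` cannot cross a tag, a review body that itself contains markup (a `<br>`, for example) would not be matched by this pattern.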

hey it works great but just a little thing im having trouble to solve, the original [, ] are <, >. couldnt write it in the post. can you rewrite the regex please? — Eran, May 06 '15 at 22:50
Works perfectly, but when I run it on the whole text, it gives me only the first result. I need to take all those texts and write them into a string or a list of strings. — Eran, May 06 '15 at 23:00
Use `finditer`, I have updated the **EDIT** section and the links to the corresponding demo programs. — Wiktor Stribiżew, May 06 '15 at 23:00
:-D Always glad to help people out. Time to go to bed for me. Happy programming! — Wiktor Stribiżew, May 06 '15 at 23:07

score 1 · Answer 2 · edited May 23 '17 at 12:28

From your comments ("im having trouble to solve, the original [, ] are <, >"), it's pretty clear that what you have is HTML.

Do not try to parse HTML with regex.

What you want here is an HTML parser. For example:

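from bs4 import BeautifulSoup

# huge_string is the full HTML text from the question
soup = BeautifulSoup(huge_string, 'html.parser')
for span in soup.find_all('span', class_='review-title'):
    # the review text is the node right after the title span
    text = span.next_sibling
    print(text)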

Even if what you have is HTML escaped in some way (backslash-escaped quotes, angle brackets turned into square brackets, etc.), you still don't want to parse it with regex. In that case, at most, you might want to use a regex as the preprocessor to turn it back into HTML to feed to an HTML parser.
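
As a rough sketch of that preprocessing idea: the code below assumes the escaping is exactly what the question shows (backslash-escaped double quotes, and square brackets used only as tag delimiters), turns it back into HTML with one regex substitution, and then hands the result to a parser. To stay runnable without extra installs it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup; the names `to_html` and `ReviewTextParser` and the sample input are hypothetical.

```python
import re
from html.parser import HTMLParser


def to_html(escaped):
    """Undo the assumed escaping: unescape quotes, then [tag] -> <tag>."""
    unquoted = escaped.replace('\\"', '"')
    # Assumption: square brackets appear only as tag delimiters.
    return re.sub(r'\[(/?\w[^\]]*)\]', r'<\1>', unquoted)


class ReviewTextParser(HTMLParser):
    """Collect the text that directly follows each span.review-title."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_title = False  # inside <span class="review-title">
        self._capture = False   # right after that span was closed

    def handle_starttag(self, tag, attrs):
        self._capture = False   # any new tag ends the review text
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "review-title" in classes:
            self._in_title = True

    def handle_endtag(self, tag):
        # Start capturing only when the title span itself closes.
        self._capture = tag == "span" and self._in_title
        if self._capture:
            self._in_title = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.reviews.append(data.strip())


# Hypothetical input in the escaped form shown in the question.
escaped = (
    '[span class=\\"review-title\\"]Wont open[/span] The game wont open. '
    '[div class=\\"review-link\\"][/div]'
    '[span class=\\"review-title\\"]Great[/span] Works fine. '
    '[div class=\\"review-link\\"][/div]'
)

parser = ReviewTextParser()
parser.feed(to_html(escaped))
parser.close()
print(parser.reviews)
```

The regex here only restores the markup; deciding what counts as a review is left to the parser, which is the division of labour this answer argues for.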

Andie2302 · Answer 3 · score 0 · 2015-05-06T22:47:36.733

It seems that you need just this regex:

(?<=\[/span\])[\s\S]*?(?=\[div)
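
A small sketch of how this lookaround pattern could be used with `re.findall`, assuming the bracketed form from the question. The sample `text` is made up, and note that the pattern fires after every `[/span]`, not only after a review-title span.

```python
import re

# Lookaround pattern from this answer: lazily match everything between
# "[/span]" and the next "[div".
BETWEEN_RE = re.compile(r'(?<=\[/span\])[\s\S]*?(?=\[div)')

# Hypothetical sample input with two reviews.
text = (
    '[span class="review-title"]Wont open[/span] The game wont open. '
    '[div class="review-link"] ... '
    '[span class="review-title"]Great[/span] Works fine. '
    '[div class="review-link"] ...'
)

# With no capture groups, findall returns the whole of each match.
reviews = [chunk.strip() for chunk in BETWEEN_RE.findall(text)]
for review in reviews:
    print(review)
```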

edited May 06 '15 at 22:47

answered May 06 '15 at 22:42
