0

I've run into a problem while trying to parse a complicated string. The string is very long and full of patterns, but let's focus only on the part I need to extract.

A substring from the huge string is:

... [span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" ...

Now I want to extract the text that sits between the closing [/span] and the following [div. The pattern is [span class=...]...[/span] desired text [div ...], and it repeats throughout the whole string.

How exactly do I take this specific text from the whole string and write it line after line?

Jay Kominek
Eran
  • Do you really want to parse this with regex? It looks like it's just HTML with the angle brackets changed into square brackets and the quotes escaped, and [the same reasons that make regex bad for HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) will almost certainly make regex bad for this language. – abarnert May 06 '15 at 23:31
  • Actually, from a comment, it sounds like what you have really _is_ just HTML. – abarnert May 06 '15 at 23:32

3 Answers

2

This pattern should fetch the string; just grab the Group 1 value:

r'\[span\b[^]]*class=[\\"\']*review-title\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Or a more generic one that does not check for class="review-title":

r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b'

Sample code at IDEONE:

import re

# Python 3: use a plain raw string (the ur'' prefix only exists in Python 2)
p = re.compile(r'\[span\b[^]]*][^[]*\[/span\]\s*([^[]*)\[div\b')
test_str = "[span class=\"review-title\"]Wont open[/span] I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. [div class=\"review-link\" "
# Group 1 holds the text between [/span] and [div
print(p.search(test_str).group(1))

Output:

I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server.

EDIT: Since the `[`s and `]`s are in fact `<`s and `>`s, here is an updated regex and code:

import re

# Python 3: use a plain raw string (the ur'' prefix only exists in Python 2)
p = re.compile(r'<span\b[^>]*>[^<]*</span>\s*([^<]*)<div\b')
test_str = "<span class=\"review-title\">Wont open</span> I have the GS5 and the game wont open. I got this game when i got the first droid. The fact that people havent been able to play since almost 2013 is bull. Please fix this or there isnt a point in even having the game on the server. <div class=\"review-link\" "
# finditer yields every match, not just the first one
print([x.group(1) for x in p.finditer(test_str)])

A more specific regex to account for the class attribute:

p = re.compile(r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b')
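To collect every review body and write them out one per line, the class-aware pattern can be run with `findall`. This is only a minimal Python 3 sketch, assuming the real input is HTML as described in the EDIT; `extract_reviews` is a hypothetical helper name, and the second review in `sample` is invented to show repeated matches:

```python
import re

# Class-aware pattern from this answer, with a plain r'' prefix for Python 3.
REVIEW_RE = re.compile(
    r'<span\b[^>]*class\s*=\s*[\\\'"]*review-title[^>]*>[^<]*</span>\s*([^<]*)<div\b'
)


def extract_reviews(html):
    """Return the text between each review-title span and the following <div."""
    # With a single capture group, findall returns the group contents directly.
    return [body.strip() for body in REVIEW_RE.findall(html)]


# Hypothetical input: the second review is made up for illustration.
sample = (
    '<span class="review-title">Wont open</span> I have the GS5 and the game wont open. '
    '<div class="review-link"></div>'
    '<span class="review-title">Crashes</span> It crashes on start. '
    '<div class="review-link"></div>'
)

reviews = extract_reviews(sample)
print("\n".join(reviews))  # one review per line
```

Because the capture group is `[^<]*`, a review body that itself contains markup (a `<br>`, for instance) would not match, which is one of the reasons a real HTML parser is usually the safer choice.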
Wiktor Stribiżew
  • hey it works great but just a little thing im having trouble to solve, the original [, ] are <, >. couldnt write it in the post. can you rewrite the regex please? – Eran May 06 '15 at 22:50
  • works perfect, but when i run it on the whole text, it just gives me the first result.. i need to take all those texts and write them into a string or list of strings – Eran May 06 '15 at 23:00
  • Use `finditer`, I have updated the **EDIT** section and the links to the corresponding demo programs. – Wiktor Stribiżew May 06 '15 at 23:00
  • :-D Always glad to help people out. Time to go to bed for me. Happy programming! – Wiktor Stribiżew May 06 '15 at 23:07
1

From your comments ("im having trouble to solve, the original [, ] are <, >"), it's pretty clear that what you have is HTML.

Do not try to parse HTML with regex.

What you want here is an HTML parser. For example:

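from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(huge_string, 'html.parser')
# class is a reserved word in Python, so bs4 takes the class_ keyword instead
for span in soup.find_all('span', class_='review-title'):
    text = span.next_sibling  # the text node right after </span>
    print(text)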

Even if what you have is HTML escaped in some way (backslash-escaped quotes, angle brackets turned into square brackets, etc.), you still don't want to parse it with regex. In that case, at most, you might want to use a regex as the preprocessor to turn it back into HTML to feed to an HTML parser.
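As a rough illustration of that preprocessing idea, the sketch below first undoes the escaping with a small regex and then hands the result to a parser. It uses the standard library's `html.parser.HTMLParser` rather than BeautifulSoup so that it runs without extra installs. The escaping scheme (backslashed quotes, square brackets for tags), the closed `[div]` in the sample, and the names `unescape_markup` and `ReviewText` are all assumptions made for the example:

```python
import re
from html.parser import HTMLParser


def unescape_markup(text):
    """Undo the assumed escaping: backslashed quotes and [tag] instead of <tag>."""
    text = text.replace('\\"', '"')
    # Only rewrite brackets that look like tags, e.g. [span ...] or [/span].
    return re.sub(r'\[(/?[A-Za-z][^\]]*)\]', r'<\1>', text)


class ReviewText(HTMLParser):
    """Collect the text node that directly follows each review-title span."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._in_title = False  # currently inside a review-title span
        self._capture = False   # the span just closed; next text is the review

    def handle_starttag(self, tag, attrs):
        self._capture = False   # any new tag ends the review text
        # Exact match on the class value; multi-class spans are not handled.
        if tag == 'span' and ('class', 'review-title') in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        self._capture = tag == 'span' and self._in_title
        if tag == 'span':
            self._in_title = False

    def handle_data(self, data):
        if self._capture and data.strip():
            self.reviews.append(data.strip())
            self._capture = False


# Hypothetical sample in the escaped form shown in the question.
raw = ('[span class=\\"review-title\\"]Wont open[/span] I have the GS5 and the '
       'game wont open. [div class=\\"review-link\\"][/div]')

parser = ReviewText()
parser.feed(unescape_markup(raw))
parser.close()  # flush any buffered text
print("\n".join(parser.reviews))
```

The regex here only restores the markup; deciding what counts as a review is left to the parser's tag events.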

Community
abarnert
0

It seems that you need just this regex:

(?<=\[/span\])[\s\S]*?(?=\[div)
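One possible way to try this lookaround pattern in Python 3 is shown below. It is a sketch against the square-bracket form from the question; the second review in `sample` is invented to show that `findall` returns every occurrence:

```python
import re

# The lookbehind is fixed-width, so Python's re module accepts it.
pattern = re.compile(r'(?<=\[/span\])[\s\S]*?(?=\[div)')

# Hypothetical input in the bracketed form; the second review is made up.
sample = (
    '[span class=\\"review-title\\"]Wont open[/span] The game wont open. '
    '[div class=\\"review-link\\"][/div]'
    '[span class=\\"review-title\\"]Slow[/span] It loads slowly. '
    '[div class=\\"review-link\\"][/div]'
)

# No capture groups, so findall returns the whole match each time.
matches = [m.strip() for m in pattern.findall(sample)]
print("\n".join(matches))
```

Note that this pattern matches after any `[/span]`, not only the review-title one, so it relies on the input having no other spans followed by a `[div`.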
Andie2302