1

I have written the following regex, but it's not working. Can you please help me? Thank you :-)

track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''
rx = "<p><\/p><p>Artist\(s\): (.*?)<br\/>Music: (.*?)<br\/><\/p>"
m = re.search(rx, track_desc)

Output should be:

Artist(s) David
Music: Ramana Gogula
  • Seriously prefer an alternative to regex. – FailedDev Nov 17 '11 at 15:23
  • The obligatory reference is here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – bgporter Nov 17 '11 at 15:24
  • I suppose with such badly formatted HTML, even a parser isn't going to help you too much, although you may as well use one at least for extracting the text from the HTML. – Acorn Nov 17 '11 at 15:40

3 Answers

1

You were ignoring the whitespace:

<p>[\s\n\r]*Artist\(s\)[\s\n\r]*(.*?)[\s\n\r]*:[\s\n\r]*<br/>[\s\n\r]*Music:[\s\n\r]*(.*?)<br/>[\s\n\r]*</p>

Output is:

[1] => "David"
[2] => "Ramana Gogula"

(Note that your regex didn't capture the "Artist(s)" and "Music:" prefixes either.)
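For Python specifically, here is a minimal sketch of how the pattern above could be applied. It assumes the `track_desc` string from the question, shortens `[\s\n\r]` to the equivalent `\s`, and uses the illustrative variable names `artist` and `music`, which are not in the original. Note that `re.search` returns a match object (or `None`), so the captured text has to be read from it with `.groups()` or `.group(n)`.

```python
import re

# Input copied from the question.
track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''

# Same pattern as above, with [\s\n\r] written as \s (which already covers \n and \r).
rx = r'<p>\s*Artist\(s\)\s*(.*?)\s*:\s*<br/>\s*Music:\s*(.*?)<br/>\s*</p>'

m = re.search(rx, track_desc)
if m:
    # groups() returns the captured substrings as a tuple.
    artist, music = m.groups()
    print("Artist(s) %s" % artist)
    print("Music: %s" % music)
else:
    print("no match")
```

Printing `m` itself only shows the match object's repr, not the captured strings.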


However, for production code I would not rely on such a rather clumsy regex (or on such clumsily formatted HTML source).

Seriously though, ditch the idea of using regex for this if you aren't in the slightest familiar with regex (which is what it looks like). You're using the wrong tool and a badly formatted data source. Parsing HTML with regex is wrong in 9 out of 10 cases (see @bgporter's comment link) and doomed to fail. Apart from that, HTML is hardly ever an appropriate data source (unless there really, really is no alternative source).
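As one possible regex-free route, the text can be pulled out with an HTML parser first. The sketch below uses the standard library's `html.parser.HTMLParser` (Python 3); the `TextCollector` class is a hypothetical helper written for this illustration, and the input is the `track_desc` string from the question. It is only a sketch under the assumption that the two fields always appear as separate text nodes, as they do here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the non-empty text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags; skip whitespace-only nodes.
        text = data.strip()
        if text:
            self.chunks.append(text)


# Input copied from the question.
track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''

collector = TextCollector()
collector.feed(track_desc)
collector.close()

for chunk in collector.chunks:
    # Drop the stray trailing colon after the artist name.
    print(chunk.rstrip(':'))
```

The parser takes care of tags, attributes and whitespace between elements, so the remaining string handling stays trivial.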

Regexident
  • one big problem is that he's trying to match `Artist(s): David`..., while his source has the text `Artist(s) David:`... – Code Jockey Nov 17 '11 at 15:25
  • @Regexident Thanks, but it is displaying something like this: `<_sre.SRE_Match object at 0x01FFD4E8>` –  Nov 17 '11 at 15:26
  • @CodeJockey: Yes, absolutely. – Regexident Nov 17 '11 at 15:27
  • @no_access: seriously, ditch the idea of using regex for this if you aren't in the slightest familiar with regex (which is what it looks like). You're using the wrong tool and a badly formatted data source. – Regexident Nov 17 '11 at 15:29
1
import lxml.html as lh
import re

track_desc = '''
<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>
'''

tree = lh.fromstring(track_desc)

# Run the regex on the extracted text rather than on the raw HTML.
print(re.findall(r'Artist\(s\) (.+):\s*\nMusic: (.*\w)', tree.text_content()))
Acorn
0

I see a few errors:

  • the pattern does not allow for the line breaks in the input: every newline has to be matched explicitly, for example with \s (re.MULTILINE does not help here, since it only changes where ^ and $ match)
  • spaces are not taken into account
  • in the source, "Artist(s)" is not followed by ":"; the colon comes after the name

As the web page is rather strangely formatted, relying on a regex is error-prone, and I wouldn't advise using it extensively.

That said, the following seems to work:

rx = r'Artist(?:\(s\))?\s+(.*?)<br/>\s+Music:\s*(.*?)<br'
print("Art... : %s && Mus... : %s" % re.search(rx, track_desc).groups())
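A side note on the `re.MULTILINE` flag mentioned in this answer: the flag only changes how `^` and `$` behave, while `\s` matches a newline with or without it. The small sketch below illustrates this; the `text` sample is made up for the demonstration.

```python
import re

# Made-up two-line sample, just for this demonstration.
text = "first line\nsecond line"

# Without the flag, ^ matches only at the very start of the string.
print(re.findall(r'^\w+', text))                      # ['first']

# With re.MULTILINE, ^ also matches right after each newline.
print(re.findall(r'^\w+', text, flags=re.MULTILINE))  # ['first', 'second']

# \s matches the newline regardless of the flag.
print(re.search(r'line\s+second', text) is not None)  # True
```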
Bruce