Extract artist and music from text (regex)

Question

I have written the following regex, but it's not working. Can you please help me? Thank you :-)

track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''
rx = "<p><\/p><p>Artist\(s\): (.*?)<br\/>Music: (.*?)<br\/><\/p>"
m = re.search(rx, track_desc)

Output should be:

Artist(s) David
Music: Ramana Gogula

The obligatory reference is here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — bgporter, Nov 17 '11 at 15:24
I suppose with such badly formatted HTML, even a parser isn't going to help you too much, although you may as well use one at least for extracting the text from the HTML. — Acorn, Nov 17 '11 at 15:40

Regexident · Accepted Answer · 2011-11-17T15:31:46.547

You were ignoring the whitespace:

<p>[\s\n\r]*Artist\(s\)[\s\n\r]*(.*?)[\s\n\r]*:[\s\n\r]*<br/>[\s\n\r]*Music:[\s\n\r]*(.*?)<br/>[\s\n\r]*</p>

Output is:

[1] => "David"
[2] => "Ramana Gogula"

(Note that your regex didn't capture the Artist(s) and Music: prefixes either.)
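For the Python code in the question, a minimal sketch of applying this pattern could look like the following. It assumes the `track_desc` string from the question and shortens `[\s\n\r]*` to `\s*`, since `\s` already covers `\n` and `\r`. `re.search` returns a match object (or `None` when nothing matches), so the captured values have to be read with `.groups()` or `.group(n)` rather than by printing the object itself:

```python
import re

track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''

# Same pattern as above, with the whitespace classes shortened to \s*
rx = r'<p>\s*Artist\(s\)\s*(.*?)\s*:\s*<br/>\s*Music:\s*(.*?)<br/>\s*</p>'

m = re.search(rx, track_desc)
if m is None:
    print("no match")
else:
    # groups() returns the captured substrings as a tuple
    artist, music = m.groups()
    print("Artist(s) %s" % artist)
    print("Music: %s" % music)
```

Checking for `None` first avoids an `AttributeError` when the page layout changes and the pattern stops matching.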

However, for production code I would not rely on such a clumsy regex (or on such clumsily formatted HTML source).

Seriously though, ditch the idea of using regex for this if you aren't the slightest bit familiar with regex (which is how it looks). You're using the wrong tool and a badly formatted data source. Parsing HTML with regex is wrong in 9 out of 10 cases (see @bgporter's comment link) and doomed to fail. Apart from that, HTML is hardly ever an appropriate data source (unless there really is no alternative source).
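As one possible illustration of the parser route, here is a minimal sketch that uses only Python's standard-library `html.parser` module. The class name `TextCollector` and the `fields` dictionary are hypothetical names introduced for this example, and the prefix handling assumes the text looks exactly like the sample in the question:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


track_desc = '''<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
            <p>
            </p>
            <p> Artist(s) David: <br/>
              Music: Ramana Gogula<br/>
            </p>'''

collector = TextCollector()
collector.feed(track_desc)
collector.close()

# Plain string handling on the extracted text; no regex involved
fields = {}
for chunk in collector.chunks:
    if chunk.startswith('Artist(s)'):
        fields['artist'] = chunk[len('Artist(s)'):].strip(' :')
    elif chunk.startswith('Music:'):
        fields['music'] = chunk[len('Music:'):].strip()

print(fields)
```

The parser takes care of tags and whitespace between them, so the remaining work is ordinary string handling on two short lines of text.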

edited Nov 17 '11 at 15:31

answered Nov 17 '11 at 15:16

one big problem is that he's trying to match `Artist(s): David`..., while his source has the text `Artist(s) David:`... – Code Jockey Nov 17 '11 at 15:25
@Regexident Thanks, but it is displaying something like this: `<_sre.SRE_Match object at 0x01FFD4E8>` – Nov 17 '11 at 15:26
@CodeJockey: Yes, absolutely. – Regexident Nov 17 '11 at 15:27
@no_access: seriously, ditch the idea of using regex for this if you aren't the slightest bit familiar with regex (which is how it looks). You're using the wrong tool and a badly formatted data source. – Regexident Nov 17 '11 at 15:29

score 1 · Answer 2 · answered Nov 17 '11 at 15:36

import lxml.html as lh
import re

track_desc = '''
<img src="http://images.raaga.com/catalog/cd/A/A0000102.jpg" align="right" border="0" width="100" height="100" vspace="4" hspace="4" />
<p>
</p>
<p> Artist(s) David: <br/>
Music: Ramana Gogula<br/>
</p>
'''

tree = lh.fromstring(track_desc)

# text_content() drops the tags, so the regex only has to handle plain text
print(re.findall(r'Artist\(s\) (.+):\s*\nMusic: (.*\w)', tree.text_content()))

score 0 · Answer 3 · answered Nov 17 '11 at 15:35

I see a few errors:

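The regex does not span multiple lines: the pattern must account for the line breaks in the text (the code below passes flags=re.MULTILINE, although that flag only affects ^ and $; it is \s that matches the newlines).
Whitespace is not taken into account.
In the text, Artist(s) is not directly followed by a colon.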
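A small sketch of how Python treats line breaks in a pattern may help with the first point. It uses a shortened, made-up `text` string rather than the full HTML. In Python's `re` module, `\s` matches newlines without any flag, whereas `re.MULTILINE` only changes where `^` and `$` are allowed to match:

```python
import re

# Shortened stand-in for the text in the question (an assumption for brevity)
text = "Artist(s) David: \nMusic: Ramana Gogula"

# \s* consumes the space and the newline; no flag is needed
crosses_lines = re.search(r'David:\s*Music', text) is not None

# Without the flag, ^ matches only at the start of the string
without_flag = re.findall(r'^\w+', text)

# With re.MULTILINE, ^ also matches at the start of each line
with_flag = re.findall(r'^\w+', text, flags=re.MULTILINE)

print(crosses_lines)  # True
print(without_flag)   # ['Artist']
print(with_flag)      # ['Artist', 'Music']
```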

As the web page is formatted rather strangely, relying on a regex may be error-prone, and I wouldn't advise using it extensively.

That said, the following seems to work:

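# Raw string avoids invalid-escape warnings; the colon after the artist name is kept out of group 1
rx = r'Artist(?:\(s\))?\s+(.*?):\s*<br/>\s+Music:\s*(.*?)<br'
print("Art... : %s && Mus... : %s" % re.search(rx, track_desc, flags=re.MULTILINE).groups())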
