Python regex find two groups

Question

>>> text = '<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>'

>>> import re
>>> re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)
[('47', ''), ('', 'Another diversion: The softmax output function [7 min]')]

How do I extract the data so that it looks like this:

['47', 'Another diversion: The softmax output function [7 min]']

I think there should be a smarter regular expression for this.

Is there a reason it has to be a smarter regex, rather than, say, not using a regex in the first place? — abarnert, Mar 27 '13 at 07:52

Serdalis · Answer 1 · 2013-03-27T08:09:56.163

2

You can use itertools to flatten the tuples returned by findall and drop the empty strings:

import re
from itertools import chain, ifilter

raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# simple
found = [x for x in chain(*raw_found) if x]

# or faster
found = [x for x in ifilter(None, chain(*raw_found))]

# or more compact, also just as fast
found = list(ifilter(None, chain(*raw_found)))

print found

Output:

['47', 'Another diversion: The softmax output function [7 min]']
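The snippet above is Python 2 (`ifilter`, the `print` statement). A rough Python 3 sketch of the same idea follows, assuming the built-in `filter` (which is lazy in Python 3) stands in for `ifilter`. The `text` value here is an abridged copy of the question's markup, shortened only for illustration.

```python
import re
from itertools import chain

# Abridged version of the question's markup (assumption: attributes trimmed for brevity).
text = ('<a data-lecture-id="47"\n'
        '   class="lecture-link">\n'
        'Another diversion: The softmax output function [7 min]</a>')

# findall returns one tuple per match; the group that did not take part is ''.
raw_found = re.findall(r'data-lecture-id="(\d+)"|(.*)</a>', text)

# chain(*raw_found) flattens the tuples; filter(None, ...) drops the empty strings.
found = list(filter(None, chain(*raw_found)))

print(found)
```

Note that `filter(None, ...)` drops every falsy value, which is fine here because the only falsy items are the empty strings from unmatched groups.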

edited Mar 27 '13 at 08:09

answered Mar 27 '13 at 07:34

Serdalis


I know some people hate `filter(None, it)`, but I think it's more readable than `[x for x in it if x]`. (Not a complaint/correction/whatever; the OP should know how to read/write it both ways.) – abarnert Mar 27 '13 at 07:52
@abarnert Honestly I've never seen that used before, I must admit it seems more pythonic, I'll have to research the advantages / disadv of both. or `itertools.ifilter` definitely sexy there. – Serdalis Mar 27 '13 at 07:54
Well, the main disadvantage is that not everyone knows what it means. There's also the fact that many people who come from certain functional languages think it's a bastardization of what `filter` should mean, while many who don't come from those languages hate `filter` (and `map` and `reduce`) in the first place. The only advantage is that it's more concise, and easier to read if you already know what it means. – abarnert Mar 27 '13 at 07:59

score 2 · Answer 2 · edited May 23 '17 at 12:11

Parsing HTML with regular expressions is not recommended. You can try the xml.dom.minidom module instead:

from xml.dom.minidom import parseString

xml = parseString('<a data-lecture-id="47"\n   data-modal-iframe="https://class.coursera.org/neuralnets-2012-001/lecture/view?lecture_id=47"\n   href="https://class.coursera.org/neuralnets-2012-001/lecture/47"\n   data-modal=".course-modal-frame"\n   rel="lecture-link"\n   class="lecture-link">\nAnother diversion: The softmax output function [7 min]</a>')
anchor = xml.getElementsByTagName("a")[0]
print anchor.getAttribute("data-lecture-id"), anchor.childNodes[0].data
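For readers on Python 3, here is a minimal sketch of the same minidom approach. It assumes the fragment is well-formed XML (minidom is an XML parser and will reject sloppy HTML), and it uses an abridged copy of the markup. The `.strip()` call is an addition: the text node starts with the newline that follows the opening tag.

```python
from xml.dom.minidom import parseString

# Abridged version of the question's markup (assumption: attributes trimmed for brevity).
markup = ('<a data-lecture-id="47"\n'
          '   rel="lecture-link"\n'
          '   class="lecture-link">\n'
          'Another diversion: The softmax output function [7 min]</a>')

# The <a> element is the document root, so it is the first (and only) hit.
anchor = parseString(markup).getElementsByTagName("a")[0]

lecture_id = anchor.getAttribute("data-lecture-id")
# firstChild is the text node inside <a>; strip the leading newline.
title = anchor.firstChild.data.strip()

result = [lecture_id, title]
print(result)
```

For real-world pages that are not valid XML, a tolerant HTML parser would be the safer choice; this sketch only covers the well-formed case shown in the question.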

score 0 · Answer 3 · answered Mar 27 '13 at 07:44


I found a solution myself:

>>> re.findall(r'data-lecture-id="(\d+)"[\s\S]+>([\s\S]+)</a>', text)
[('47', '\nAnother diversion: The softmax output function [7 min]')]

This looks better, but I still have to iterate over the result to get a simple list, and the title keeps its leading newline.
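If there is only one anchor to extract, one way to avoid the extra iteration is `re.search` plus `match.groups()`, which already returns both captures as a single tuple. The sketch below is a variant and not the pattern from this answer: it assumes no attribute value contains `>`, and it uses `\s*` to swallow the newline before the title. The `text` value is an abridged copy of the question's markup.

```python
import re

# Abridged version of the question's markup (assumption: attributes trimmed for brevity).
text = ('<a data-lecture-id="47"\n'
        '   class="lecture-link">\n'
        'Another diversion: The softmax output function [7 min]</a>')

# re.DOTALL lets '.' cross the newlines between attributes.
# '.*?>' lazily skips to the end of the opening tag; '\s*' eats the newline after it.
pattern = r'data-lecture-id="(\d+)".*?>\s*(.*?)</a>'

match = re.search(pattern, text, re.DOTALL)

# groups() returns all captures of the single match as one tuple.
found = list(match.groups()) if match else []

print(found)
```

With several anchors in one document, `re.findall` with the same pattern and flag would give one `(id, title)` tuple per anchor, which is usually the more useful shape anyway.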


WoooHaaaa



If you want to "flatten" a two-deep sequence like this, that's `itertools.chain.from_iterable(x)` (or, if it's an actual sequence rather than an arbitrary iterable, just `itertools.chain(*x)`). Serdalis's answer already explains this. – abarnert Mar 27 '13 at 07:54
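To illustrate the distinction in that comment: `chain(*x)` has to unpack `x` up front, while `chain.from_iterable(x)` consumes it lazily, so it also works with a generator. The sketch below feeds it a generator built from `re.finditer`. The names `groups_per_match` and the abridged `text` are illustrative assumptions, not part of the original thread.

```python
import re
from itertools import chain

# Abridged version of the question's markup (assumption: attributes trimmed for brevity).
text = ('<a data-lecture-id="47"\n'
        '   class="lecture-link">\n'
        'Another diversion: The softmax output function [7 min]</a>')

pattern = re.compile(r'data-lecture-id="(\d+)"|(.*)</a>')

# A generator of tuples: one (id, title) pair per match, with '' for the unused group.
groups_per_match = (m.groups(default='') for m in pattern.finditer(text))

# from_iterable flattens the generator lazily; the condition drops the empty strings.
found = [x for x in chain.from_iterable(groups_per_match) if x]

print(found)
```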
