How to combine all three patterns in one re.findall() call? (Python 2.7, regular expressions)

Question

Filter1=re.findall(r'<span (.*?)</span>',PageSource) 
Filter2=re.findall(r'<a href=.*title="(.*?)" >',PageSource) 
Filter3=re.findall(r'<span class=.*?<b>(.*?)</b>.*?',PageSource)

How can I do this in one line of code, like this:

Filter=re.findall(r'  ',PageSource)

I tried this way:

Filter=re.findall(r'<span (.*?)</span>'+
                  r'<a href=.*title="(.*?)" >'+
                  r'<span class=.*?<b>(.*?)</b>.*?',PageSource)

But it is not working.

In general, it's not a good idea to use regex for HTML. More [here](http://stackoverflow.com/a/1732454/248823). — Marcin, Feb 19 '15 at 23:17
[Scrapy](http://scrapy.org/) or [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) are often used to work with HTML. — Marcin, Feb 19 '15 at 23:30
If you want to use regular expressions, use `|` between your three regexen: `<span (.*?)</span>|<a href=.*title="(.*?)" >|<span class=.*?<b>(.*?)</b>`. That said, don't use regular expressions. — kindall, Feb 19 '15 at 23:50
@kindall It's showing this error: "unsupported operand type(s) for |: 'str' and 'str'" — Nurul Akter Towhid, Feb 19 '15 at 23:53
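
Editor's note: that TypeError is what Python raises when `|` is applied to two `str` objects, so the bar was probably typed between the string literals rather than inside the pattern. Below is a minimal sketch of the alternation approach, assuming a made-up `page_source` (the sample HTML and variable names are illustrative, not from the question). The more specific `<span class=...>` alternative is listed first, because the regex engine takes the first alternative that matches at a given position.

```python
import re

# Hypothetical sample input, shaped to fit the three patterns from the question.
page_source = (
    '<span id="a">first</span>\n'
    '<a href="/x" title="A title" >link</a>\n'
    '<span class="test"><b>bold</b></span>\n'
)

# Joining the fragments with + requires all three to appear back to back,
# which this input (like most real pages) does not satisfy.
concatenated = (
    r'<span (.*?)</span>'
    r'<a href=.*title="(.*?)" >'
    r'<span class=.*?<b>(.*?)</b>.*?'
)
no_matches = re.findall(concatenated, page_source)  # []

# The | lives inside the pattern string; it is not Python's | operator.
combined = (
    r'<span class=.*?<b>(.*?)</b>'
    r'|<span (.*?)</span>'
    r'|<a href=.*title="(.*?)" >'
)
matches = re.findall(combined, page_source)

# findall yields one tuple per match with one slot per group;
# groups that did not take part in the match are empty strings.
flat = [value for groups in matches for value in groups if value]
# flat == ['id="a">first', 'A title', 'bold']
```

Note that a single pass never reports overlapping matches, so the `<span class="test">` element is captured once here, whereas the three separate calls in the question would capture it twice (by the first and the third pattern).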

alecxe · Accepted Answer · 2015-02-19T23:31:12.387

How about using an HTML Parser instead?

Example, using BeautifulSoup:

from bs4 import BeautifulSoup

data = "your HTML here"
soup = BeautifulSoup(data)

span_texts = [span.text for span in soup.find_all('span')]
a_titles = [a['title'] for a in soup.find_all('a', title=True)]
b_texts = [b.text for b in soup.select('span[class] > b')]

result = span_texts + a_titles + b_texts

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <div>
...     <span>Span's text</span>
...     <a title="A title">link</a>
...     <span class="test"><b>B's text</b></span>
... </div>
... """
>>> soup = BeautifulSoup(data)
>>> 
>>> span_texts = [span.text for span in soup.find_all('span')]
>>> a_titles = [a['title'] for a in soup.find_all('a', title=True)]
>>> b_texts = [b.text for b in soup.select('span[class] > b')]
>>> 
>>> result = span_texts + a_titles + b_texts
>>> print result
[u"Span's text", u"B's text", 'A title', u"B's text"]

Aside from that, your regular expressions are quite different and serve different purposes. I would not try to squeeze the unsqueezable: keep them separate and combine the results into a single list.
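
Editor's note: if the regular expressions are kept anyway, the "separate calls, one list" suggestion might look like the sketch below. The three patterns are the ones from the question; `page_source` is a made-up sample and the other names are illustrative.

```python
import re

# Hypothetical sample input, shaped to fit the three patterns from the question.
page_source = (
    '<span id="a">first</span>\n'
    '<a href="/x" title="A title" >link</a>\n'
    '<span class="test"><b>bold</b></span>\n'
)

# The question's three patterns, unchanged and kept separate.
patterns = [
    r'<span (.*?)</span>',
    r'<a href=.*title="(.*?)" >',
    r'<span class=.*?<b>(.*?)</b>.*?',
]

# Run each pattern on its own and append its matches to one list.
result = []
for pattern in patterns:
    result.extend(re.findall(pattern, page_source))

# result == ['id="a">first', 'class="test"><b>bold</b>', 'A title', 'bold']
```

Each pattern has exactly one group, so every `re.findall()` call returns a plain list of strings and the lists can be concatenated directly.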

I have never used BeautifulSoup. I'll try it. — Nurul Akter Towhid, Feb 19 '15 at 23:30
