I'm working on a project where I need to do a bit of scraping. The project is on Google App Engine, and we're currently using Python 2.5. Ideally, we would use PyQuery, but because we're running on App Engine with Python 2.5, this is not an option.
I've seen questions like this one on finding an HTML tag with certain text, but they don't quite hit the mark.
I have some HTML that looks like this:
<div class="post">
<div class="description">
This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
</div>
</div>
<!-- More posts of similar format -->
In PyQuery, I could do something like this (as far as I know):
s = pq(html)
s(".post:contains('This post is about Wikipedia.org')")
# returns all posts containing that text
Naively, I had thought that I could do something like this in BeautifulSoup:
soup = BeautifulSoup(html)
soup.findAll(True, "post", text=("This post is about Google.com"))
# []
However, that yielded no results. I changed my query to use a regular expression, and got a bit further, but still no luck:
soup.findAll(True, "post", text=re.compile(".*This post is about.*Google.com.*"))
# []
It works if I omit Google.com, but then I need to do all the filtering manually. Is there any way to emulate :contains using BeautifulSoup?
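To make "filtering manually" concrete, here is a rough sketch of the kind of loop I would like to avoid writing by hand. It is only an illustration, not something I'm sure is the right approach: the helper name posts_containing, the whitespace normalisation, and the second sample post about Google.com are my own assumptions, and the try/except import is just there so the same snippet can run against either BeautifulSoup 3 or the newer bs4 package.

```python
# Sketch only: posts_containing and the sample data are hypothetical.
try:
    from bs4 import BeautifulSoup            # BeautifulSoup 4, if installed
except ImportError:
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2.x)

# Sample markup; the second post is invented for illustration.
html = """
<div class="post">
<div class="description">
This post is about <a href="http://www.wikipedia.org">Wikipedia.org</a>
</div>
</div>
<div class="post">
<div class="description">
This post is about <a href="http://www.google.com">Google.com</a>
</div>
</div>
"""

def posts_containing(soup, needle):
    """Return every <div class="post"> whose combined text contains needle."""
    matches = []
    for post in soup.findAll("div", "post"):
        # Join all text nodes under the post (including text inside <a>),
        # then collapse whitespace so line breaks do not affect matching.
        text = " ".join("".join(post.findAll(text=True)).split())
        if needle in text:
            matches.append(post)
    return matches

soup = BeautifulSoup(html)
found = posts_containing(soup, "This post is about Google.com")
```

This does the job for the sample above because it compares against the text of the whole post rather than a single text node, but it is exactly the boilerplate I was hoping a :contains-style selector would replace.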
Alternatively, is there some PyQuery-like library that works on App Engine (on Python 2.5)?