Learning Python regex and web scraping, and stuck

Question

I am trying to do web scraping using Python. My goal is to get the link for this product:

http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

This is the search-results URL I am scraping:

 http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream

If you view the page source, you will see that there are no specific ids or tags that would help me pinpoint the URL I need, and I am not very good with regex either. This is what I have so far in Python:

import urllib
import re
product = "3-Piece Reversible Bonded Leather Match Sofa Set in Cream"
productSearchUrl = product.replace(" ","+");
myurl = "http://www.fastfurnishings.com/SearchResults.asp?Search="+productSearchUrl
print myurl
htmlfile = urllib.urlopen(myurl)
htmltext = htmlfile.read()
regex = '<td valign="top" width="33%" align="center">(.+?)</td> '
r = re.compile(regex)
print re.findall(r,htmltext)

But that's not matching anything. Any help will be appreciated.

Please see the answer with 4000 votes to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Hyperboreus, Sep 13 '13 at 06:28

score 3 · Answer 1 · answered Sep 13 '13 at 06:29

You are better off using a dedicated library such as Scrapy (a scraping framework) or BeautifulSoup (an HTML parser). It will save you a lot of pain and let you focus on what you actually want to achieve with the scraped information.

Prahalad Deshpande

score 3 · Accepted Answer · answered Sep 13 '13 at 06:34

This is why you use HTML Parsers such as BeautifulSoup:

>>> import urllib2
>>> from bs4 import BeautifulSoup as BS
>>> html = urllib2.urlopen('http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream')
>>> soup = BS(html)
>>> print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a['href']
http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

See how easy that was ;)

TerryA

you are sooooo awesome! – Asim Zaidi Sep 13 '13 at 06:51
One quick question: what if print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a['href'] doesn't exist? It will then throw an exception. How can I avoid that? – Asim Zaidi Sep 13 '13 at 07:06
@Autolycus Replace the `['href']` with `.get('href')`. If an `href` doesn't exist, it will return `None` – TerryA Sep 13 '13 at 07:09
This didn't work: print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).get('href'). What am I doing wrong? – Asim Zaidi Sep 13 '13 at 07:13
@Autolycus You forgot the `.a` – TerryA Sep 13 '13 at 07:14
@Autolycus So, altogether, the correct code is `print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a.get('href')` – TerryA Sep 13 '13 at 07:14
No go on that. This is my stack trace: Traceback (most recent call last): File "asimsScrapper.py", line 11, in <module> print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a.get('href') AttributeError: 'NoneType' object has no attribute 'a' – Asim Zaidi Sep 13 '13 at 07:16
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/37305/discussion-between-autolycus-and-haidro) – Asim Zaidi Sep 13 '13 at 07:17
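
The traceback in the comments shows that `find` itself returned `None`, so the `.a` lookup fails before `.get('href')` is ever reached. Here is a minimal sketch of guarding each step. It assumes Python 3 with the `bs4` package installed and uses made-up HTML fragments in place of the live page. The helper name `first_product_link` is hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def first_product_link(html):
    """Return the href of the first link in a matching cell, or None."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", {"valign": "top", "width": "33%", "align": "center"})
    if cell is None:
        # No matching <td> on the page (e.g. the search returned nothing).
        return None
    link = cell.a  # first <a> inside the cell, or None if there is none
    if link is None:
        return None
    # .get returns None instead of raising when the attribute is absent.
    return link.get("href")


# Hypothetical fragments standing in for real search-results pages.
found = ('<table><tr><td valign="top" width="33%" align="center">'
         '<a href="/p/bstrblm3p.htm">Sofa Set</a></td></tr></table>')
missing = "<p>No results</p>"

print(first_product_link(found))
print(first_product_link(missing))
```

Returning `None` at each step lets the caller decide what to do with a page that has no result, instead of wrapping the whole chained expression in a `try`/`except`.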

score 0 · Answer 3 · answered Sep 13 '13 at 06:36

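Don't parse HTML with regex, as the others have said. That aside, it looks like there are newlines you're not accounting for. By default `.` does not match a newline, so pass re.DOTALL: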

r = re.compile(regex, re.DOTALL)
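
To illustrate the effect of the flag, here is a small sketch in Python 3. It runs against a made-up fragment, since the real page markup may differ. Note that the pattern in the question also ends with a space after `</td>`, which would have to be present in the page for a match; the sketch drops it.

```python
import re

# Hypothetical fragment standing in for one cell of the search-results page.
html = '''<table><tr>
<td valign="top" width="33%" align="center">
<a href="http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm">Sofa Set</a>
</td>
</tr></table>'''

pattern = r'<td valign="top" width="33%" align="center">(.+?)</td>'

# Without DOTALL, "." stops at the newline right after the opening tag.
without_dotall = re.findall(pattern, html)

# With DOTALL, "." also matches newlines, so the cell body is captured.
with_dotall = re.findall(pattern, html, re.DOTALL)

# Pull the link out of the captured cell body.
href = re.search(r'href="([^"]+)"', with_dotall[0])

print(without_dotall)
print(href.group(1))
```

This only shows why the original pattern found nothing. It stays fragile against any change in attribute order or whitespace, which is the reason the other answers recommend a parser.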

Thomas
