
I am trying to do web scraping using Python. My goal is to get the link for this product:

http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

I am scraping this URL:

http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream

If you view the page source, you will see that there are no distinctive ids or tags that would help me pinpoint the URL I need, and I am not very good with regex either. This is what I have so far in Python:

import urllib
import re
product = "3-Piece Reversible Bonded Leather Match Sofa Set in Cream"
productSearchUrl = product.replace(" ","+");
myurl = "http://www.fastfurnishings.com/SearchResults.asp?Search="+productSearchUrl
print myurl
htmlfile = urllib.urlopen(myurl)
htmltext = htmlfile.read()
regex = '<td valign="top" width="33%" align="center">(.+?)</td> '
r = re.compile(regex)
print re.findall(r,htmltext)

But that's not matching anything. Any help will be appreciated.

Lennart Regebro
Asim Zaidi
  • Please see the answer with 4000 votes to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Hyperboreus Sep 13 '13 at 06:28
  • Use `re.DOTALL` to make the `.` match newlines. – Jerry Sep 13 '13 at 06:30

3 Answers

You are better off using a web scraping library such as Scrapy or BeautifulSoup. It will save you a lot of pain and let you focus on what you actually want to do with the information once you have scraped it.

Prahalad Deshpande

This is why you use HTML parsers such as BeautifulSoup:

>>> import urllib2
>>> from bs4 import BeautifulSoup as BS
>>> html = urllib2.urlopen('http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream')
>>> soup = BS(html)
>>> print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a['href']
http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

See how easy that was ;)

TerryA
  • you are sooooo awesome! – Asim Zaidi Sep 13 '13 at 06:51
  • One quick question: what if `soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a['href']` doesn't exist? It will then throw an exception. How can I avoid that? – Asim Zaidi Sep 13 '13 at 07:06
  • @Autolycus Replace the `['href']` with `.get('href')`. If a `href` doesn't exist, it will return `None` – TerryA Sep 13 '13 at 07:09
  • This didn't work: `print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).get('href')`. What am I doing wrong? – Asim Zaidi Sep 13 '13 at 07:13
  • @Autolycus You forgot the `.a` – TerryA Sep 13 '13 at 07:14
  • @Autolycus So, altogether, the correct code is `print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a.get('href')` – TerryA Sep 13 '13 at 07:14
  • No go on that. This is my stack trace: `Traceback (most recent call last): File "asimsScrapper.py", line 11, in <module> print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a.get('href') AttributeError: 'NoneType' object has no attribute 'a'` – Asim Zaidi Sep 13 '13 at 07:16
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/37305/discussion-between-autolycus-and-haidro) – Asim Zaidi Sep 13 '13 at 07:17
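
The `AttributeError` in the comments above arises because `soup.find(...)` itself returns `None` when no cell matches, so `.a.get('href')` never gets a chance to run. A minimal sketch of one way to guard against that follows. It is written for Python 3 with `bs4` rather than the Python 2 used in this thread, and the helper name `first_product_link` and both sample strings are hypothetical, made up purely for illustration:

```python
from bs4 import BeautifulSoup


def first_product_link(html):
    """Return the href of the first matching cell's link, or None if absent."""
    # "html.parser" is the parser bundled with Python; other parsers also work.
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find("td", {"valign": "top", "width": "33%", "align": "center"})
    # find() returns None when nothing matches, and cell.a is None when the
    # cell contains no <a> tag, so check both before touching attributes.
    if cell is None or cell.a is None:
        return None
    # .get() returns None instead of raising if the link has no href.
    return cell.a.get("href")


# Hypothetical markup standing in for a results page with and without a hit.
sample_hit = (
    '<table><tr><td valign="top" width="33%" align="center">'
    '<a href="http://example.com/p.htm">Product</a></td></tr></table>'
)
sample_miss = "<p>No results</p>"

print(first_product_link(sample_hit))
print(first_product_link(sample_miss))
```

Checking each step explicitly keeps a missing cell, a missing link and a missing `href` all on the same `None` return path, so the caller only has to test one value.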

Don't do this (parsing HTML with regex), etc. That said, it looks like there are newlines you're not accounting for:

r = re.compile(regex, re.DOTALL)
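
To see the effect of the flag in isolation, here is a small Python 3 sketch run against an inline string instead of the live page. The markup in `html_text` is a hypothetical stand-in for the search results, assuming the cell's content is spread over several lines. Note also that the pattern in the question ends with a space after `</td>`, which would have to be present in the page for a match; the sketch drops it.

```python
import re

# Hypothetical stand-in for the results markup: the cell body spans lines.
html_text = """<table><tr>
<td valign="top" width="33%" align="center">
<a href="http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm">Sofa Set</a>
</td>
</tr></table>"""

# Same pattern as in the question, minus the trailing space after </td>.
pattern = r'<td valign="top" width="33%" align="center">(.+?)</td>'

# Without re.DOTALL, "." does not match "\n", so the search finds nothing.
without_dotall = re.findall(pattern, html_text)

# With re.DOTALL, "." also matches newlines and the cell body is captured.
with_dotall = re.findall(pattern, html_text, re.DOTALL)

print(without_dotall)
print(len(with_dotall))
```

The captured group is still the raw cell body, so the link would have to be extracted from it in a second step, which is part of why the other answers suggest an HTML parser instead.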
Thomas