
I'm building a test scraping site with Django. For some reason the following code renders only one image, whereas I'd like it to output every image, every link, and every price. Any help? (Also, if you know how to store this data in a database model so I don't have to scrape the site on every request, I'm all ears, but that may be another question.) Cheers!

Here is the template file:

{% extends "base.html" %}

{% block title %}Boats{% endblock %}

{% block content %}

<img src="{{ fetch_boats }}"/>

{% endblock %}

Here is the views.py file:

#views.py
from django.shortcuts import render_to_response
from django.template.loader import get_template
from django.template import Context
from django.http import Http404, HttpResponse
from fetch_images import fetch_imagery

def fetch_it(request):
    fi = fetch_imagery()
    return render_to_response('fetch_image.html', {'fetch_boats' : fi})

Here is the fetch_images module:

#fetch_images.py
from BeautifulSoup import BeautifulSoup
import re
import urllib2

def fetch_imagery():
    response = urllib2.urlopen("http://www.boattrader.com/search-results/Type")
    html = response.read()

#create a beautiful soup object
    soup = BeautifulSoup(html)

#all boat images have attribute height=165
    images = soup.findAll("img",height="165")
    for image in images:
        return image['src'] #print the url of the image only

# all links to detailed boat information have class lfloat
    links = soup.findAll("a", {"class" : "lfloat"})
    for link in links:
        return link['href']
        #print link.string

# all prices are spans and have the class rfloat
    prices = soup.findAll("span", { "class" : "rfloat" })
    for price in prices:
        return price
        #print price.string

Lastly, if needed the mapped url in urlconf is below:

from django.conf.urls.defaults import *
from mysite.views import fetch_it

urlpatterns = patterns('', ('^fetch_image/$', fetch_it))
Diego

3 Answers


Your fetch_imagery function needs some work. Because you're using return (instead of yield), the first return image['src'] ends the function call, so only the first image URL ever comes back and the loops over links and prices are never reached (I'm assuming here that all those returns are part of the same function definition, as your code shows).
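
To make the difference concrete, here is a minimal sketch that uses a plain list in place of the BeautifulSoup results. The function names first_src_only and every_src are hypothetical and exist only for illustration.

```python
def first_src_only(sources):
    # return exits the function on the first pass through the loop,
    # so the caller only ever sees one item.
    for src in sources:
        return src


def every_src(sources):
    # yield hands back one item, then resumes the loop when the caller
    # asks for the next one, so every item is eventually produced.
    for src in sources:
        yield src


sources = ["a.jpg", "b.jpg", "c.jpg"]
print(first_src_only(sources))   # a.jpg
print(list(every_src(sources)))  # ['a.jpg', 'b.jpg', 'c.jpg']
```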

Also, my assumption is that you will be returning a list/tuple (or defining a generator function) from fetch_imagery, in which case your template needs to look like:

{% block content %}
    {% for image in fetch_boats %}
        <img src="{{ image }}" />
    {% endfor %}
{% endblock %}

This will basically loop over all items (image urls in your case) in your list and will create img tags for each one of them.
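
Since the question also asks for every link and every price, one possible approach is to collect all three kinds of values and pair them up, so the template loops over a single list of records. The sketch below assumes the image URLs, links, and prices have already been extracted as strings and appear in the same order on the page, which nothing here confirms. build_boat_records is a hypothetical helper.

```python
def build_boat_records(image_urls, links, prices):
    # zip stops at the shortest input, so an unmatched trailing image,
    # link, or price is dropped instead of raising an error.
    return [
        {"image": image, "link": link, "price": price}
        for image, link, price in zip(image_urls, links, prices)
    ]


boats = build_boat_records(
    ["/img/1.jpg", "/img/2.jpg"],
    ["/boat/1", "/boat/2"],
    ["$10,000", "$25,000"],
)
print(boats[0]["price"])  # $10,000
```

If fetch_imagery returned such a list, the template loop could read {{ boat.image }}, {{ boat.link }}, and {{ boat.price }} inside {% for boat in fetch_boats %}.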

Rishabh Manocha
  • Thanks Rishabh, I hadn't seen the yield statement before (still rather newbie)... for anyone else, here's a great answer for the yield statement: http://stackoverflow.com/questions/231767/can-somebody-explain-me-the-python-yield-statement – Diego Jun 04 '10 at 11:39

This is a bit out of scope, but to my mind scraping consumes a lot of CPU time, memory, and bandwidth, so I think it should be done in the background, asynchronously.

It's a great idea though :)
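
As a rough illustration of that idea, the sketch below refreshes scraped results on a background thread so that a view only reads the most recent copy instead of scraping on every request. The BackgroundScraper class and its parameters are hypothetical. In a real Django deployment a scheduled management command or a task queue such as Celery would probably be a better fit, so treat this as a sketch of the pattern rather than a recommended implementation.

```python
import threading


class BackgroundScraper:
    def __init__(self, fetch, interval_seconds=300):
        self._fetch = fetch                # callable returning an iterable of results
        self._interval = interval_seconds  # pause between refreshes
        self._lock = threading.Lock()
        self._results = []
        self._stop = threading.Event()

    def refresh(self):
        # Run the (slow) scrape outside the lock, then swap the results in.
        results = list(self._fetch())
        with self._lock:
            self._results = results

    def _run(self):
        while not self._stop.is_set():
            try:
                self.refresh()
            except Exception:
                pass  # keep serving the previous results if a scrape fails
            self._stop.wait(self._interval)

    def start(self):
        threading.Thread(target=self._run, daemon=True).start()

    def stop(self):
        self._stop.set()

    def latest(self):
        # Cheap call for a view: returns a copy of the last good results.
        with self._lock:
            return list(self._results)


scraper = BackgroundScraper(lambda: ["a.jpg", "b.jpg"])
scraper.refresh()        # one manual refresh; start() would repeat it periodically
print(scraper.latest())  # ['a.jpg', 'b.jpg']
```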

dzen
  • Out of scope? How is this done asynchronously? The app I'd like to create requires real-time data, as on kayak.com. Is that an asynchronous scraper? Still learning. Thanks! – Diego Jun 04 '10 at 11:41

I dug around on the 'net for quite a while looking for an example for presenting scraped data and this post really helped. There've been some minor changes to the modules since the question was first posted, so I thought I'd bring it up to date and post the code I have with the changes that were needed.

What's nice about this is it gives an example of how to run some Python code in response to traffic, and generate simple content that doesn't have any reason to involve a database or Model classes.

Assuming you have a working Django project that you can add these changes to, you should be able to browse to <your-base-url>/fetch_boats and see a bunch of boat pictures.

views.py

from django.shortcuts import render
from bs4 import BeautifulSoup
import urllib.request

def fetch_boats(request):
    fi = fetch_imagery()
    return render(request, "fetch_boats.html", {"boat_images": fi})

def fetch_imagery():
    response = urllib.request.urlopen("http://www.boattrader.com")
    html     = response.read()
    soup     = BeautifulSoup(html, features="html.parser")
    # src=True skips <img> tags without a src attribute, which would raise KeyError below
    images   = soup.find_all("img", src=True)

    for image in images:
        yield image["src"]

urls.py

from django.urls import path
from .views import fetch_boats

urlpatterns = [
    path('fetch_boats', fetch_boats, name='fetch_boats'),
]

templates/fetch_boats.html

{% extends 'base.html' %}
{% block title %} ~~~&lt; Boats &gt;~~~ {% endblock title %}
{% block content %}

    {% for image in boat_images %}
        <br /><br />
        <img src="{{ image }}" />
    {% endfor %}

{% endblock content %}
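
One caveat worth hedging: src values on a scraped page may be relative (for example /img/a.jpg), in which case the browser would resolve them against your own site rather than the scraped one. A minimal sketch of one way to handle that, using urljoin from the standard library, is below. absolute_sources is a hypothetical helper, and the example paths are made up.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.boattrader.com"  # the page that was scraped


def absolute_sources(sources, base_url=BASE_URL):
    # urljoin leaves absolute URLs untouched and resolves relative
    # ones against base_url.
    for src in sources:
        yield urljoin(base_url, src)


print(list(absolute_sources(["/img/a.jpg", "http://cdn.example.com/b.jpg"])))
# ['http://www.boattrader.com/img/a.jpg', 'http://cdn.example.com/b.jpg']
```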
Todd