I am using Python 3.5 and trying to scrape a list of URLs (all from the same website). My code is as follows:

import urllib.request
from bs4 import BeautifulSoup



url_list = ['URL1',
            'URL2', 'URL3']

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

# Scraping
def getPropNames():
    for propName in soup.findAll('div', class_="property-cta"):
        for h1 in propName.findAll('h1'):
            print(h1.text)

def getPrice():
    for price in soup.findAll('p', class_="room-price"):
        print(price.text)

def getRoom():
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)


for soups in soup():
    getPropNames()
    getPrice()
    getRoom()

So far, if I print soup, getPropNames, getPrice or getRoom individually, they seem to work. But I can't get the code to go through each of the URLs and print getPropNames, getPrice and getRoom for each one.

Only been learning Python a few months so would greatly appreciate some help with this please!

Maverick

2 Answers

Just think about what this code does:

def soup():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            return soup_maker

Let me show you an example:

def soup2():
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            return maker

And the output for url_list = ['one', 'two', 'three'] is:

one
one a

Do you see what is going on now?

Basically, your soup function exits at the first return. It does not return an iterator or a list, only the first BeautifulSoup object. You are lucky (or not) that this object is iterable :)

So change the code:

def soup3():
    soups = []
    for url in url_list:
        print(url)
        for thing in ['a', 'b', 'c']:
            print(url, thing)
            maker = 2 * thing
            soups.append(maker)
    return soups

Then the output is:

one
one a
one b
one c
two
two a
two b
two c
three
three a
three b
three c

But I believe this will still not work :) Think about what urllib.request.urlopen(url) returns in sauce = urllib.request.urlopen(url), and what your code is actually iterating over in for things in sauce, that is, what things is.
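
To make that question concrete: the object returned by urllib.request.urlopen is file-like, and iterating over a binary file-like object yields one line of bytes at a time. The sketch below uses io.BytesIO as a stand-in for the response (an assumption, so that it runs without a network connection; the HTML is made up) and contrasts line-by-line iteration with reading the whole body:

```python
import io

# Made-up page content; io.BytesIO stands in for the object that
# urllib.request.urlopen(url) would return.
html = (b'<div class="featured-item-inner">\n'
        b'  <h5>Double room</h5>\n'
        b'</div>\n')

# Iterating over a binary file-like object yields one line at a time.
sauce = io.BytesIO(html)
lines = [things for things in sauce]
print(len(lines))   # 3
print(lines[0])     # b'<div class="featured-item-inner">\n'

# read() returns the whole document as a single bytes object.
sauce = io.BytesIO(html)
whole = sauce.read()
print(whole == b''.join(lines))   # True
```

Each things in the original loop is therefore a fragment, such as the opening <div> line on its own, so BeautifulSoup never sees the <div> together with the <h5> inside it. Passing the result of sauce.read() to BeautifulSoup once per URL avoids this.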

Happy coding.

opalczynski
  • Thanks for that Sebastian Opałczyński, I'll take that on board, try to get my head around it, and let you know the outcome! – Maverick Feb 17 '17 at 14:03

Each of the get* functions uses a global variable soup which is not set correctly anywhere. Even if it were, it would not be a good approach. Make soup a function argument instead, e.g.:

def getRoom(soup):
    for theRoom in soup.findAll('div', class_="featured-item-inner"):
        for h5 in theRoom.findAll('h5'):
            print(h5.text)

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)

Secondly, you should use yield instead of return in soup() to turn it into a generator. Otherwise you would need to return a list of BeautifulSoup objects.

def soups():
    for url in url_list:
        sauce = urllib.request.urlopen(url)
        for things in sauce:
            soup_maker = BeautifulSoup(things, 'html.parser')
            yield soup_maker
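
Note that the inner loop above still builds a soup from each line of the response, so an element that spans several lines in the HTML is never parsed as a whole. A minimal sketch of a generator that yields one result per URL instead is shown below. To keep it runnable offline it takes two hypothetical callables: fetch (standing in for lambda url: urllib.request.urlopen(url).read()) and parse (standing in for lambda body: BeautifulSoup(body, 'html.parser')). The stand-ins and the page contents are assumptions for illustration only:

```python
def soups(url_list, fetch, parse):
    """Yield one parsed document per URL, lazily."""
    for url in url_list:
        body = fetch(url)      # the whole response body, not a single line
        yield parse(body)      # execution pauses here until the next item is requested


# Offline stand-ins (hypothetical): a dict lookup instead of a download,
# and bytes.decode instead of BeautifulSoup.
fake_pages = {
    'URL1': b'<h5>Room A</h5>',
    'URL2': b'<h5>Room B</h5>',
}

for soup in soups(['URL1', 'URL2'], fake_pages.__getitem__, bytes.decode):
    print(soup)
```

With the real fetch and parse plugged in, the for loop receives exactly one BeautifulSoup object per URL, which can then be passed to getPropNames, getPrice and getRoom.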

I'd also suggest using XPath or CSS selectors to extract HTML elements: https://stackoverflow.com/a/11466033/2997179.
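
As an illustration of the CSS selector approach, BeautifulSoup's select() method takes a selector string, which can replace the nested findAll loops. A small sketch, assuming bs4 is installed and using made-up HTML that only borrows the class names from the question:

```python
from bs4 import BeautifulSoup

# Made-up markup; only the class names are taken from the question.
html = """
<div class="property-cta"><h1>Seaside Hotel</h1></div>
<div class="featured-item-inner"><h5>Double room</h5></div>
<p class="room-price">120</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# "div.property-cta h1" matches every <h1> nested inside <div class="property-cta">.
prop_names = [h1.text for h1 in soup.select('div.property-cta h1')]
rooms = [h5.text for h5 in soup.select('div.featured-item-inner h5')]
prices = [p.text for p in soup.select('p.room-price')]

print(prop_names, rooms, prices)
```

Collecting the values into lists rather than printing inside the loops also makes the results easier to reuse later.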

Martin Valgur
  • Thank you Martin Valgur, that is insightful. I will look into XPath/CSS. On applying your suggestion I get the following error message: AttributeError: 'function' object has no attribute 'findAll'. Any ideas? – Maverick Feb 17 '17 at 14:01
  • Did you add the `soup` parameter to all functions? I suggest also renaming the `soup()` function to `soups()`. – Martin Valgur Feb 17 '17 at 15:09
  • Thank you, that was where I was going wrong! However, it only seems to work for getPrice. The other two don't return anything. Strange, as when I first wrote these functions I was using one URL and they all worked perfectly. – Maverick Feb 17 '17 at 15:25