I am writing code to scrape data from Crowdcube. The goal is to collect each project's title, description, target capital, raised capital and category.
First I made an attempt on a single page, and the code worked. Here it is:
from bs4 import BeautifulSoup
import urllib.request  # `import urllib` alone does not expose urllib.request
data = {
'title' : [],
'description' : [],
'target' : [],
'raised':[],
'category' : []
}
l=urllib.request.urlopen('https://www.crowdcube.com/investment/primo-18884')
tree= BeautifulSoup(l, 'lxml')
#title
title=tree.find_all('div',{'class':'cc-pitch__title'})
data['title'].append(title[0].find('h2').get_text())
#description
description=tree.find_all('div',{'class':'fullwidth'})
data['description'].append(description[1].find('p').get_text())
#target
target=tree.find_all('div',{'class':'cc-pitch__stats clearfix'})
data['target'].append(target[0].find('dd').get_text())
#raised
raised=tree.find_all('div',{'class':'cc-pitch__raised'})
data['raised'].append(raised[0].find('b').get_text())
#category
category=tree.find_all('li',{'class':'sectors'})
data['category'].append(category[0].find('span').get_text() )
data
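To reuse this per-page logic in the loop later, I also tried wrapping it in a helper function. This is only a sketch: it uses `html.parser` instead of `lxml` to avoid the extra dependency, `scrape_pitch` is a name I made up, and the sample HTML below is a minimal mock I wrote myself; only the class names come from the real page.

```python
from bs4 import BeautifulSoup

def scrape_pitch(html):
    """Extract the five fields from one pitch page's HTML."""
    tree = BeautifulSoup(html, 'html.parser')
    return {
        'title': tree.find_all('div', {'class': 'cc-pitch__title'})[0].find('h2').get_text(),
        'description': tree.find_all('div', {'class': 'fullwidth'})[1].find('p').get_text(),
        'target': tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})[0].find('dd').get_text(),
        'raised': tree.find_all('div', {'class': 'cc-pitch__raised'})[0].find('b').get_text(),
        'category': tree.find_all('li', {'class': 'sectors'})[0].find('span').get_text(),
    }

# Minimal mock page (my own HTML, only the class names match the real site).
sample = """
<div class="cc-pitch__title"><h2>Primo</h2></div>
<div class="fullwidth"><p>intro</p></div>
<div class="fullwidth"><p>Smart toys for kids</p></div>
<div class="cc-pitch__stats clearfix"><dl><dt>Target</dt><dd>£250,000</dd></dl></div>
<div class="cc-pitch__raised"><b>£320,000</b></div>
<ul><li class="sectors"><span>Technology</span></li></ul>
"""
row = scrape_pitch(sample)
```

On the mock, `row` comes back with all five fields filled in, so the selectors themselves seem fine.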
I need to download the same information for every project on the website.
All the links are listed on this page: https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7
To do so, I started by building a list of URLs with this code:
source= urllib.request.urlopen('https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7')
get_link= BeautifulSoup(source, 'lxml')
links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]
links_page = list(set(links_page))  # drop duplicates
links = [l for l in links_page if 'https://www.crowdcube.com/investment/' in l]  # keep only project links
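Before filtering I also checked for relative hrefs, since listing pages sometimes emit them and a relative path would fail the substring filter above. `urljoin` normalizes both forms. A sketch on a mock listing I wrote (the anchor tags are invented, only the URL pattern matches the real site):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = 'https://www.crowdcube.com'

# Mock listing with one relative and one absolute project link (my own HTML).
listing = """
<a href="/investment/floodkit-16516">FloodKit</a>
<a href="https://www.crowdcube.com/investment/primo-18884">Primo</a>
<a href="/about-us">About</a>
"""

soup = BeautifulSoup(listing, 'html.parser')
urls = {urljoin(BASE, a['href']) for a in soup.select('a[href]')}  # set drops duplicates
links = sorted(u for u in urls if 'https://www.crowdcube.com/investment/' in u)
```

Both project links survive the filter and `/about-us` is dropped, so the list-building step looks correct to me.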
This is an example of links that I get from that code:
['https://www.crowdcube.com/investment/floodkit-16516',
'https://www.crowdcube.com/investment/east-end-manufacturing-14667',
'https://www.crowdcube.com/investment/wrap-it-up-18021']
Once I have this list, I thought I could run a for loop over the same code as above. Thus:
for link in links:
    l = urllib.request.urlopen(link)
    tree = BeautifulSoup(l, 'lxml')
    # title
    title = tree.find_all('div', {'class': 'cc-pitch__title'})
    data['title'].append(title[0].find('h2').get_text())
    # description
    description = tree.find_all('div', {'class': 'fullwidth'})
    data['description'].append(description[1].find('p').get_text())
    # target
    target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
    data['target'].append(target[0].find('dd').get_text())
    # raised
    raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
    data['raised'].append(raised[0].find('b').get_text())
    # category
    category = tree.find_all('li', {'class': 'sectors'})
    data['category'].append(category[0].find('span').get_text())
data
This does not work. When I inspect the tree built in the first iteration, it is empty.
Could the problem be that those links are plain strings?
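One thing I still want to rule out (just a guess on my part): urllib's default User-Agent might be getting served an empty or different page. Building the request with a browser-like header would test that:

```python
import urllib.request

URL = 'https://www.crowdcube.com/investment/primo-18884'

# Browser-like User-Agent; the exact string is arbitrary.
req = urllib.request.Request(
    URL, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})

# Opening the page would then be:
#   tree = BeautifulSoup(urllib.request.urlopen(req), 'lxml')
```

Is that a plausible cause, or is something else wrong with the loop?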