I am writing code to scrape data from Crowdcube. The goal is to collect each project's title, description, target capital, raised capital and category.
First I made an attempt on a single page, and the code worked. Here it is:
from bs4 import BeautifulSoup
import urllib.request  # `import urllib` alone does not expose urllib.request
data = {
'title' : [],
'description' : [],
'target' : [],
'raised':[],
'category' : []
}
l=urllib.request.urlopen('https://www.crowdcube.com/investment/primo-18884')
tree= BeautifulSoup(l, 'lxml')
#title
title=tree.find_all('div',{'class':'cc-pitch__title'})
data['title'].append(title[0].find('h2').get_text())
#description
description=tree.find_all('div',{'class':'fullwidth'})
data['description'].append(description[1].find('p').get_text())
#target
target=tree.find_all('div',{'class':'cc-pitch__stats clearfix'})
data['target'].append(target[0].find('dd').get_text())
#raised
raised=tree.find_all('div',{'class':'cc-pitch__raised'})
data['raised'].append(raised[0].find('b').get_text())
#category
category=tree.find_all('li',{'class':'sectors'})
data['category'].append(category[0].find('span').get_text() )
data
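To reuse this per-page logic in the loop later, I also tried wrapping it in a helper function. This is only a sketch: it uses `html.parser` instead of `lxml` to avoid the extra dependency, `scrape_pitch` is a name I made up, and the sample HTML below is a minimal mock I wrote myself; only the class names come from the real page.

```python
from bs4 import BeautifulSoup

def scrape_pitch(html):
    """Extract the five fields from one pitch page's HTML."""
    tree = BeautifulSoup(html, 'html.parser')
    return {
        'title': tree.find_all('div', {'class': 'cc-pitch__title'})[0].find('h2').get_text(),
        'description': tree.find_all('div', {'class': 'fullwidth'})[1].find('p').get_text(),
        'target': tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})[0].find('dd').get_text(),
        'raised': tree.find_all('div', {'class': 'cc-pitch__raised'})[0].find('b').get_text(),
        'category': tree.find_all('li', {'class': 'sectors'})[0].find('span').get_text(),
    }

# Minimal mock page (my own HTML, only the class names match the real site).
sample = """
<div class="cc-pitch__title"><h2>Primo</h2></div>
<div class="fullwidth"><p>intro</p></div>
<div class="fullwidth"><p>Smart toys for kids</p></div>
<div class="cc-pitch__stats clearfix"><dl><dt>Target</dt><dd>£250,000</dd></dl></div>
<div class="cc-pitch__raised"><b>£320,000</b></div>
<ul><li class="sectors"><span>Technology</span></li></ul>
"""
row = scrape_pitch(sample)
```

On the mock, `row` comes back with all five fields filled in, so the selectors themselves seem fine.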
I need to download the same information for every project on the website.
All the links are listed on this page: https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7
To do so, I started by building a list of URLs with this code:
source= urllib.request.urlopen('https://www.crowdcube.com/investments?sort_by=0&q=&hof=1&i1=0&i2=0&i3=0&i4=0&sort_by=7')
get_link= BeautifulSoup(source, 'lxml')
links_page = [a.attrs.get('href') for a in get_link.select('a[href]')]
links_page = list(set(links_page))  # drop duplicates
links = [l for l in links_page if 'https://www.crowdcube.com/investment/' in l]  # keep only project links
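Before filtering I also checked for relative hrefs, since listing pages sometimes emit them and a relative path would fail the substring filter above. `urljoin` normalizes both forms. A sketch on a mock listing I wrote (the anchor tags are invented, only the URL pattern matches the real site):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = 'https://www.crowdcube.com'

# Mock listing with one relative and one absolute project link (my own HTML).
listing = """
<a href="/investment/floodkit-16516">FloodKit</a>
<a href="https://www.crowdcube.com/investment/primo-18884">Primo</a>
<a href="/about-us">About</a>
"""

soup = BeautifulSoup(listing, 'html.parser')
urls = {urljoin(BASE, a['href']) for a in soup.select('a[href]')}  # set drops duplicates
links = sorted(u for u in urls if 'https://www.crowdcube.com/investment/' in u)
```

Both project links survive the filter and `/about-us` is dropped, so the list-building step looks correct to me.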
This is an example of links that I get from that code:
['https://www.crowdcube.com/investment/floodkit-16516',
'https://www.crowdcube.com/investment/east-end-manufacturing-14667',
'https://www.crowdcube.com/investment/wrap-it-up-18021']
Once I have this list, I thought I could run a for loop over the same code as above. Thus:
for link in links:
    l = urllib.request.urlopen(link)
    tree = BeautifulSoup(l, 'lxml')
    # title
    title = tree.find_all('div', {'class': 'cc-pitch__title'})
    data['title'].append(title[0].find('h2').get_text())
    # description
    description = tree.find_all('div', {'class': 'fullwidth'})
    data['description'].append(description[1].find('p').get_text())
    # target
    target = tree.find_all('div', {'class': 'cc-pitch__stats clearfix'})
    data['target'].append(target[0].find('dd').get_text())
    # raised
    raised = tree.find_all('div', {'class': 'cc-pitch__raised'})
    data['raised'].append(raised[0].find('b').get_text())
    # category
    category = tree.find_all('li', {'class': 'sectors'})
    data['category'].append(category[0].find('span').get_text())
data
This does not work. When I inspect the tree built in the first iteration, it is empty.
Could the problem be that those links are plain strings?
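One thing I still want to rule out (just a guess on my part): urllib's default User-Agent might be getting served an empty or different page. Building the request with a browser-like header would test that:

```python
import urllib.request

URL = 'https://www.crowdcube.com/investment/primo-18884'

# Browser-like User-Agent; the exact string is arbitrary.
req = urllib.request.Request(
    URL, headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})

# Opening the page would then be:
#   tree = BeautifulSoup(urllib.request.urlopen(req), 'lxml')
```

Is that a plausible cause, or is something else wrong with the loop?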