Python 3.5 | Scraping data from website

Question

I want to scrape a specific part of the website Kickstarter.com.

I need the text of every project title. The page is structured consistently, and every project has this line:

<div class="Project-title">

My code looks like:

#Loading Libraries
import urllib.request
from bs4 import BeautifulSoup

#define URL for scraping
theurl = "https://www.kickstarter.com/discover/advanced?category_id=16&woe_id=23424829&sort=popularity&seed=2448324&page=1"
thepage = urllib.request.urlopen(theurl)

#Cooking the Soup
soup = BeautifulSoup(thepage,"html.parser")

#Scraping "Project Title" (project-title)
project_title = soup.find('h6', {'class': 'project-title'}).findChildren('a')
title = project_title[0].text
print (title)

If I use soup.find_all, or put a value other than zero in project_title[0], Python shows an error.

I need a list of all the project titles on this page, e.g.:

The Superbook: Turn your smartphone into a laptop for $99
Weighitz: Weigh Smarter
Mine Kafon Drone
World's First And Only Complete Weather Camera System
Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux

Looking at BeautifulSoup's find function, you'll see that it only returns the first element =/ — HolyDanna, Jul 25 '16 at 10:48
@Sebastian Fischer, if you have a new question then ask a new question, don't edit code from an answer into your original question — Padraic Cunningham, Jul 25 '16 at 13:38

HolyDanna · Accepted Answer · 2016-07-25T12:43:35.330

2

find() only returns one element. To get all of them, you must use findAll.

Here's the code you need:

project_elements = soup.findAll('h6', {'class': 'project-title'})
project_titles = [project.findChildren('a')[0].text for project in project_elements]
print(project_titles)

We look at all the elements of tag h6 and class project-title. We then take the title from each of these elements, and create a list with it.
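
To see the difference on a small scale, here is a minimal sketch run against a hand-written HTML snippet. The markup and titles are made up for illustration and are not taken from the live Kickstarter page. Note that findAll is the older spelling of find_all; both names refer to the same method.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described above
sample_html = """
<h6 class="project-title"><a href="/projects/1">First project</a></h6>
<h6 class="project-title"><a href="/projects/2">Second project</a></h6>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() stops at the first match and returns a single Tag (or None)
first = soup.find('h6', {'class': 'project-title'})
first_title = first.findChildren('a')[0].text

# findAll() returns every match as a list-like ResultSet
elements = soup.findAll('h6', {'class': 'project-title'})
all_titles = [element.findChildren('a')[0].text for element in elements]

print(first_title)
print(all_titles)
```

This is also why indexing past zero failed in the question: the result of find() there only ever contained the children of the first h6.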

Hope it helped, and don't hesitate to ask if you have any questions.

Edit: the problem with the above code is that it will fail unless every element in the list returned by findAll has at least one child a tag.

How to prevent this:

project_titles = [project.findChildren('a')[0].text for project in project_elements if project.findChildren('a')]

This adds a title to the list only if project.findChildren('a') has at least one element (an empty list [] evaluates to False).
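
As a rough illustration of that guard, the sketch below uses a made-up snippet in which one h6 has no anchor child. The markup is an assumption for demonstration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second h6 has no <a> child
sample_html = """
<h6 class="project-title"><a href="/projects/1">First project</a></h6>
<h6 class="project-title">No link here</h6>
<h6 class="project-title"><a href="/projects/3">Third project</a></h6>
"""

soup = BeautifulSoup(sample_html, "html.parser")
project_elements = soup.findAll('h6', {'class': 'project-title'})

# An empty result list is falsy, so elements without an <a> child are skipped
project_titles = [
    project.findChildren('a')[0].text
    for project in project_elements
    if project.findChildren('a')
]
print(project_titles)
```

Without the if clause, the second element would raise the IndexError because its list of a children is empty.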

Edit: to get the descriptions of the elements (class project-blurb), let's look at the HTML code.

<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>

This is just a paragraph of class project-blurb. To get them all, we could use the same approach as for project_elements, or a more condensed version:

project_desc = [description.text for description in soup.findAll('p', {'class': 'project-blurb'})]
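
One detail worth knowing: because the text sits on its own line inside the p tag, .text keeps the surrounding newlines. A possible variant, sketched here on a hand-written snippet (the second paragraph is invented for illustration), uses get_text(strip=True) to trim them.

```python
from bs4 import BeautifulSoup

# The first paragraph is the one quoted above; the second is a made-up example
sample_html = """
<p class="project-blurb">
Bagel is a digital tape measure that helps you measure, organize, and analyze any size measurements in a smart way.
</p>
<p class="project-blurb">
A hypothetical second description.
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) drops the newlines that surround the text inside each <p>
project_desc = [
    description.get_text(strip=True)
    for description in soup.findAll('p', {'class': 'project-blurb'})
]
print(project_desc)
```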

edited Jul 25 '16 at 12:43

answered Jul 25 '16 at 10:54


Hey HolyDana. Thank you so much!!!!! But I get an error: "IndexError: list index out of range". Do you know why? – Sebastian Fischer Jul 25 '16 at 11:17
@SebastianFischer this error comes from `project.findChildren('a')[0]`: it fails to find at least a child for one of the elements. I'll edit to add an alternative way to do it, while preventing this error. – HolyDanna Jul 25 '16 at 11:26
Oh HolyDonna.. Thank you. But it won't work. I only get the result "[]" when I print project_titles – Sebastian Fischer Jul 25 '16 at 11:39
@SebastianFischer I only realised I forgot to use `findAll` instead of `find` .... The code should be correct now. – HolyDanna Jul 25 '16 at 12:05
Hey @HolyDanna.... Thank you. The code works. Now I get a list, separated with commas, with the correct strings. I want to adapt your code to the class "Project-blurb" to get the description of the project. I paste the code in my question on top.... Thank you – Sebastian Fischer Jul 25 '16 at 12:34
@SebastianFischer look at my edit. You should take a deeper look at the HTML code, as the description does not have the same format as the title. – HolyDanna Jul 25 '16 at 12:44
Thank you so much. It works and I'm happy right now. Please help me with the next step. I want to scrape the Link (href) from the . I will paste my (non working) code on top to my question :) – Sebastian Fischer Jul 25 '16 at 13:30
There is the new question. Please help me... http://stackoverflow.com/questions/38570411/how-to-scrape-href-with-python-3-5-and-beautifulsoup – Sebastian Fischer Jul 25 '16 at 14:29

score 1 · Answer 2 · answered Jul 26 '16 at 08:37

With respect to the title of this post, I would recommend two tutorials on scraping particular data from a website. Both explain in detail how the task is achieved.

First, I would recommend checking out the pyimagesearch tutorial on scraping images using Scrapy.

Then, if your needs are more specific, a web scraping tutorial will help you.

Padraic Cunningham · Answer 3 · 2016-07-25T11:06:07.893

All the data you want is in the section with the CSS class staff-picks; just find the h6 tags with the project-title class and extract the text from the anchor tag inside:

soup = BeautifulSoup(thepage, "html.parser")

# print() with parentheses works on Python 3 as well as Python 2
print([a.text for a in soup.select("section.staff-picks h6.project-title a")])

Output:

[u'The Superbook: Turn your smartphone into a laptop for $99', u'Weighitz: Weigh Smarter', u'Omega2: $5 IoT Computer with Wi-Fi, Powered by Linux', u"Bagel: The World's Smartest Tape Measure", u'FireFlies - Truly Wire-Free Earbuds - Music Without Limits!', u'ISOLATE\xae - Switch off your ears!']

Or using find with find_all:

project_titles = soup.find("section",class_="staff-picks").find_all("h6", "project-title")
print([proj.a.text for proj in project_titles])

There is also only one anchor tag inside each h6 tag, so you cannot end up with more than one, whichever approach you take.
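
To check that the two approaches select the same elements, here is a small sketch on invented markup. The section names and titles are assumptions; only the staff-picks and project-title class names come from the page described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one title inside the staff-picks section, one outside it
sample_html = """
<section class="staff-picks">
  <h6 class="project-title"><a href="/projects/1">Picked project</a></h6>
</section>
<section class="other">
  <h6 class="project-title"><a href="/projects/2">Other project</a></h6>
</section>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: anchors inside h6.project-title inside section.staff-picks
via_select = [a.text for a in soup.select("section.staff-picks h6.project-title a")]

# Same restriction with find() for the section, then find_all() inside it
section = soup.find("section", class_="staff-picks")
via_find = [h6.a.text for h6 in section.find_all("h6", "project-title")]

print(via_select)
print(via_find)
```

Both lists contain only the title from the staff-picks section; the h6 in the other section is ignored.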
