When inspecting the soup in pdb (breakpoint set just before your for loop), I found:
(Pdb++) p soup
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n\n<html><head>\n<title>410 Gone</title>\n</head><body>\n<h1>Gone</h1>\n<p>The requested resource<br/>/farming<br/>\nis no longer available on this server and there is no forwarding address.\nPlease remove all references to this resource.</p>\n</body></html>\n
That 410 Gone response probably means there is some anti-scraping measure in place! The site has detected that you're trying to scrape with Python and served you an error page instead of the data, so there was nothing for your loop to find.
In the future, I recommend using pdb to inspect the code, or simply printing out the soup, when you run into an issue like this! It can clear up what happened and show you which tags are actually available.
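For example, here is a minimal sketch of that kind of inspection (reusing the /farming URL from your code; the exact output depends on what the server sends back):

import requests
from bs4 import BeautifulSoup

url = 'https://www.donedeal.ie/farming'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# prettify() renders the parsed tree with indentation, which makes it
# easy to see whether the tags you expect are actually in the page.
print(soup.prettify())

# Or drop into the debugger and poke at `soup` interactively:
# import pdb; pdb.set_trace()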
EDIT:
Although I wouldn't necessarily recommend it (scraping is against DoneDeal's terms of service), there is a way to get around this. If you feel like living on the wild side, you can make the requests module's HTTP request look like it's coming from a real browser rather than a script. You can do this using the following:
import requests
from bs4 import BeautifulSoup

# Pretend to be a desktop browser so the server doesn't reject the script.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

def donedeal(max_pages):
    for i in range(1, max_pages + 1):
        page = (i - 1) * 28  # 28 listings per page
        url = 'https://www.donedeal.ie/farming?sort=publishdate%20desc&start={}'.format(page)
        source_code = requests.get(url, headers=headers)
        plain_text = source_code.content
        soup = BeautifulSoup(plain_text, "html.parser")
        for title in soup("p", {"class": "card__body-title"}):
            x = title.text
            print(x)

donedeal(1)
All I did was tell the requests module to send the headers provided in headers with each request. This makes the request look like it's coming from Chrome on a Mac rather than from a Python script.
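If you want to confirm that the headers are what make the difference, a quick sanity check (just a sketch, reusing the URL and headers from above) is to compare the status codes with and without them:

import requests

url = 'https://www.donedeal.ie/farming'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

# The bare request is what produced the 410 Gone page above;
# with the header you should get a normal 200 response instead.
print(requests.get(url).status_code)
print(requests.get(url, headers=headers).status_code)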
I tested this and it printed out the titles you want, with no 410 error! :)
See this answer for more.