
For a few days now I have been trying to scrape this page: http://londoncoffeeguide.com/

I tried to use requests and scrapy, but I'm new to the scraping world and I cannot find a way to log in. Is it possible to log in to this website with requests and use BeautifulSoup to scrape it? Or is it possible to do it with scrapy?

Furthermore, I tried to test requests by following this example on Wikipedia; using the same pages linked there, I tried this:

import requests
from bs4 import BeautifulSoup as bs


def get_login_token(raw_resp):
    soup = bs(raw_resp.text, 'lxml')
    token = [n['value'] for n in soup.find_all('input')
             if n.get('name') == 'wpLoginToken']
    return token[0]


payload = {
    'wpName': 'my_login',
    'wpPassword': 'my_pass!',
    'wpLoginAttempt': 'Log in',
    #'wpLoginToken': '',
    }

with requests.Session() as s:
    resp = s.get('http://en.wikipedia.org/w/index.php?title=Special:UserLogin')
    payload['wpLoginToken'] = get_login_token(resp)
    print(payload)
    response_post = s.post('http://en.wikipedia.org/w/index.php?title=Special:UserLogin&action=submitlogin&type=login', data=payload)
    response = s.get('http://en.wikipedia.org/wiki/Special:Watchlist')

    r = bs(response.content, 'lxml')
    print(r.get_text())

What I see is that I still get the suggestion to log in in order to see the watchlist page.

Where is the mistake?
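For what it's worth, the token-extraction step can be sanity-checked without any network access or third-party parser. This is a standard-library sketch of the same idea as `get_login_token` above; the sample HTML is invented for illustration:

```python
# Stdlib-only version of the token extraction, useful for checking the
# parsing logic without installing bs4/lxml. The sample HTML is made up.
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Records the value of an <input name="wpLoginToken" value="..."> tag."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'input' and d.get('name') == 'wpLoginToken':
            self.token = d.get('value')


def extract_login_token(html):
    finder = TokenFinder()
    finder.feed(html)
    return finder.token


sample = '''
<form>
  <input type="text" name="wpName">
  <input type="hidden" name="wpLoginToken" value="abc123">
</form>
'''
print(extract_login_token(sample))  # abc123
```

Running this against the real login page's HTML should give the same token your browser would submit.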

foebu
  • Don't worry about scraping and `BeautifulSoup` until you can get the page you want in the first place; you're just adding complexity that will make things harder to debug. – abarnert Nov 04 '13 at 22:39
  • Anyway, I notice that you aren't looking at `response_post` at all. So… how do you know whether you logged in successfully? If you didn't, you obviously won't be logged in on subsequent pages… – abarnert Nov 04 '13 at 22:46
  • Also, any particular reason you're trying to scrape the web interface instead of using the [MediaWiki API](http://www.mediawiki.org/wiki/API:Main_page)? – abarnert Nov 04 '13 at 22:51
  • Hello abarnert: my goal is not to scrape Wikipedia, but to scrape londoncoffeeguide. I'm scraping the wiki via the web interface in order to get some practice. Here I'm using BeautifulSoup to check whether I'm logged in. Is there any other way to tell whether I'm in or not? – foebu Nov 04 '13 at 22:53
  • Yeah, look at the `response_post`. Is it the same thing you get in the browser? If so, is there a redirect you have to follow? Or some JS code that the site is expecting you to run? – abarnert Nov 04 '13 at 22:58
  • Well, I get the same thing I get in the browser before I log in. Now I would like to log in, and this way it doesn't log in. (Thank you very much for your help, abarnert!) That's for the wiki. Regarding the other website, I again see what I saw before the login: the same thing I would see looking at the source via the browser (before the login). If I try to log in... no login, as if nothing had happened. If I pick a page I want to scrape, since I need to log in to access it, it sends me to the login page. – foebu Nov 04 '13 at 23:34
  • OK, if the response to your POST doesn't show that you logged in, then it's unlikely that you've logged in, and that's what you have to fix; all the subsequent stuff is irrelevant. Do you want to try to debug this with Wikipedia, or with the site you actually care about? – abarnert Nov 04 '13 at 23:40
  • Also, note that many sites' terms & conditions specifically do not allow you to scrape the site with automated software, and that's _especially_ true for sites that have APIs they want you to use, and that means they may deliberately try to break login via scrapers, or just not care about it, never test it, and eventually break it… – abarnert Nov 04 '13 at 23:41
  • Although I notice that London Coffee Guide's T&C link is [a 404 page](http://londoncoffeeguide.com/Terms---Conditions.aspx). I don't know British law, but I think that means you're allowed to copy all their data and then sue them for copyright infringement for having the original, right? – abarnert Nov 04 '13 at 23:42
  • What I would like to fix is the login on the website I care about: londoncoffeeguide.com. Their terms & conditions page doesn't exist, or at least it's offline now. Their login form, though, seems to have a JavaScript function as the action associated with the login. Maybe my question was unclear, but what I am actually asking for is some help with the login. – foebu Nov 04 '13 at 23:45
  • Well, they publish a book with these data every year. I would just like to have the data in a tidier, more useful format. They did the same in New York: newyorkcoffeeguide.com or something like that. – foebu Nov 04 '13 at 23:49
  • For the NY website the [disclaimer](http://www.newyorkcoffeeguide.com/Terms-Conditions.aspx) exists, but it doesn't seem to say anything about "scraping"; they do say the data may not be correct, though. – foebu Nov 04 '13 at 23:56
  • Are you sure that 4.3 in those T&C isn't a problem? Anyway, if their login works by JavaScript, you have three choices: (1) figure out what that JS does and do the same thing (possibly by just capturing what your browser sends as a result of that JS), (2) use a JS engine to actually run the JS within your scraper, or (3) use Selenium to drive a complete browser. If (1) is feasible, it's usually the best option. Do you know how to log the requests your browser makes? – abarnert Nov 05 '13 at 00:03
  • Well, actually you are right, that's a problem. And it's a pity, given the format of the data they have. I could ask them for permission and see. Anyway, I would like to try to log in via Python; it could be useful in other cases. Referring to your points: 1. How can I capture what my browser sends? I tried software like Charles or Fiddler, but I have never used such software before and got lost. 2. Which kind of engine? [Zombie](http://zombie.labnotes.org/), for example? Or am I off track? 3. I've only heard about it. – foebu Nov 05 '13 at 00:21
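Option (1) from the comments above (capture what the browser sends, then replay it) can be sketched with the standard library. Everything here is a placeholder: the field names and endpoint are hypothetical and would come from watching the real login POST in Fiddler, Charles, or the browser's network tab. The request is only built, not sent:

```python
# Build (but don't send) a login POST like the one a browser would send.
# Field names and URL are hypothetical placeholders; substitute the ones
# captured from the real login request.
from urllib.parse import urlencode
from urllib.request import Request

payload = {
    'UserName': 'my_login',   # hypothetical field name
    'Password': 'my_pass',    # hypothetical field name
}
data = urlencode(payload).encode('ascii')
req = Request('http://londoncoffeeguide.com/login',  # hypothetical endpoint
              data=data,
              headers={'User-Agent': 'Mozilla/5.0'})
print(req.get_method())  # POST
```

Once the captured fields are substituted in, the same dictionary can be handed to `requests.Session().post(...)`, which also keeps the session cookies for subsequent page fetches.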

1 Answer
I got this to log in (yes, I created an account and tested it):

    from mechanize import Browser

    br = Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-agent', 'Firefox')]
    br.open("http://www.londoncoffeeguide.com")
    for form in br.forms():
        if form.attrs['id'] == 'form':
            br.form = form
    br.form['p$lt$zoneContent$PagePlaceholder$p$lt$zoneRight$logonform$Login1$UserName'] = 'username goes here'
    br.form['p$lt$zoneContent$PagePlaceholder$p$lt$zoneRight$logonform$Login1$Password'] = 'password goes here'
    response = br.submit()

Then you can pass response.read() to BeautifulSoup and do all kinds of things.
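For example, if the logged-in page listed venue names in `<h2>` tags (hypothetical markup; a canned string stands in for the real `response.read()` here), the parsing could look like:

```python
# Parse HTML the way you would parse response.read() from mechanize.
# The markup and venue names below are invented for illustration.
from bs4 import BeautifulSoup

html = b'''
<html><body>
  <h2>Prufrock Coffee</h2>
  <h2>Workshop Coffee</h2>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')
names = [h2.get_text() for h2 in soup.find_all('h2')]
print(names)  # ['Prufrock Coffee', 'Workshop Coffee']
```

The real page will of course use different tags and classes; inspect its source first and adjust the selectors accordingly.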

j011y
  • Good answer, j011y! Can you tell me just a few more things about mechanize? Are there other modules that do the same? Is it similar to Selenium? - for the sake of curiosity and completeness. Thank you! :) – foebu Nov 05 '13 at 09:19
  • Actually I have another question: how do I visit another page of the same website without logging in again? I tried to `br.open` another URL but it requires another login. – foebu Nov 05 '13 at 10:33
  • Thanks! I don't know a terrible lot about mechanize, to be honest, as I only used it for the first time the other day at work. After you submit the form (which logs you in) you should be able to follow additional links using br.follow_link(text="the actual link text"). – j011y Nov 05 '13 at 22:56
  • Oh, great, thank you! I got confused by the fact that it works with the link object and not with the URL. Thank you very much. (Even though it's all useless if I'm not allowed to use those data.) – foebu Nov 06 '13 at 13:20
  • No probs. You should also be able to loop over all the links until you find the URL you want. Check my answer to this question: http://stackoverflow.com/questions/19803075/mechanize-mechanize-linknotfounderror/19804123#19804123 – j011y Nov 06 '13 at 22:21