
I am trying to scrape data from the website www.vestiairecollective.com. While scraping, I have access to only a few of its main pages. For example, my script cannot scrape the data for the URL http://www.vestiairecollective.com/women-bags/handbags/#_=catalog&id_brand%5B%5D=50&material%5B%5D=3&step=180 .

I have referred to many Stack Overflow questions that show how to do this. As I am using Python 3.5 on Windows, "mechanize" and "cookielib" don't work. I also saw a few questions pointing out that libraries like "robobrowser" can do the job; I tried that too and got stuck in the middle.

Then I tried with sessions, and when I type request.Sessions(), it says request doesn't have an attribute called Sessions.

Please help me, either with robobrowser or any other way, with code for this particular website using the above-mentioned URL.

This is what I have tried after referring to the answer:

import urllib.request
from bs4 import BeautifulSoup
import requests

session = requests.Session()
loginUrl = 'http://www.vestiairecollective.com/'
resLogin = session.post(loginUrl, data={'h': '5fcdc0ac04537595a747e2830037cca0',
                                        'email': 'something@gmail.com',
                                        'password': 'somepasswrd',
                                        'ga_client_id': '750706459.1463098234'})
url = 'http://www.vestiairecollective.com/women-bags/handbags/#_=catalog&id_brand%5B%5D=50&material%5B%5D=3'
res = session.get(url)
# The URL below is the one I actually want to scrape from
crl = urllib.request.urlopen("http://www.vestiairecollective.com/women-bags/handbags/#_=catalog&id_brand%5B%5D=50&material%5B%5D=3")

soup = BeautifulSoup(crl.read(), "html.parser")

geturl = soup.find_all("div", {"class": "expand-snippet-container"})

for i in geturl:  # the scraping part
    data1 = i.find_all("p", {"class": "brand"})
    datac1 = [da.contents[0] for da in data1]
    brdata = "\n".join(datac1)
    print(brdata)

Here the scraping should be done from the "crl" page, but it is being done from the main page itself.

Ro_nair
    Is that a typo? Have you tried `request.Sessions` or `request.Session`? The former doesn't exist (the library is called `requests`, and the object is a `Session()`). – Oliver W. Jun 03 '16 at 06:53

1 Answer


You've got an error: request.Sessions() should be requests.Session() (the module is requests, and the class is Session).

My answer to a similar question provides some sample code for persistent login with Python requests (Python 3).

Briefly summarized:

  • use the requests module to create a session
  • log in with post or get parameters
  • further requests with the session object will handle the cookies appropriately
  • make sure you use a realistic user agent (otherwise some sites won't log you in, since they consider your script a bot)

Some relevant lines of code to inspire you (not working as is; you need to modify them to your needs):

import requests
session = requests.Session()
session.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'})
# use the site's form field names for the following line
# (and use resLogin for checking successful login):
resLogin = session.post(loginUrl, data={'user': 'username', 'password': 'pwd'})
# follow-up calls to a session which was used to log in
res = session.get(url)
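One way to handle the "checking successful login" step from the comment above: a 200 status only means the request went through, so look at the returned HTML instead. This is a sketch; the credential field names and the marker string are assumptions you must adapt to the actual site:

```python
import requests

def make_session(user_agent):
    """Create a session that sends a realistic user agent with every request."""
    session = requests.Session()
    session.headers.update({'user-agent': user_agent})
    return session

def login_seems_ok(session, login_url, credentials, marker='logout'):
    """Post the credentials and heuristically check the response body.

    `marker` is an assumed string that only appears once you are logged in
    (a logout link, your username, ...); adapt it to the actual site.
    """
    res = session.post(login_url, data=credentials)
    # res.ok is True for any 2xx status; the marker check catches the common
    # case where the server answers 200 but just shows the login form again
    return res.ok and marker in res.text.lower()
```

After this returns True, keep using the same session object for all further `get()`/`post()` calls so its cookies travel along.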
DomTomCat
  • Thank you for your response! But the first line is giving an error `NameError: name 'requests' is not defined`. Also, one more piece of info: this website doesn't have a separate page for login. If you try to visit any page, the URL remains the same, but a small panel appears in the center with a form. If you enter the credentials correctly, it lets you in – Ro_nair Jun 03 '16 at 06:56
  • You have to import requests with `import requests`. For the login data you'll have to find out what the login process sends, e.g. by reading the HTML source or, alternatively, with the Firefox web developer tools/plugins. Also find out whether it's being sent as `post` or `get` data (-> `session.post()` vs. `session.get()`) – DomTomCat Jun 03 '16 at 06:59
  • Got that error corrected! I also want to ask: as my script will go to different URLs of the website and collect data from them in a single run, do I have to repeat this login process each time before requesting a URL? Because, as I mentioned, I want to collect data from the big URL mentioned in my question. – Ro_nair Jun 03 '16 at 08:55
  • nope, just once. Then check the result content if it somehow says you're logged in. From then on (and using `session.get()`/`.post()`) you should be able to continue without login. However, whenever you restart the script in the process, you'll have to re-login – DomTomCat Jun 03 '16 at 08:57
  • Thank you so much, I will try it and mark the answer! Please help me if I come across an issue while doing this. – Ro_nair Jun 03 '16 at 09:00
  • @DomTomCat I don't have any login URL for this site. What should I give as the URL? As I said, just a panel opens up in the center of the page for the login credentials. Shall I just give the website address as it is? – Ro_nair Jun 03 '16 at 09:03
  • The login url can be any. You have to find out which url the login-panel sends the data to and how the data look like. It may well be just the website name as it is but it may also be some asynchronous call with Javascript to a different URL. – DomTomCat Jun 03 '16 at 09:06
  • Will I get it by checking the form action? Because on this website the form has `action='\' method='post'`. And what if it's an asynchronous call? It's driving me nuts! :( – Ro_nair Jun 03 '16 at 09:30
  • Yes, that would mean that the form is transmitted to the normal site url. Unless there's an `onSubmit` attribute of the form tag (which could - but doesn't have to - mean that Javascript transmits data to a different url). In the latter, you'd have to debug the JS call (or look out for browser&network-debugging-plugins) – DomTomCat Jun 03 '16 at 10:55
  • When I print the `res` variable, it shows `response=200`. My last and final question is how to check whether I have logged in with MY profile; I mean by printing the page source – Ro_nair Jun 03 '16 at 11:36
  • One piece of bad news! I thought everything was going right, but now that I've scraped the data, it is not scraping from the desired page, only from the main page, which does not require any login. The website has a feature where, if it finds no user logged in, it automatically redirects to the main page – Ro_nair Jun 03 '16 at 11:47
  • May I post the code I tried in the chat discussions so that you can have a look? – Ro_nair Jun 03 '16 at 11:48
  • well there's always a `get` or `post` parameter to navigate for the webserver (even if it looks like the start page). You may have to include these into the `get()` or `post()` call. Note, same as for the login parameters: you have to find out which they are. You can access the result source with `res.text` or `res.content.decode()`. Code: sure go ahead – DomTomCat Jun 03 '16 at 11:51
  • Shit! I don't have the chat option either; not enough reputation – Ro_nair Jun 03 '16 at 11:57
  • hmm, add an "EDIT" to your question and add the current minimal functional code extract (remove pwds etc) – DomTomCat Jun 03 '16 at 12:21
  • Get rid of the `urllib.request.urlopen()` call; it has nothing to do with the session object you created (you wouldn't be logged in). Replace it with `res=session.get(url)`. You'd then use `soup=BeautifulSoup(res.text,"html.parser")` or `soup=BeautifulSoup(res.content.decode(),"html.parser")` – DomTomCat Jun 03 '16 at 12:57
  • I used the above too. Still it does not scrape from the page I want; it automatically redirects to the accessible pages only. I know it's too much to ask. You have helped a lot, thank you so much! If you wish, you can reply with your opinions on what I could be doing wrong. I deleted the "crl" part from my code and did exactly what you suggested in the comments. – Ro_nair Jun 03 '16 at 14:40
  • Sorry, I'm offline for a while. You should have the necessary tools though. good luck – DomTomCat Jun 03 '16 at 14:44
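Putting the whole thread together, the corrected flow looks roughly like the sketch below. The login field names and CSS classes are copied from the question and may be outdated. Note also that everything after `#` in the catalogue URL is a URL fragment, which clients never send to the server; the filtering is most likely applied client-side by JavaScript, which would explain why the filtered items don't show up in the raw HTML even after a successful login:

```python
import requests
from bs4 import BeautifulSoup

def extract_brands(html):
    """Pull the brand names out of the catalogue markup from the question."""
    soup = BeautifulSoup(html, 'html.parser')
    return [p.contents[0]
            for container in soup.find_all('div', {'class': 'expand-snippet-container'})
            for p in container.find_all('p', {'class': 'brand'})]

def scrape_brands(email, password):
    """Log in once, then fetch the filtered page with the SAME session."""
    session = requests.Session()
    session.headers.update({'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; '
                                          'rv:40.0) Gecko/20100101 Firefox/40.1'})
    # field names copied from the question's post data; verify them in the
    # browser's network tab before relying on them
    session.post('http://www.vestiairecollective.com/',
                 data={'email': email, 'password': password})
    # session.get(), not urllib.request.urlopen(): only the session object
    # carries the login cookies
    url = ('http://www.vestiairecollective.com/women-bags/handbags/'
           '#_=catalog&id_brand%5B%5D=50&material%5B%5D=3')
    res = session.get(url)
    return extract_brands(res.text)
```

If the filtering really is done in JavaScript, the next step would be to find the asynchronous catalogue endpoint in the browser's network tab and call that URL directly with the same session.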