I've tried two completely different methods, but I still can't get the data that is only present after logging in.

I tried one using requests, but the XPath returns an empty result:

import requests
from lxml import html

USERNAME = "xxx"
PASSWORD = "xxx"

LOGIN_URL = "http://www.reginaandrew.com/customer/account/loginPost/referer/aHR0cDovL3d3dy5yZWdpbmFhbmRyZXcuY29tLz9fX19TSUQ9VQ,,/"
URL = "http://www.reginaandrew.com/gold-leaf-glass-top-table"


def main():
    FormKeyTxt = ""
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    # Create payload
    formKey = str((tree.xpath("//*[ @ id = 'login-form'] / input / @ value")))
    FormKeyTxt = "".join(formKey)
    #print(FormKeyTxt.replace("['","").replace("']",""))

    payload = {
        "login[username]": USERNAME,
        "login[password]": PASSWORD,
        "form_key": FormKeyTxt,
        "persistent_remember_me": "checked"
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload)

    # Scrape url
    result = session_requests.get(URL, data=payload)
    tree = html.fromstring(result.content)
    bucket_names = tree.xpath("//span[contains(@class, 'in-stock')]/text()")
    print(bucket_names)
    print(result)
    print(result.status_code)


if __name__ == '__main__':
    main()

I've tried another one using MechanicalSoup, but it still returns nothing:

import argparse
import mechanicalsoup
import urllib.request
from bs4 import BeautifulSoup

parser = argparse.ArgumentParser(description='Login to GitHub.')
parser.add_argument("username")
parser.add_argument("password")
args = parser.parse_args()

browser = mechanicalsoup.Browser()

login_page = browser.get("http://www.reginaandrew.com/gold-leaf-glass-top-table")
login_form = login_page.soup.select("#login-form")[0]


login_form.input({"login[username]": args.username, "login[password]": args.password})


page2 = browser.submit(login_form,login_page.url )
messages = page2.soup.find(class_='in-stock1')
if messages:
    print(messages.text)

print(page2.soup.title.text)

I understand the first solution better, so I'd like to do it that way. Is there anything I'm missing? (I'm sure I'm missing a lot.)

nalzok
Shashwat Aryal
  • You should note that unauthorized scraping of websites can potentially be illegal. – Christian Dean Dec 14 '16 at 16:43
  • @leaf yes, i know. But this is work related and they are paying us to do so. Thanks for the concern! – Shashwat Aryal Dec 14 '16 at 16:53
  • Have you tried using Selenium for this? – pragman Dec 14 '16 at 17:12
  • There are a lot of reasons this could be failing, and it might ultimately prove impossible. I would just move on to an automated browser using a library like Selenium. If you are doing this from a VPS, you can even run a headless browser using pyvirtualdisplay, no GUI required. – Tim McDonald Dec 14 '16 at 17:22
  • @Bitonator No, I have not, but I have heard its name here and there. Will try it and post updates. Thanks! – Shashwat Aryal Dec 14 '16 at 17:22
  • Take a look at the code for the form. It submits a POST request to a (different) URL. I'd try using that URL directly with the username/password in as parameters. (Sorry I'm not more specific, I haven't actually done it in a while.) – A. L. Flanagan Dec 14 '16 at 17:24
  • @TimMcDonald I'll give Selenium a shot, as Bitonator suggested it too. I have no idea what headless browsers or pyvirtualdisplay are. I'll try them and post updates. Thanks! – Shashwat Aryal Dec 14 '16 at 17:27
  • @A.L.Flanagan I have edited the code but still no results. Do you have any other suggestion? – Shashwat Aryal Dec 14 '16 at 17:38
  • @ShashwatAryal Well in that case, alright then. – Christian Dean Dec 14 '16 at 18:19
  • @A.L.Flanagan is right, looks like it generates a different POST URL each time, so you'll have to get the new POST URL dynamically (scrape it from your GET results). Also, it posts more than just the login[username] and login[password]. You'll need to POST **all** inputs. I see `form_key` and `persistent_remember_me` as valid inputs which you're not posting. `form_key` is clearly required, and you'll have to scrape its value from the GET results, too. – pbuck Dec 14 '16 at 18:26
  • @Peter I did add form_key and persistent_remember_me but still no results. Did I do it correctly? I don't get what you and A.L.Flanagan meant by POST URL. Is it the value of action in the form tag? If so, it doesn't seem to change, and I did add that to my LOGIN_URL. – Shashwat Aryal Dec 15 '16 at 02:32
  • @A.L.Flanagan what did you mean by POST URL? Is it the value of action in the form tag? – Shashwat Aryal Dec 15 '16 at 02:36
  • Yes, the POST URL would be the action in the form tag. Your site seems to be down, so I can't check it further. I don't have access to the JavaScript, and that can change everything. (That's one way websites make it harder to casually scrape their contents.) – pbuck Dec 15 '16 at 04:54
  • @Peter your input helped me a lot. I have some more questions regarding this same topic. Could you lend a hand? Thanks! [New Question](http://stackoverflow.com/questions/41986221/logging-into-websites-using-request) – Shashwat Aryal Feb 01 '17 at 17:43
  • @A.L.Flanagan you did point me in the right direction last time. I have some more questions regarding this same topic. Could you lend a hand? Thanks! [New Question](http://stackoverflow.com/questions/41986221/logging-into-websites-using-request) – Shashwat Aryal Feb 01 '17 at 17:45
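
The advice in the comments above (read the form's `action` and post every input, including `form_key`) can be illustrated offline. This is only a minimal sketch using the standard library's `html.parser`. The class name `LoginFormParser`, the sample markup, and the `example.com` URL are hypothetical stand-ins, not the real site's HTML.

```python
from html.parser import HTMLParser


class LoginFormParser(HTMLParser):
    """Collect the action URL and the input fields of one form, chosen by id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
            self.action = attrs.get("action")
        elif tag == "input" and self.in_form and attrs.get("name"):
            # Inputs without a value attribute are stored as empty strings.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical markup standing in for the page returned by the GET request.
SAMPLE_HTML = """
<form id="login-form" action="http://example.com/customer/account/loginPost/" method="post">
  <input name="form_key" type="hidden" value="AbC123xyz" />
  <input name="login[username]" type="email" value="" />
  <input name="login[password]" type="password" />
  <input name="persistent_remember_me" type="checkbox" checked="checked" />
</form>
"""

parser = LoginFormParser("login-form")
parser.feed(SAMPLE_HTML)

# Start from every field the form declares, then fill in the user-supplied ones.
payload = dict(parser.fields)
payload["login[username]"] = "user@example.com"  # placeholder credentials
payload["login[password]"] = "secret"
payload["persistent_remember_me"] = "on"
```

With a `requests` session, `parser.action` and `payload` would then be passed to `session.post(parser.action, data=payload)`. Whether the real site needs further fields or headers is not something this sketch can confirm.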

1 Answer

This should do it

import requests
import re

url = "http://www.reginaandrew.com/"
r = requests.session()
rs = r.get(url)
cut = re.search(r'<form.+?id="login-form".+?<\/form>', rs.text, re.S|re.I).group()
action = re.search(r'action="(.+?)"', cut).group(1)
form_key = re.search(r'name="form_key".+?value="(.+?)"', cut).group(1)
payload = {
    "login[username]": "fugees",
    "login[password]": "nugees",
    "form_key": form_key,
    "persistent_remember_me": "on"
}
rs = r.post(action, data=payload, headers={'Referer':url})
0x3h
  • I'll try it and let you know later in the evening. Thanks! – Shashwat Aryal Dec 15 '16 at 07:53
  • It works! Yay. Thank you so much. The only difference I can see from my code is that you have "on" for "persistent_remember_me". Is this the only place where I was wrong? I'd be grateful if you could point that out as well. – Shashwat Aryal Dec 15 '16 at 08:48
  • I believe form_key was the problem. You can easily play with the posted data and determine where the problem was. – 0x3h Dec 15 '16 at 12:17
  • I have some more questions regarding this same topic. Could you lend a hand? Thanks! [New Question](http://stackoverflow.com/questions/41986221/logging-into-websites-using-request) – Shashwat Aryal Feb 01 '17 at 17:46
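
One plausible reading of the `form_key` remark above: in the question's code, the XPath result (a list of strings) is passed through `str()` before being joined, so the brackets and quotes of the list's printed form end up in the posted value. The snippet below is a small sketch of that effect. The token `AbC123xyz` is a made-up example, and a plain list stands in for the return value of `tree.xpath(...)`.

```python
# An XPath attribute query such as ".../input/@value" returns a list of strings.
values = ["AbC123xyz"]  # hypothetical form_key value

# What the question's code does: stringify the whole list, then join its characters.
broken = "".join(str(values))

# Taking the first element keeps only the token itself.
fixed = values[0]

print(repr(broken))
print(repr(fixed))
```

The first value still carries the list's `['` and `']` wrapping, so a server comparing it against the expected token would presumably reject it. The commented-out `replace` calls in the question strip that wrapping but are never applied to the payload.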