My previous question (logging in to a website using requests) generated some awesome answers, and with those I was able to scrape a lot of sites. But the site I'm working on now is tricky. I don't know if it's a website bug or something done intentionally, but I cannot scrape it.
Here's part of my code:
import requests
import re
from lxml import html
from multiprocessing.dummy import Pool as ThreadPool
from fake_useragent import UserAgent
import time
import ctypes
# Timestamped output file name
now = time.strftime('%d.%m.%Y_%H%M%S_')
FileName = now + "Scraped data.txt"
fileW = open(FileName, "w")
# One product URL per line
url = open('URL.txt', 'r').read().splitlines()
# Header row for the output file
fileW.write("URL Name SKU Dimensions Availability MSRP NetPrice\n")
count = 0
no_of_pools = 14
r = requests.session()
payload = {
"email":"I cant give them out in public",
"password":"maybe I can share it privately if anyone can help me with it :)",
"redirect":"true"
}
rs = r.get("https://checkout.reginaandrew.com/store/checkout.ssp?fragment=login&is=login&lang=en_US&login=T#login-register")
rs = r.post("https://checkout.reginaandrew.com/store/checkout.ssp?fragment=login&is=login&lang=en_US&login=T#login-register",data=payload,headers={'Referer':"https://checkout.reginaandrew.com/store/my_account.ssp"})
rs = r.get("https://checkout.reginaandrew.com/store/my_account.ssp")
tree = html.fromstring(rs.content)
print(str(tree.xpath("//*[@id='site-header']/div[3]/nav/div[2]/div/div/a/@href")))
The problem is that even when I log in manually and open a product URL by entering it in the address bar, the browser doesn't recognize that I'm logged in.
The only way around that is to click a link on the page you are redirected to after logging in. Only then does the browser recognize the login, and I can open specific URLs and see all the information.
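To narrow down what that click changes, one thing I can do is compare the session's cookies before and after a request. This is only a sketch: `cookie_diff` is a helper name I made up, the snapshots below are invented values, and it assumes the login state shows up as cookies at all, which I haven't confirmed for this site.

```python
def cookie_diff(before, after):
    """Return cookies that were added, removed, or changed between two snapshots.

    Both arguments are plain dicts mapping cookie name -> value.
    """
    added = {k: v for k, v in after.items() if k not in before}
    removed = {k: v for k, v in before.items() if k not in after}
    changed = {k: (before[k], after[k])
               for k in before.keys() & after.keys()
               if before[k] != after[k]}
    return {"added": added, "removed": removed, "changed": changed}


# Made-up snapshots, just to show the shape of the result.
before = {"session": "abc"}
after = {"session": "xyz", "logged_in": "T"}
diff = cookie_diff(before, after)
print(diff)
```

With requests, the snapshots could come from `r.cookies.get_dict()` taken before and after each `r.get(...)` or `r.post(...)` call.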
The obstacle I ran into is that the link changes. The print statement in the code:
print(str(tree.xpath("//*[@id='site-header']/div[3]/nav/div[2]/div/div/a/@href")))
This should have extracted the link, but it returns nothing.
Any ideas?
EDIT: rs.content (with whitespace stripped) is:
<!DOCTYPE html><html lang="en-US"><head><meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<link rel="shortcut icon" href="https://checkout.reginaandrew.com/c.1283670/store/img/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<title></title>
<!--[if !IE]><!-->
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css/checkout.css?t=1484321730904">
<!--<![endif]-->
<!--[if lte IE 9]>
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout_2.css?t=1484321730904">
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout_1.css?t=1484321730904">
<link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout.css?t=1484321730904">
<![endif]-->
<!--[if lt IE 9]>
<script src="/c.1283670/store/javascript/html5shiv.min.js"></script>
<script src="/c.1283670/store/javascript/respond.min.js"></script>
<![endif]-->
<script>var SC=window.SC={ENVIRONMENT:{jsEnvironment:typeof nsglobal==='undefined'?'browser':'server'},isCrossOrigin:function(){return 'checkout.reginaandrew.com'!==document.location.hostname},isPageGenerator:function(){return typeof nsglobal!=='undefined'},getSessionInfo:function(key){var session=SC.SESSION||SC.DEFAULT_SESSION||{};return key?session[key]:session},getPublishedObject:function(key){return SC.ENVIRONMENT&&SC.ENVIRONMENT.published&&SC.ENVIRONMENT.published[key]?SC.ENVIRONMENT.published[key]:null}};function loadScript(data){'use strict';var element;if(data.url){element='<script src="'+data.url+'"></'+'script>'}else{element='<script>'+data.code+'</'+'script>'}if(data.seo_remove){document.write(element)}else{document.write('</div>'+element+'<div class="seo-remove">')}}
</script>
</head>
<body>
<noscript>
<div class="checkout-layout-no-javascript-msg">
<strong>Javascript is disabled on your browser.</strong><br>
To view this site, you must enable JavaScript or upgrade to a JavaScript-capable browser.
</div>
</noscript>
<div id="main" class="main"></div>
<script>loadScript({url: '/c.1283670/store/checkout.environment.ssp?lang=en_US&cur=USD&t=' + (new Date().getTime())});
</script>
<script>if (!~window.location.hash.indexOf('login-register') && !~window.location.hash.indexOf('forgot-password') && 'login-register'){window.location.hash = 'login-register';}
</script>
<script src="/c.1283670/store/javascript/checkout.js?t=1484321730904"> </script>
<script src="/cms/2/assets/js/postframe.js"></script>
<script src="/cms/2/cms.js"></script>
<script>SCM['SC.Checkout'].Configuration.currentTouchpoint = 'login';</script>
</body>
</html>
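Looking at this output, the body is only an empty `<div id="main">` plus `<script>` tags, so the header I'm targeting is presumably built by JavaScript after the page loads, and requests does not run JavaScript. The page also sets `currentTouchpoint = 'login'`, so I may still be getting the login page rather than the account page. Here is a small sketch to confirm the first point from Python, using only the standard library's `html.parser` instead of lxml. `PageOutline` and the shortened `sample` string are my own stand-ins, not anything from the site:

```python
from html.parser import HTMLParser


class PageOutline(HTMLParser):
    """Collect every id attribute and every external script URL in a page."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.script_srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        if tag == "script" and attrs.get("src"):
            self.script_srcs.append(attrs["src"])


# Shortened copy of the response body shown above (an assumption for the demo;
# in the real script this would be rs.text).
sample = """<!DOCTYPE html><html lang="en-US"><head><title></title></head>
<body>
<div id="main" class="main"></div>
<script src="/c.1283670/store/javascript/checkout.js?t=1484321730904"> </script>
<script src="/cms/2/cms.js"></script>
</body>
</html>"""

outline = PageOutline()
outline.feed(sample)
outline.close()

print("ids found:", sorted(outline.ids))
print("site-header present:", "site-header" in outline.ids)
print("external scripts:", outline.script_srcs)
```

If `site-header` is missing from the raw response, no XPath can find it there, and the link would have to come from whatever the scripts load or from something that actually executes the JavaScript.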