
My previous question (logging in to a website using requests) generated some awesome answers, and with them I was able to scrape a lot of sites. But the site I'm working on now is tricky. I don't know if it's a website bug or something done intentionally, but I cannot scrape it.

Here's a part of my code:

import requests
import re
from lxml import html
from multiprocessing.dummy import Pool as ThreadPool
from fake_useragent import UserAgent
import time
import ctypes
# (several of these imports are used in parts of the script not shown here)

# Output file named with a timestamp, e.g. "01.02.2017_183000_Scraped data.txt"
now = time.strftime('%d.%m.%Y_%H%M%S_')
FileName = now + "Scraped data.txt"
fileW = open(FileName, "w")
url = open('URL.txt', 'r').read().splitlines()  # one product URL per line
fileW.write("URL    Name    SKU Dimensions  Availability    MSRP    NetPrice\n")
count = 0
no_of_pools = 14
r = requests.Session()  # one session so cookies persist across requests

payload = {
    "email": "I cant give them out in public",
    "password": "maybe I can share it privately if anyone can help me with it :)",
    "redirect": "true"
}

# Load the login page first (picks up session cookies), then post the credentials.
rs = r.get("https://checkout.reginaandrew.com/store/checkout.ssp?fragment=login&is=login&lang=en_US&login=T#login-register")
rs = r.post("https://checkout.reginaandrew.com/store/checkout.ssp?fragment=login&is=login&lang=en_US&login=T#login-register",
            data=payload,
            headers={'Referer': "https://checkout.reginaandrew.com/store/my_account.ssp"})
rs = r.get("https://checkout.reginaandrew.com/store/my_account.ssp")
tree = html.fromstring(rs.content)
print(str(tree.xpath("//*[@id='site-header']/div[3]/nav/div[2]/div/div/a/@href")))

The problem is that even when I manually log in and then open a product URL by entering it in the address bar, the browser doesn't recognize that I'm logged in.

The only way around that is to click a link on the page you are redirected to after logging in. Only then does the browser recognize the login, and I can open specific URLs and see all the information.

The obstacle I ran into is that this link changes. The print statement in the code,

print(str(tree.xpath("//*[@id='site-header']/div[3]/nav/div[2]/div/div/a/@href")))

should have extracted the link, but it returns nothing.
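
A quick sanity check (a small debugging sketch, not part of the original script) is to test whether the element the xpath targets is present in the raw response at all:

# If this prints False / an empty list, the header link is never in the
# server's HTML -- it must be added later by JavaScript.
print(b"site-header" in rs.content)
print(tree.xpath("//*[@id='site-header']"))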

Any ideas?

EDIT: with whitespace stripped out, rs.content is:

<!DOCTYPE html><html lang="en-US"><head><meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <link rel="shortcut icon" href="https://checkout.reginaandrew.com/c.1283670/store/img/favicon.ico" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
    <title></title>
    <!--[if !IE]><!-->
    <link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css/checkout.css?t=1484321730904">
    <!--<![endif]-->
    <!--[if lte IE 9]>
    <link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout_2.css?t=1484321730904">
    <link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout_1.css?t=1484321730904">
    <link rel="stylesheet" href="https://checkout.reginaandrew.com/c.1283670/store/css_ie/checkout.css?t=1484321730904">
    <![endif]-->
    <!--[if lt IE 9]>
    <script src="/c.1283670/store/javascript/html5shiv.min.js"></script>
    <script src="/c.1283670/store/javascript/respond.min.js"></script>
    <![endif]-->
    <script>var SC=window.SC={ENVIRONMENT:{jsEnvironment:typeof nsglobal==='undefined'?'browser':'server'},isCrossOrigin:function(){return 'checkout.reginaandrew.com'!==document.location.hostname},isPageGenerator:function(){return typeof nsglobal!=='undefined'},getSessionInfo:function(key){var session=SC.SESSION||SC.DEFAULT_SESSION||{};return key?session[key]:session},getPublishedObject:function(key){return SC.ENVIRONMENT&&SC.ENVIRONMENT.published&&SC.ENVIRONMENT.published[key]?SC.ENVIRONMENT.published[key]:null}};function loadScript(data){'use strict';var element;if(data.url){element='<script src="'+data.url+'"></'+'script>'}else{element='<script>'+data.code+'</'+'script>'}if(data.seo_remove){document.write(element)}else{document.write('</div>'+element+'<div class="seo-remove">')}}
</script>
</head>
  <body>
    <noscript>
      <div class="checkout-layout-no-javascript-msg">
        <strong>Javascript is disabled on your browser.</strong><br>
        To view this site, you must enable JavaScript or upgrade to a JavaScript-capable browser.
      </div>
    </noscript>
    <div id="main" class="main"></div>
    <script>loadScript({url: '/c.1283670/store/checkout.environment.ssp?lang=en_US&cur=USD&t=' + (new Date().getTime())});
    </script>
    <script>if (!~window.location.hash.indexOf('login-register') && !~window.location.hash.indexOf('forgot-password') && 'login-register'){window.location.hash = 'login-register';}
    </script>
    <script src="/c.1283670/store/javascript/checkout.js?t=1484321730904">  </script>
    <script src="/cms/2/assets/js/postframe.js"></script>
    <script src="/cms/2/cms.js"></script>
    <script>SCM['SC.Checkout'].Configuration.currentTouchpoint = 'login';</script>
</body>
</html>
  • Debug it by printing out the value of `rs.content`. The resulting tree may not be what you think it is. Then attempt to match each portion of your xpath: `"//*[@id='site-header']"`, then `"//*[@id='site-header']/div[3]"`, etc., to see where your xpath fails to match. – pbuck Feb 01 '17 at 18:07
  • @Peter I did do that. I'll edit the question to post the results. Nothing I expected is in the results. Thank you so much for the quick reply! – Shashwat Aryal Feb 01 '17 at 18:11
  • @Peter Is it not working because there's no JavaScript? – Shashwat Aryal Feb 01 '17 at 18:13
  • 1
  • Because no javascript? Yes and no. The actual document you retrieve is normally interpreted by the browser: it contains a lot of javascript, loading other JS files. Presumably, THOSE javascript files build up the DOM which will match your xpath. So, you'll either need to load and execute those javascript files (and build the DOM) or look at those JS files to see how they calculate the href you're looking for. (That can be a lot of work!!!) Or, scrape using a browser via Selenium (which is much slower). – pbuck Feb 01 '17 at 18:47
  • @Peter Ok great! I'll look into those. I'm avoiding using Selenium as much as possible for the exact reason you stated: it's sluggish. If I provide you with the password and username, I think you might be able to say confidently what exactly will work. Could you look into it and point me in the right direction? – Shashwat Aryal Feb 01 '17 at 19:11
  • 1
  • Selenium will work, have no doubt. You'd still need to write Python (or another language ... Selenium is multi-lingual) to drive the browser and wait for the DOM to get fully loaded. It's error prone because of timing issues (waiting for code to load, waiting for code to execute, etc.), which is non-deterministic. The end result is you're trying to scrape a site which doesn't want to make it easy (and they can make it infinitely harder if they want). Try Selenium for fun, but you might also investigate whether the site provides an API to avoid scraping, or find a different target. – pbuck Feb 01 '17 at 20:13
  • Just so you know, the login URL for the POST request should be `https://checkout.reginaandrew.com/c.1283670/store/services/Account.Login.Service.ss?n=2&c=1283670&n=2`; you're currently sending the POST request to the login landing page. – Shane Feb 02 '17 at 04:30
  • @Peter I tried everything I could with requests, but just couldn't seem to make it work. I'll have to work with Selenium for now. Thanks a lot! – Shashwat Aryal Feb 04 '17 at 17:44
  • @Peter If you could write a simple answer I'd be glad to accept yours as you helped me so much! – Shashwat Aryal Feb 04 '17 at 17:46
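
For reference, a minimal sketch of what Shane's comment suggests: posting the credentials to the login service endpoint rather than the landing page. It reuses `r` and `payload` from the question's code; whether the service accepts this form body (or expects JSON instead) is an assumption that would need checking in the browser's Network tab.

# Hypothetical follow-up to the question's code -- endpoint taken from Shane's comment.
login_url = ("https://checkout.reginaandrew.com/c.1283670/store/services/"
             "Account.Login.Service.ss?n=2&c=1283670&n=2")
rs = r.post(login_url, data=payload,
            headers={"Referer": "https://checkout.reginaandrew.com/store/my_account.ssp"})
print(rs.status_code)
print(rs.text[:300])  # inspect what the service actually returns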

2 Answers


This is going to be quite tricky, and you might want to use a more sophisticated tool like Selenium that can emulate a browser.

Otherwise, you will need to investigate what cookies or other kind of authentication the site requires for you to log in. Note all the cookies that are being passed behind the scenes -- it's not quite as simple as entering the username/password here. You can see what information is being passed by viewing the Network tab in your browser's developer tools.
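
As a starting point, you can compare the cookies the requests session actually holds after the login POST against what the browser shows in the Network tab (a small diagnostic sketch using `r` from the question's code):

# List every cookie the requests session picked up during login.
for cookie in r.cookies:
    print(cookie.name, cookie.domain, cookie.path)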


Finally, if you are worried that Selenium might be 'sluggish' (it is -- after all, it is doing the same thing a user would do when opening a browser and clicking things), you can try something like CasperJS, though its learning curve is quite a bit steeper than Selenium's -- you might want to try Selenium first.
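
As a rough illustration, logging in through Selenium might look like the sketch below. The field names and the submit-button selector are guesses -- the real login form would need to be inspected -- and the explicit wait is there because the DOM is built by JavaScript after the page loads. (The `find_element_by_*` calls match the Selenium API style quoted in the comments below.)

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get("https://checkout.reginaandrew.com/store/checkout.ssp"
           "?fragment=login&is=login&lang=en_US&login=T#login-register")

# Hypothetical selectors -- inspect the real form to find the right ones.
driver.find_element_by_name("email").send_keys("your-email")
driver.find_element_by_name("password").send_keys("your-password")
driver.find_element_by_xpath("//button[@type='submit']").click()

# Wait until the JavaScript-built header link exists, then read it.
link = WebDriverWait(driver, 15).until(
    lambda d: d.find_element_by_xpath(
        "//*[@id='site-header']/div[3]/nav/div[2]/div/div/a"))
print(link.get_attribute("href"))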

David542
  • Doing everything with requests was just too much for me. I think someone with a fair bit of knowledge could accomplish it, but in the end I had to go with Selenium. Thanks! – Shashwat Aryal Feb 04 '17 at 17:43

Scraping sites can be hard.

Some sites send you well-formed HTML, and all you need to do is search within it to find the data, links, whatever you need for scraping.

Some sites send you poorly-formed HTML. Browsers have, over the years, become pretty accepting of "bad" HTML and do the best they can to interpret what the HTML is trying to do. The downside is that if you're using a strict parser to decipher the HTML, it may fail: you need something that can work with fuzzy data, or you can just brute-force it with regex. Your use of xpath only works if the resulting HTML parses into a well-formed document.

Some sites (more and more these days) send a bit of HTML, plus JavaScript, and perhaps JSON, XML, whatever, to the browser. The browser then constructs the final HTML (the DOM) and displays it to the user. That's what you have here.

You want to scrape the final DOM, but that's not what the site is sending you. So you either need to scrape what they do send -- for example, you figure out that the link you want can be determined from the JSON they send, as sketched below: {books: [{title: "Graphs of Wrath", code: "a88kyyedkgH"}]} ==> example.com/catalog?id=a88kyyedkgH -- or you scrape through a browser (e.g. using Selenium), letting the browser do all the requests and build up the DOM, and then you scrape the result. It's slower, but it works.
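
A toy version of that first approach, using the made-up JSON above (the real site's payload and URL pattern would have to be discovered in the Network tab):

import json

raw = '{"books": [{"title": "Graphs of Wrath", "code": "a88kyyedkgH"}]}'
for book in json.loads(raw)["books"]:
    # Rebuild the catalog link from the code field.
    print("http://example.com/catalog?id=" + book["code"])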

When it gets hard, consider:

  1. The site probably doesn't want you to be doing this, and we webmasters have just as many tools to make your life harder and harder.
  2. Alternatively, there may be a published API designed for you to get most of the information (Amazon is a great example). (My guess is Amazon knows it can't beat all the webscrapers, so it's better for them to offer a way which doesn't consume so many resources on their main servers.)
pbuck
  • In the end I had to go with Selenium. I really didn't want to, but that's the only thing I could get to work. – Shashwat Aryal Feb 04 '17 at 18:18
  • Can something like logging in from Selenium and passing the cookies [or something else] to requests, and then scraping from requests, be done? – Shashwat Aryal Feb 04 '17 at 19:25
  • Can be done, but not likely to be useful. You want the resulting DOM, which exists only in the browser & is accessible to you only via Selenium. So, you need to use Selenium commands to query the browser (`el = driver.find_element_by_xpath()`, `el = driver.find_element_by_id('my_button')`), enter data (`el.send_keys('AdminUser')`), click buttons (`el.click()`), etc. – pbuck Feb 04 '17 at 19:41
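
For completeness, the cookie hand-off asked about in the comments would look roughly like this (a sketch; as pbuck notes, it works mechanically but won't help when the content only exists in the browser-built DOM):

import requests
from selenium import webdriver

driver = webdriver.Firefox()
# ... log in through the browser here ...

s = requests.Session()
for c in driver.get_cookies():  # Selenium returns a list of cookie dicts
    s.cookies.set(c["name"], c["value"], domain=c.get("domain", ""))

# s now carries the browser's session cookies, but pages whose content is
# assembled client-side by JavaScript will still come back without the data.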