I'm new to Python. I want to scrape my modem's web interface and read its DOM so that I can collect some status values, but I don't know how it's done. Is it possible to scrape this local device through its IP address, 192.168.1.1?

Also, when I open that IP address in a browser, it shows a login prompt, and I don't know how to fill it in with Scrapy.

This is what I've written, but it's not working: the res.html file gets created, but it's empty.

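import scrapy


class ScrapperSpider(scrapy.Spider):
    name = "scrapper"
    start_urls = ["http://192.168.1.1/"]
    # Let the 401 challenge reach parse() instead of being dropped.
    handle_httpstatus_list = [401]

    # Base64 of "admin:admin", copied from the browser's request headers.
    auth = "Basic YWRtaW46YWRtaW4="

    def parse(self, response):
        # Repeat the request, this time with the Authorization header.
        return scrapy.Request(
            "http://192.168.1.1/",
            headers={"Authorization": self.auth},
            callback=self.after_login,
        )

    def after_login(self, response):
        # getall() returns a list of strings, so join it before writing;
        # passing the list itself to a file opened in 'wb' mode raises TypeError.
        cells = response.xpath('//*[@id="box_header"]/tbody/tr[1]/td').getall()
        with open("res.html", "w", encoding="utf-8") as f:
            f.write("\n".join(cells))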

I got the response content with response.text and here it is:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<META HTTP-EQUIV="pragma" CONTENT="no-cache">
<title>ADSL Router</title>
<script language="javascript" src="util.js"></script>
<script>
function closeWindow(){
var currBrowser;
currBrowser = GetBrowserOS();
switch(currBrowser)
{
case "msiewin":
case "msiemac":
case "netslin":
window.opener = self;
window.close();
break;
case "netswin":
case "firelin":
case "firewin":
case "firemac":
window.open('','_parent','');
window.close();
break;
default:
window.opener = self;
window.close();
break;
}
}
function op() {}
</script>
</head>
<blockquote>
<frameset rows="0,*" frameborder="0" framespacing="0">
<frame name="fPanel" src="" scrolling="auto" marginwidth="0" marginheight="0">
<frame name="main" src="internet.htm">
<noframes>
<body bgcolor="#008080">
<p>This page uses frames, but your browser doesn't support them.</p>
</body>
</noframes>
</frameset>
</blockquote>
</html>
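
The response above is only a frameset, so the visible content lives in the documents named by the frame src attributes. A minimal sketch of listing those frame URLs, assuming only the Python standard library (the names FrameCollector and frame_urls are hypothetical, not part of Scrapy):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collect the non-empty src attribute of every <frame> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:  # skip frames declared with src=""
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return absolute URLs for all frames found in the HTML text."""
    collector = FrameCollector()
    collector.feed(html)
    return [urljoin(base_url, src) for src in collector.sources]


# Frameset taken from the response shown above, shortened to the frame tags.
sample = (
    '<frameset rows="0,*">'
    '<frame name="fPanel" src="">'
    '<frame name="main" src="internet.htm">'
    "</frameset>"
)
print(frame_urls(sample, "http://192.168.1.1/"))
# prints ['http://192.168.1.1/internet.htm']
```

Each URL it returns can then be requested with the same Authorization header as the root page.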

I don't know how to tell whether I passed the authorization, or even whether I'm sending the right request. I inspected the network tab while logging in, but none of the entries was a POST request; the only part that seemed related to logging in was Authorization: Basic YWRtaW46YWRtaW4= in the request headers. I think I must be logged in, though, because the response has this content.
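
One way to check this, as a rough sketch: decode the Basic header to confirm which credentials it carries, and look at the HTTP status code (in Scrapy it is available as response.status), since 401 is what a server answers when Basic credentials are missing or rejected. The helper names decode_basic_auth and auth_accepted are hypothetical.

```python
import base64


def decode_basic_auth(header_value):
    """Split an 'Authorization: Basic ...' value into (username, password)."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


def auth_accepted(status_code):
    """Treat anything other than 401 Unauthorized as 'credentials accepted'."""
    return status_code != 401


print(decode_basic_auth("Basic YWRtaW46YWRtaW4="))
# prints ('admin', 'admin')
```

With handle_httpstatus_list = [401] set, a rejected login would still reach the callback, so checking response.status there separates the two cases.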

By the way, I used the code from this question: Scrapy to bypass an alert message with form authentication

EDIT: Never mind, I think it actually logs in: I inspected the response to a request for http://192.168.1.1/internet.htm and it has the content of the modem's first page. Now I need to work out how to switch to the other tabs.

EDIT: There is no need to switch tabs. I hovered the mouse over the link to the page I needed and saw that it's located at http://192.168.1.1/adslconfig.htm. I sent a request there and got everything I needed in response.text.
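
For reference, a sketch of that direct request, written with the standard library's urllib instead of Scrapy so that it has no dependencies. The helper names build_request and fetch_page and the 5-second timeout are assumptions for illustration; the URL and the admin:admin credentials are the ones from this question.

```python
import base64
import urllib.request

MODEM_URL = "http://192.168.1.1/adslconfig.htm"  # status page named above


def build_request(url, username, password):
    """Return a Request that carries an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def fetch_page(url, username, password, timeout=5):
    """Fetch the page and return (status_code, raw_bytes).

    urlopen raises urllib.error.HTTPError on a 401, so a normal return
    means the modem accepted the credentials.
    """
    request = build_request(url, username, password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()
```

Calling fetch_page(MODEM_URL, "admin", "admin") from a machine on the modem's network would return the raw page bytes. The root page declares charset=gb2312, so the body may need decoding with that encoding, assuming this page uses the same one.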

Done!

bzmind
  • 1
    "*is it possible to web scrape this local device through its IP address, 192.168.1.1?*" I can't imagine why it *wouldn't* be, can you elaborate on why you'd believe otherwise or point to an authoritative source which states that this *shouldn't* be possible? Have you inspected the contents of the `response` to ensure that the `td` element you're after is present there, and that it's not being loaded in via JavaScript? If so, have you followed the [Scrapy documentation on how to extract dynamic content](https://docs.scrapy.org/en/latest/topics/dynamic-content.html)? – esqew Sep 27 '21 at 13:20
  • 1
    This looks like a job for `selenium` – Chris Sep 27 '21 at 13:28
  • If res.html is being created and it's empty then it's obvious that the *extract()* function has returned nothing from your xpath query. That's where you need to be looking. –  Sep 27 '21 at 13:33
  • @esqew I thought that it might be different because it's local, idk, my concern is that I'm not sure if I passed the authorization, and I don't know how can I be sure that I passed the authorization, how can I inspect the response? – bzmind Sep 27 '21 at 17:19
  • *print()* the *response* variable –  Sep 27 '21 at 17:43
  • Yeah I just wrote response.text instead of response.body() or body().text, and there is just some javascript code and a little bit of HTML, I'll add the response content to my question – bzmind Sep 27 '21 at 17:49
  • Now that you've printed the response, it's clear what your problem is –  Sep 27 '21 at 18:05
  • What HTTP status code is being returned on the response? It seems you’ve successfully authenticated using Basic Auth, but, as I and other commenters suspected, the page that’s returned once you’ve authenticated is painted entirely using JavaScript. This type of page cannot be interpreted or consumed with your current stack. – esqew Sep 27 '21 at 18:07
  • @esqew I just sent a request to the page that I needed, at http://192.168.1.1/adslconfig.htm and I got everything that I needed in its response, Thanks for your help – bzmind Sep 27 '21 at 18:13
  • I know nothing about *scrapy* but I do know a little about *selenium* and I suspect that you'll find the latter of these 2 modules most helpful in this case –  Sep 27 '21 at 18:13
  • @BrutusForcus Honestly, I don't even know which one is better, I just want the most capable option, it's not important if it's not user friendly etc... – bzmind Sep 27 '21 at 18:15
  • Unfortunately, no one here can provide a definitive answer as we don't have access to your router. We do know its username and password though –  Sep 27 '21 at 18:35
  • @BrutusForcus No I added to my question that I sent a request to the page that I needed, and I got the status that I needed, it's done, thanks – bzmind Sep 27 '21 at 18:38

0 Answers