My Python 3 code:

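import sys

import requests

# URL of the PDF, passed as the first command-line argument
url = sys.argv[1]
r = requests.get(url, stream=True)
chunk_size = 20000
with open('metadata.pdf', 'wb') as fd:
    # Stream the response body to disk chunk by chunk
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)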

It saves the content to metadata.pdf, but the file does not contain the real PDF. Instead it contains this HTML page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html>
<!-- $HTMLid:   index.html /main/6 11-Jun-2004.13:54:09 $ -->
<head>
<title>Allied Waste</title>

<script language="JavaScript">
<!--
if (top != self) {
        top.location = self.location;
    }
function doRedirect() {
  document.login.submit();
} 

function init () {
    var initChar = /^\?/;
    var list = top.location.search.replace(initChar,"");
    var parms = list.split('&');
    for ( ct=0; ct < parms.length; ct++ ) {
        vals = parms[ct].split('=');
        switch ( vals[0] ) {
            case "unitCode":
                document.login.unitCode.value = unescape(vals[1]);
                if ( document.login.unitCode.value == 'undefined' || document.login.unitCode.value == '' )
                    document.login.unitCode.value = "ALW";
                break;
      default:
        document.login.unitCode.value = "ALW";
                break;
        }
    }
    document.login.submit();
}
//-->
</script>
</head>
<body onload="init()">
  <form name="login" action="inetSrv" method="post">
    <input type="hidden" name="type" value="SignonService"/>
    <input type="hidden" name="action" value="SignonPrompt"/>
    <input type="hidden" name="client" value="701122300"/>
    <input type="hidden" name="unitCode" value=""/>
  </form>
</body>
</html>

How can I save the real content of the file instead of this HTML? It should be the actual PDF, but what I download is just this HTML page.

UPDATE:

Answer from the server when I use Python sessions:

b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n\n                                                                                                              \n<head><title></title>\n                     \n<LINK REL="StyleSheet" HREF="styles/mainStyle.css">\n</head>\n\n<body>\n<div style="float: left; border: 1px solid black; background-color: #FFFFFF; padding: 5px">\n\t<div class="TitleFont">Operation failed</div>\n\t<div class="TitleFont">Reason</div>\n\t<div>\n\t<div class="custom-message-box">\n\t\t\t\t<div class="ErrorFont" ALIGN="left" >A server error has occurred.</div>\n\t\t\t\t<div class="ErrorFont" ALIGN="left" >Error reference id: DLY-00716</div>\n\t\t\t\t<div class="ErrorFont" ALIGN="left" >Time: Wed Jul 15 05:33:12 CDT 2020</div>\n\t</div>\n\t</div>\n\t<div style="width: 600px">\n\t\t<p class="form-style-text">\n\t\tIf contacting customer support, please quote the above error reference id. You may be able to press the browser Back button to return to the previous screen. Otherwise you may need to login again. We apologize for the inconvenience.\n\t\t</p>\n\t</div>\n</div>\n\n</body>\n</html>\n\n'
  • Please share the URL where the PDF file is located – mpx Jul 15 '20 at 09:41
  • https://secure3.billerweb.com/alw/inetSrv?sessionHandle=UnGuZSm86DqhEoYX1KSpSylhw1a-/D6ALu/L1mAIEAKBg8TZY2w9NzAxMTIyMzAwJlJlcXVlc3RUeXBlPVNob3dQZGY_&client=701122300&type=CompatPresentmentService&action=ShowPdf&firstPage=true –  Jul 15 '20 at 09:45

1 Answer

It looks like the page you received is a redirect to the login page. If you only need the file once, it may be simpler to download it manually.

Otherwise you will have to handle the login procedure in your script to obtain the authentication cookie it (probably) gives you, and then send that cookie along with the GET request so that the server returns the intended PDF.
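
As a rough sketch of that flow, the code below logs in and downloads the file through one shared session, so that any cookie set at login is sent again with the GET request. It also checks that the body begins with the usual `%PDF-` header before saving, which catches the case in the question where an HTML page was written to disk. The helper names (`save_pdf`, `login_and_download`) and the form field names `userId` and `password` are placeholders, not the site's real ones; the actual login fields have to be read from the site's login form.

```python
import sys

PDF_MAGIC = b"%PDF-"  # PDF files normally begin with these bytes


def save_pdf(session, pdf_url, out_path, chunk_size=20000):
    """Stream pdf_url to out_path, refusing to save a body that is not a PDF."""
    resp = session.get(pdf_url, stream=True)
    resp.raise_for_status()
    chunks = resp.iter_content(chunk_size)
    first = next(chunks, b"")
    if not first.startswith(PDF_MAGIC):
        # Most likely a login or error page, as in the question
        raise ValueError("response is not a PDF: %r" % first[:60])
    with open(out_path, "wb") as fd:
        fd.write(first)
        for chunk in chunks:
            fd.write(chunk)
    return out_path


def login_and_download(session, login_url, form_fields, pdf_url, out_path):
    """Log in and fetch the PDF with the same session so cookies are reused."""
    login = session.post(login_url, data=form_fields)
    login.raise_for_status()
    return save_pdf(session, pdf_url, out_path)


def main():
    import requests  # third-party; only needed for the real network calls

    # ASSUMPTION: hypothetical field names, replace with the real login form fields
    form_fields = {"userId": "YOUR_USER", "password": "YOUR_PASSWORD"}
    with requests.Session() as session:
        login_and_download(
            session,
            "https://secure3.billerweb.com/alw/inetSrv",  # endpoint taken from the question's URL
            form_fields,
            sys.argv[1],  # URL of the PDF, as in the question
            "metadata.pdf",
        )
```

Calling `main()` runs the whole flow. If `save_pdf` raises `ValueError`, the printed prefix of the body shows whether the server sent the login page or an error page instead of the document.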

Lenormju
  • I already have login details and I created the login procedure in the script. I have a cookie.txt file too, but I'm not sure how I need to pass it –  Jul 15 '20 at 09:38
  • Using a session (simple): https://stackoverflow.com/a/31571805/11384184 or using the cookie file: https://stackoverflow.com/a/31555440/11384184 – Lenormju Jul 15 '20 at 09:43
  • I already have cookie.txt, and when I do this: response = requests.request("GET", url, cookies='cookie.txt') print(response.text.encode('utf8')) it's not working –  Jul 15 '20 at 09:50
  • The cookies parameter value must be the cookie content (a dict of cookie names and values), not a file name, so you have to read the file and, depending on what it contains, maybe tweak it. That's why using a session is simpler. – Lenormju Jul 15 '20 at 09:52
  • I did it that way and I received a new answer, please look at the update –  Jul 15 '20 at 10:35
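
The point made in the comments, that `cookies` needs the cookie contents rather than the file name, could be sketched as follows. This assumes cookie.txt is in the Netscape/Mozilla format that curl and browser export tools write; if the file has another format it needs its own parser. The function names `load_cookie_file` and `cookies_as_dict` are made up for this illustration.

```python
from http.cookiejar import MozillaCookieJar


def load_cookie_file(path):
    """Read a Netscape-format cookie file into a cookie jar."""
    jar = MozillaCookieJar(path)
    # Session cookies are often stored with expiry 0, so keep them too
    jar.load(ignore_discard=True, ignore_expires=True)
    return jar


def cookies_as_dict(jar):
    """Flatten the jar into the {name: value} dict that `cookies=` accepts."""
    return {cookie.name: cookie.value for cookie in jar}
```

The result can then be passed as `requests.get(url, cookies=cookies_as_dict(load_cookie_file('cookie.txt')), stream=True)`. Passing the jar itself also works, since `requests` accepts a `CookieJar` for `cookies`. If the server still answers with an error page, the stored cookies may have expired on the server side and a fresh login is needed.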