I am having trouble downloading a file from a specific Korean URL. When I searched for how to download a file from a URL, the suggested solutions included urlretrieve, urlopen, and wget. However, every attempt saves a 0-byte PDF file and does not raise any error.

So I tried other programs such as Postman and J2downloader, and they saved pdf.do with 0 bytes. I know a .do file can be opened with Acrobat Reader, but the size tells me that the contents were not downloaded.

The URL of the file is http://dart.fss.or.kr/pdf/download/pdf.do?rcp_no=20210218000576&dcm_no=7808922. If I open it in a web browser, it downloads correctly.

Now I am not sure whether the problem is in my code or in how the site works. If it is the site's mechanism, could you tell me how to make the download work in Python?

Code that I tried:

    import requests
    import wget
    from urllib.request import urlopen, urlretrieve

    final_url = "http://dart.fss.or.kr/pdf/download/pdf.do?rcp_no=20210218000576&dcm_no=7808922"
1.
    urlretrieve(final_url, "./down2.pdf")
2.
    response = requests.get(final_url, allow_redirects=True)
    print(response.content)
    with open("down.pdf", "wb") as file:
        file.write(response.content)
3.
    mem = urlopen(final_url).read()
    # The with block closes the file, so no explicit close() is needed.
    with open("down.pdf", "wb") as file:
        file.write(mem)
4.
    wget.download(final_url, "my download folder")
bad_coder
attat
  • Do any of the alternatives in [this post](https://stackoverflow.com/questions/15035123/what-command-to-use-instead-of-urllib-request-urlretrieve) work? – Random Davis Feb 18 '21 at 19:06
  • No... I tried all of the alternatives in the post you provided, but each one only downloaded a 0-byte PDF file. @RandomDavis – attat Feb 18 '21 at 19:10
  • Did you try [this](https://stackoverflow.com/questions/44628699/how-to-download-a-file-from-a-url-which-redirects)? I think all the techniques you're trying only work for a non-redirecting direct file link, which yours might not be. – Random Davis Feb 18 '21 at 19:25
  • I also thought it might be a redirection issue, so I tried `allow_redirects=True`, but it did not work, and the URL stayed the same even with that parameter. – attat Feb 18 '21 at 19:31
  • I'd take a look at these and see if any of them help: https://stackoverflow.com/q/3988951 https://stackoverflow.com/q/25430219 Also when downloading the file, what's the actual download URL? It could be something predictable. – Random Davis Feb 18 '21 at 19:39
  • When I looked at the response headers, the result was `{'Date': 'Thu, 18 Feb 2021 20:00:07 GMT', 'Set-Cookie': 'WMONID=s8FRMj229G0; Expires=Sat, 19-Feb-2022 5:0:7 GMT; Path=/, PDFJSESSIONID=3BK384GDNwYyLXcM3xtag1IeqQX1Xy4tw17iCGOXVpaehxHcJlRK9Q29vdfVRjlo.ZG1fcGRmL2ZpbGVyM19wZGZfbXMx; Path=/pdf; HttpOnly', 'Connection': 'keep-alive', 'Content-Length': '0'}`. I don't see anything there indicating a PDF type, but the download in the browser gives a PDF file. The link that I posted is the actual download URL. When clicked, it automatically downloads the file. – attat Feb 18 '21 at 20:02
  • I was wondering whether the 'user-agent' might affect the result, so I copied the user-agent from Chrome and added it to the program, and it worked. I guess the website only wants browsers to access it. Thank you for the help @RandomDavis, it made me think of adding headers. – attat Feb 18 '21 at 20:13
  • Cool, you can self-answer this question with what worked if you want. – Random Davis Feb 18 '21 at 20:21
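
The header check described in the comments can be sketched as follows. This is a minimal illustration using only the standard library; `describe_response` and `looks_empty` are hypothetical helper names that do not appear in the thread, and the `opener` parameter exists only so the function can be exercised without network access.

```python
from urllib.request import Request, urlopen


def describe_response(url, headers=None, opener=urlopen):
    """Send a GET request and return (status, headers dict, body length)."""
    request = Request(url, headers=headers or {})
    with opener(request) as response:
        body = response.read()
        return response.status, dict(response.headers), len(body)


def looks_empty(response_headers, body_length):
    """Return True when the server sent no payload at all."""
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {key.lower(): value for key, value in response_headers.items()}
    return body_length == 0 or lowered.get("content-length") == "0"
```

Calling `describe_response(final_url)` and passing the result to `looks_empty` would flag the situation reported above, where the server answers without an error but with `Content-Length: 0`.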

1 Answer

The problem was that the website seems to allow downloads only from browsers. The solution was to copy the User-Agent header from a browser and send it with the request.

    import requests

    # final_url is the download URL from the question.
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'}
    response = requests.get(final_url, headers=headers)

    # The URL serves a PDF, so save it with a .pdf extension.
    with open("down.pdf", 'wb') as file:
        file.write(response.content)
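
For anyone who prefers the standard library, the same idea can be sketched with `urllib.request` instead of `requests`. This is an illustration under assumptions, not something I ran against the site in this form: `build_request` and `download` are hypothetical helper names, the `opener` parameter is only there to make the function testable offline, and the empty-body check is an addition so that a 0-byte response fails loudly instead of silently writing an empty file.

```python
from urllib.request import Request, urlopen

# The User-Agent string copied from Chrome in the answer above.
# Assumption: another current browser string would work as well.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Create a GET request that identifies itself as a browser."""
    return Request(url, headers={"User-Agent": user_agent})


def download(url, path, opener=urlopen):
    """Save the response body to path and return the number of bytes written."""
    with opener(build_request(url)) as response:
        data = response.read()
    if not data:
        # Mirrors the symptom in the question: no error, but nothing to save.
        raise ValueError("server returned an empty body")
    with open(path, "wb") as file:
        file.write(data)
    return len(data)
```

With the question's URL this would be called as `download(final_url, "down.pdf")`.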
attat