
On my Linux terminal, I can simply run this command to download all PDFs from a website:

wget -A pdf -m -p -E -k -K -np http://site/path/

but I want to automate the process using Python on Windows, so I tried the script below. The loop runs and prints (i), but the wget command does not seem to run, because nothing is downloaded. The cell finishes in just 2 seconds. If wget were actually running and downloading all the content, it would take a lot longer.

import os
lst = ['www.falk-ross.eu']

for i in lst:
    print(i)
    os.system('wget -A pdf -m -p -E -k -K -np %s' % i)

Why does wget not seem to work?

tink
x89
  • Do you have `wget` on your Windows machine? – AKX Jul 09 '21 at 08:03
  • Also, you're saying "The cell runs" – sounds like you're using a notebook. Maybe try with a plain .py script you run from the Windows command line to begin with? – AKX Jul 09 '21 at 08:03
  • Why not use Python to download, rather than just spawning another process? That isn't very portable. I suggest using `urllib.request`. – theherk Jul 09 '21 at 09:13
  • @theherk All the working Python methods I have seen download PDFs from a single webpage, not from a whole website. – x89 Jul 09 '21 at 09:29

2 Answers


To directly answer your question, I see 2 ways here:

  • Run the Python script directly in the Linux terminal, where you know wget is correctly set up and it works
  • Install wget for Windows, for example from http://gnuwin32.sourceforge.net/packages/wget.htm, and either add the folder containing wget.exe to your PATH, or give the full path to wget.exe in the os.system call
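Before choosing between the two options, it may help to check whether Python can see wget at all. The following is a minimal sketch using the standard library's `shutil.which`; the helper name `locate_executable` is made up for this illustration.

```python
import shutil


def locate_executable(name="wget"):
    """Return the full path of an executable found on PATH, or None.

    shutil.which searches the directories listed in PATH (and honours
    PATHEXT on Windows, so "wget" also matches "wget.exe").
    """
    return shutil.which(name)


path = locate_executable("wget")
if path is None:
    print("wget was not found on PATH; install it or pass its full path")
else:
    print(f"wget found at {path}")
```

If this prints the "not found" message inside the notebook, the os.system call in the question has no wget to run, which would explain why the cell finishes almost immediately.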

I could be more precise if I knew which Linux terminal you are using (WSL? Cygwin? A Linux virtual machine? Each of those behaves differently). As a general rule, though, your Linux shell is probably not configured exactly like your Windows environment: they have different environment variables and usually don't even share the same executables.

All of this is just so you know the possible reasons why it might not be working.

However, I would suggest using a more Pythonic approach, like the ones described in "Download all pdf files from a website using Python".
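As a rough illustration of what such an approach can look like, here is a sketch that uses only the standard library (`html.parser` and `urllib`) rather than any code from the linked question. The names `PdfLinkParser`, `find_pdf_links` and `download_pdfs` are invented for this example, and it handles a single page only; it does not mirror a whole site the way `wget -m` does.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Look at the path only, so "file.pdf?x=1" still counts as a PDF.
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))


def find_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string, resolved against base_url."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


def download_pdfs(page_url, dest_dir="pdfs"):
    """Download every PDF linked from one page into dest_dir."""
    with urlopen(page_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    os.makedirs(dest_dir, exist_ok=True)
    for pdf_url in find_pdf_links(html, page_url):
        filename = os.path.basename(urlparse(pdf_url).path)
        with urlopen(pdf_url) as pdf, open(os.path.join(dest_dir, filename), "wb") as out:
            out.write(pdf.read())
```

Calling `download_pdfs("http://site/path/")` would fetch the PDFs linked from that one page. Covering a whole website would additionally require following the non-PDF links and keeping track of pages already visited.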

Calling another executable from your code is usually a bad habit, mainly because:

  • you can never be sure the configuration of the target system will allow it, or will handle it gracefully (the target executable may not exist, or may live at another path, or its execution may be forbidden, ...)
  • it makes it a lot harder to detect errors and to get predictable behavior in your program's flow
AleRinaldi

Even the documentation of the os module recommends using the subprocess module instead of os.system.

I had some issues spawning the necessary shell and getting the results back. The subprocess module fixed those problems.
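A minimal sketch of what that can look like for the command in the question is shown below. The helpers `build_wget_args` and `run_command` are names made up for this example; the wget flags are copied from the question, and `subprocess.run` with `capture_output` requires Python 3.7 or later.

```python
import subprocess
import sys


def build_wget_args(url):
    """Build the wget command from the question as an argument list (no shell)."""
    return ["wget", "-A", "pdf", "-m", "-p", "-E", "-k", "-K", "-np", url]


def run_command(args):
    """Run a command and return (exit_code, stdout, stderr).

    exit_code is None when the executable itself cannot be found, which is
    the situation os.system hides behind a silent non-zero return value.
    """
    try:
        completed = subprocess.run(args, capture_output=True, text=True)
    except FileNotFoundError:
        return None, "", f"executable not found: {args[0]}"
    return completed.returncode, completed.stdout, completed.stderr


# Harmless demonstration that runs the current Python interpreter, not wget.
code, out, err = run_command([sys.executable, "-c", "print('ok')"])
print(code, out.strip(), err)
```

For the real download, something like `run_command(build_wget_args("http://site/path/"))` would be used in the loop; an exit code of `None` or a non-empty `stderr` then shows why nothing was downloaded instead of failing silently.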

Additionally, you can directly use requests module.

Global