
From my Linux terminal, I can run a command to download all PDFs from a website:

wget -A pdf -m -p -E -k -K -np http://site/path/

However, I want to automate the process: for example, run the command for multiple URLs and then process the downloaded files later in Python or a Jupyter notebook. The wget library for Python is a different tool and does not accept the options/parameters that I can use with wget on my Linux machine. How can I achieve the same thing using Python?

Stefan
x89

2 Answers

2

You can use the `os` module, so it would look something like this:

import os
os.system('wget -A pdf -m -p -E -k -K -np http://site/path/')

With that, you are simply passing the command string to the system shell.

Stefan
  • what if I want to use it in a loop, for example use it for multiple urls in a list? – x89 Jul 07 '21 at 12:35
  • You call it once for each web site in the loop. **You** are building the string to be passed to `System`, so **you** decide what's inside the string. Just try it, and if it does not work, ask a new question by showing your failed attempt. – user1934428 Jul 07 '21 at 12:37
  • You can just loop through the list, something like this: https://gist.github.com/abodsakah/550ed2bae02e7f8c744b1062ef0b2620 – Abdulrahman Sakah Jul 07 '21 at 12:41
  • I tried this, but it does not download anything with wget as it is supposed to. Nothing happens. – x89 Jul 08 '21 at 08:28
  • You can look at this library maybe it is better https://pypi.org/project/wget/ – Abdulrahman Sakah Jul 08 '21 at 13:45
  • You should generally prefer `subprocess` over `os.system()`; the documentation for the latter recommends this, too. One of the benefits is that you can avoid an unnecessary shell; for details, see e.g. [Actual meaning of `shell=True` in `subprocess`](https://stackoverflow.com/questions/3172470/actual-meaning-of-shell-true-in-subprocess) – tripleee Jul 22 '21 at 10:09
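
Pulling the comments above together, here is a minimal sketch of running the same wget command once per URL with `subprocess` instead of `os.system()`. It assumes `wget` is installed and on the `PATH`. The names `build_wget_command` and `download_pdfs` are made up for illustration, and the `run` parameter exists only so that the runner can be swapped out.

```python
import subprocess

# Same flags as the wget command in the question, split into a list
# so that no shell is involved.
WGET_FLAGS = ["-A", "pdf", "-m", "-p", "-E", "-k", "-K", "-np"]


def build_wget_command(url):
    """Return the argument list for one wget invocation (hypothetical helper)."""
    return ["wget", *WGET_FLAGS, url]


def download_pdfs(urls, run=subprocess.run):
    """Run wget once per URL and return a dict mapping each URL to wget's exit code."""
    results = {}
    for url in urls:
        # check=False: a failing URL does not raise, so the loop keeps going.
        completed = run(build_wget_command(url), check=False)
        results[url] = completed.returncode
    return results
```

Calling `download_pdfs(["http://site/path/", "https://example.com/another"])` should then start one wget process per URL. A non-zero exit code in the returned dict points to a URL that wget could not fetch cleanly, which may help in a "nothing happens" situation.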
0

You don't need Python for that.

#!/bin/bash
for url in "http://site/path/" "https://example.com/another"
do
    wget -A pdf -m -p -E -k -K -np "$url"
done
tripleee
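
Whichever way the downloads are started, the question also mentions processing the files later in Python or a notebook. As a rough sketch, assuming wget was run from the current directory (with `-m` it normally saves files under a directory named after the host), the downloaded PDFs could be collected like this. The function name `find_pdfs` is hypothetical.

```python
from pathlib import Path


def find_pdfs(download_root="."):
    """Return a sorted list of all PDF files found below download_root."""
    root = Path(download_root)
    # Match on the suffix case-insensitively so that "REPORT.PDF" is included too.
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

The returned `Path` objects can then be handed to whatever PDF-processing code the notebook uses.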