19

I'd like to download web pages while supplying the URLs from stdin. Essentially, one process continuously produces URLs to stdout/a file, and I want to pipe them to wget or curl. (Think of it as a simple web crawler if you want.)

This seems to work fine:

tail 1.log | wget -i - -O - -q 

But when I use 'tail -f' it doesn't work anymore (buffering, or wget is waiting for EOF?):

tail -f 1.log | wget -i - -O - -q

Could anybody provide a solution using wget, curl or any other standard Unix tool? Ideally I don't want to restart wget in a loop; I just want to keep it running and downloading URLs as they come.

maximdim

4 Answers

12

What you need to use is xargs. E.g.

tail -f 1.log | xargs -n1 wget -O - -q
Kyle Jones
  • With `xargs` `wget` receives the URL as a parameter so you do not need `-i -` anymore. `tail -f 1.log | xargs -n1 wget -O - -q` – pabouk - Ukraine stay strong Aug 22 '13 at 14:57
  • this will start a new wget process per URL – Neil McGuigan Aug 31 '16 at 19:20
  • If this is running on a shared machine, you might like to know that any other user can read your parameters using the "ps" command, so don't put passwords etc in your URLs. Use one of the solutions that does not involve turning stdin into parameters if this might be a problem (admins with root access to the machine could of course still check which URLs you're fetching, but presumably you trust the admins more than you trust random other users). – Silas S. Brown Nov 29 '17 at 13:02
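
Regarding the comment above about one wget process per URL: if keeping a single long-running downloader matters (as the question asks), a rough alternative is a small script that reads URLs from stdin in one process. This is only a sketch, assuming Python 3 and the standard library, and is not part of the original answer; the pycurl answer further down takes the same approach with connection reuse.

#!/usr/bin/env python3
# Sketch: one long-lived process that fetches each URL as it arrives
# on stdin, instead of starting a new wget per URL.
# Standard library only; error handling is minimal.
import sys
import urllib.request

for line in sys.stdin:          # blocks until tail -f produces a new line
    url = line.strip()
    if not url:
        continue
    try:
        with urllib.request.urlopen(url) as resp:
            sys.stdout.buffer.write(resp.read())
    except Exception as e:
        print("failed to fetch %s: %s" % (url, e), file=sys.stderr)

It would be run the same way, e.g. tail -f 1.log | python3 fetch.py (fetch.py is just a placeholder name for the script).
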
1

Use xargs, which converts stdin into arguments.

tail 1.log | xargs -L 1 wget
Rajendran T
  • As I commented on the other answer: on a shared machine any other user can read your parameters using the "ps" command, so don't put passwords etc. in your URLs; prefer one of the solutions that does not turn stdin into parameters if that is a concern. – Silas S. Brown Nov 29 '17 at 13:03
0

Try piping the tail -f through:

python -c $'import pycurl;c=pycurl.Curl()\nwhile True: c.setopt(pycurl.URL,raw_input().strip()),c.perform()'

This gets curl (well, you probably meant the command-line curl and I'm calling it as a library from a Python one-liner, but it's still curl) to fetch each URL immediately, while still taking advantage of keeping the socket to the server open if you're requesting multiple URLs from the same server in sequence. It's not completely robust though: if one of your URLs is duff, the whole command will fail (you might want to make it a proper Python script and add try / except to handle this), and there's also the small detail that it will throw EOFError on EOF (but I'm assuming that's not important if you're using tail -f).
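
For reference, here is roughly what that "proper Python script" with try / except might look like. This is only a sketch (Python 3, same pycurl library); the structure and names are assumptions, not code from the original answer.

#!/usr/bin/env python3
# Sketch of the one-liner above as a script: read URLs from stdin,
# fetch each with the same pycurl handle (so connections can be
# reused), and skip over URLs that fail instead of aborting.
import sys
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.WRITEDATA, sys.stdout.buffer)   # write response bodies to stdout
for line in sys.stdin:                          # one URL per line, as they arrive
    url = line.strip()
    if not url:
        continue
    try:
        c.setopt(pycurl.URL, url)
        c.perform()
    except pycurl.error as e:
        print("failed to fetch %s: %s" % (url, e), file=sys.stderr)
c.close()

Run it as tail -f 1.log | python3 fetch.py, where fetch.py is just a placeholder name.
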

Silas S. Brown
0

An efficient way, if you are downloading files from the same web server, is to avoid xargs altogether:

wget -q -N -i - << EOF
http://sitename/dir1/file1
http://sitename/dir2/file2
http://sitename/dir3/file3
EOF
bo0k