
I am downloading files with this script:

parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'

Would it be possible not to download the files, but just check them on the remote side and, if a file exists, create a dummy file instead of downloading it?

Something like:

if wget --spider "$url" 2>/dev/null; then
  #touch img.file
fi

should work, but I don't know how to combine this code with GNU Parallel.

Edit:

Based on Ole's answer I wrote this piece of code:

#!/bin/bash
do_url() {
  url="$1"
  wget -q -nc  --method HEAD "$url" && touch ./images/${url##*/}   
  #get filename from $url
  url2=${url##*/}
  wget -q -nc  --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url

parallel --progress -a urls.txt do_url {}

It works, but it fails for some files. I cannot find any consistency in why it works for some files and fails for others; maybe it has something to do with the last filename. The second wget accesses the correct URL, but the touch command after it simply does not create the desired files. The first wget always works (the dummy file for the main image, the one without _001.jpg, _002.jpg, is correctly created).

Example urls.txt:

http://host.com/092401.jpg (works correctly; dummy files _001.jpg .. _005.jpg are created)
http://host.com/HT11019.jpg (does not work; only the dummy for the main image is created)

Adrian
  • Use the `--method HEAD` to send a `HEAD` request instead of a `GET` request. – chepner Feb 04 '18 at 14:57
  • Possible duplicate of https://stackoverflow.com/questions/12199059/how-to-check-if-an-url-exists-with-the-shell-and-probably-curl – iamauser Feb 06 '18 at 17:44
  • @iamauser Are you serious? Where in that question is there a single word about checking a sequence of files on the remote side? – Adrian Feb 06 '18 at 17:46
  • Yes, I am. I think your question should rather be how to loop over a sequence of files, because that's the input to each call by `wget/curl`. – iamauser Feb 06 '18 at 18:02
  • It is not nice to completely change your question after a few answers have been provided. This makes most of the answers provided here look wrong; however, the real problem is that you changed the question after they were written. – darnir Feb 09 '18 at 10:40

5 Answers


You may use curl instead to check whether the URLs you are parsing exist, without downloading any files:

if curl --head --fail --silent "$url" >/dev/null; then
    touch ./images/"${url##*/}"
fi

Explanation:

  • --fail will make the exit status nonzero on a failed request.
  • --head will avoid downloading the file contents.
  • --silent will keep the check itself from emitting status messages or errors.

To solve the "looping" issue, you can do:

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if curl --head --silent --fail "$url" > /dev/null; then
        touch ./images/"${url##*/}"
    fi
done
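
To wire this into GNU Parallel as in your original command, one possible approach (a minimal sketch; the function name check_url is chosen here purely for illustration) is to export the check as a function:

check_url() {
  url="$1"
  # HEAD request only; create an empty local dummy file if the URL exists
  if curl --head --fail --silent "$url" >/dev/null; then
    touch ./images/"${url##*/}"
  fi
}
export -f check_url

parallel --progress -j16 check_url :::: ./temp/img-url.txt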
AnythingIsFine

It is pretty hard to understand what it is you really want to accomplish. Let me try to rephrase your question.

I have urls.txt containing:

http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg

On example.com these URLs exist:

http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg

On example.org these URLs exist:

http://example.org/dira/foo_001.jpg

Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:

http://example.com/dira/foo.jpg

becomes:

http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg

Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.

If the URL exists I want an empty file created.

(Version 1): I want the empty file created in a similar directory structure under the dir images. This is needed because some of the images have the same name, but are in different dirs.

So the files created should be:

images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg

(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.

So the files created should be:

images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg

(Version 3): I want the empty file created in the dir images, named after the entry in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.

images/foo.jpg
images/bar.jpg
images/baz.jpg

#!/bin/bash

do_url() {
  url="$1"

  # Version 1:
  # If you want to keep the folder structure from the server (similar to wget -m):
  wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"

  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$3"

  # Version 3:
  # If all the images have unique names when _###.jpg is removed and you want all images in a single dir
  wget -q --method HEAD "$url" && touch images/"$4"

}
export -f do_url

parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
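
For reference, this is roughly how the replacement strings expand for one input combination, assuming the line http://example.com/dira/foo.jpg from urls.txt is paired with _003.jpg:

# {1.}{2}  -> http://example.com/dira/foo_003.jpg   (URL to test, becomes $1)
# {1//}    -> http://example.com/dira               (directory part, becomes $2)
# {1/.}{2} -> foo_003.jpg                           (basename plus suffix, becomes $3)
# {1/}     -> foo.jpg                               (basename as listed in urls.txt, becomes $4)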

GNU Parallel takes a few ms per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores are running at 100% you can run more jobs in parallel:

parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg

You can also "unroll" the loop. This will save 5 overheads per URL:

do_url() {
  url="$1"
  base="${url##*/}"   # basename, used for the local dummy file in images/
  # Version 2:
  # If all the images have unique names and you want all images in a single dir
  wget -q --method HEAD "$url".jpg && touch images/"$base".jpg
  wget -q --method HEAD "$url"_001.jpg && touch images/"$base"_001.jpg
  wget -q --method HEAD "$url"_002.jpg && touch images/"$base"_002.jpg
  wget -q --method HEAD "$url"_003.jpg && touch images/"$base"_003.jpg
  wget -q --method HEAD "$url"_004.jpg && touch images/"$base"_004.jpg
  wget -q --method HEAD "$url"_005.jpg && touch images/"$base"_005.jpg
}
export -f do_url

parallel -j0 do_url {.} :::: urls.txt

Finally you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
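
The rough shape of that workaround (a sketch; see the linked manual section for the exact recipe) is to let an outer parallel split the input between several inner parallel instances, each of which runs its own set of jobs:

# strip the .jpg extension, then distribute the lines round-robin to 10 inner parallels
sed 's/\.jpg$//' urls.txt | parallel -j10 --pipe --round-robin parallel -j0 do_url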

Ole Tange
  • Is it not possible to save all images into the images/ directory? I have long URLs and this script creates a weird folder structure. – Adrian Feb 10 '18 at 13:46
  • Added `images`. – Ole Tange Feb 11 '18 at 08:26
  • I needed "Version 2". It works fine, thank you. I made a little benchmark and I am disappointed by the speed: it is much slower than downloading the files. If you are interested, here is the result: https://pastebin.ca/3971248. What do you think, where is the bottleneck? – Adrian Feb 11 '18 at 14:31
  • With 250 jobs (-j0) the running time is now halved, but unfortunately it is still slower compared to wget --no-clobber (do not download if the file exists). But it is a great answer and I will definitely use it in the future. Something is weird with the latest example: $ ls images/ shows _001.jpg _002.jpg _003.jpg _004.jpg _005.jpg. – Adrian Feb 11 '18 at 19:17

From what I can see, your question isn't really about how to use wget to test for the existence of a file, but rather about how to perform correct looping in a shell script.

Here is a simple solution for that:

urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
    if wget -q --method=HEAD "$url"; then
        touch ./images/"${url##*/}"
    fi
done

What this does is invoke Wget with the --method=HEAD option. With a HEAD request, the server will simply report back whether the file exists or not, without returning any data.

Of course, with a large data set this is pretty inefficient. You're creating a new connection to the server for every file you're testing. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2, you can test all of these in parallel, and use the new --stats-site option to get a list of all the files and the specific return code that the server provided. For example:

$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}                                                             
Site Statistics:

  http://example.com:
    Status    No. of docs
       404              3
         http://example.com/3  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
         http://example.com/1  0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
         http://example.com/2  0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
       200              1
         http://example.com/  0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)

You can even get this data printed as CSV or JSON for easier parsing.
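
For example, something along these lines should work (a sketch, assuming the csv: output prefix of --stats-site; adjust to your wget2 version):

wget2 --spider --progress=none -q --stats-site=csv:stats.csv example.com/{,1,2,3}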

darnir
  • Finally I was able to compile Wget2. For a quick test I ran: wget2 --spider --progress=none --stats-site=csv:stat.csv ${url%.jpg}_{001..005}.jpg. It queries the URLs fine (example.com/hello_001.jpg, etc.), but stat.csv contains only one entry, the last query. I think for the main image (example.com/hello.jpg) I still need to run Wget2 one more time. – Adrian Feb 10 '18 at 23:14
  • I am wondering whether Wget2 will be faster than Wget & Parallel. Currently Wget & Parallel & TouchDummyFile is slower than Wget & Parallel & DownloadFiles. Benchmark results are under @OleTange's answer. – Adrian Feb 11 '18 at 14:37
  • It is possible that the parallel+touch is slower than just downloading the files if the images are very small (~5kB). This is because you still need to make a new connection to the server for each file you're testing and then start a new process. This is sometimes slower than just downloading said file. Wget2 should indeed be faster in this case since it needs to establish the connection exactly once. – darnir Feb 11 '18 at 20:16
  • The issue with the stats that you see is a bug. I'll make a report and it should be fixed within a day or two. Till then, if you don't use json or csv, you can still see the full stats. – darnir Feb 11 '18 at 20:20
  • Thank you, I'll report back after the bug is fixed. – Adrian Feb 12 '18 at 06:35

Just loop over the names?

for uname in "${url%.jpg}"_{001..005}.jpg
do
  if wget --spider "$uname" 2>/dev/null; then
    touch ./images/"${uname##*/}"
  fi
done
  • I asked this question because I don't want to download any files, just check on the remote side and create a local dummy file (with the same name) if it exists. – Adrian Feb 04 '18 at 14:28

You could send a command via ssh to see if the remote file exists and cat it if it does:

ssh your_host 'test -e "somefile" && cat "somefile"' > somefile

You could also try scp, which supports glob expressions and recursion.
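
For example (a sketch; the remote path is hypothetical, and note that this transfers the matching files rather than creating empty dummies):

scp 'your_host:/path/to/somefile_*.jpg' ./images/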

Cole Tierney