
I would like to extract the image URLs from a page's HTML code using bash commands and then download all the images from that page. I am not sure whether it is possible, as sometimes they are stored in folders which I wouldn't have access to. But is it possible to download them from the source code?

I have written this so far:

wget -O plik.txt $1 
grep *.jpg plik.txt > wget
grep *.png plik.txt > wget
grep *.gif plik.txt > wget
rm plik.txt
julswion
  • You cannot parse HTML from a Bash script. To parse any markup language you need a specific parser (the same goes for HTML, XML, SGML, JSON, YAML, INI and even CSV). – Léa Gris Mar 20 '22 at 17:13
  • Try https://superuser.com/questions/1219455/how-to-download-all-images-from-a-website-using-wget or https://stackoverflow.com/questions/4602153/how-do-i-use-wget-to-download-all-images-into-a-single-folder-from-a-url – MichalH Mar 20 '22 at 17:14

1 Answer


Using lynx (a text web browser) in non-interactive mode, and GNU xargs:

#!/bin/bash

lynx -dump -listonly -image_links -nonumbers "$1" |
grep -Ei '\.(jpg|png|gif)$' |
tr '\n' '\000' |
xargs -0 -- wget --no-verbose --
  • This will start downloading the matching image URLs found in the web page given in $1, straight away.

  • It will include both images embedded in the page and images that are linked to. Removing -image_links will skip the embedded images.

  • You can add/remove whichever extensions you want to download, following the pattern I provided for .jpg, .png, and .gif. (grep -i is case insensitive).

  • The reason for using null delimiters (via tr) is so that xargs -0 can be used, which avoids problems with URLs that contain a single quote/apostrophe (').

  • The --no-verbose flag for wget just simplifies the log output. I find it easier to read if downloading a large list of files.

  • Note that regular GNU wget will handle duplicate filenames by appending a number (foo.jpg.1, etc.). However, busybox wget, for example, just exits if a filename already exists, abandoning further downloads.

  • You can also modify the xargs to just print a list of the files to be downloaded, so you can review it first: xargs -0 -- sh -c 'printf "%s\n" "$@"' _
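For example, a dry-run variant of the whole script could look like this sketch (it only prints the URLs that would be fetched; nothing is downloaded):

#!/bin/bash

# List the matching image URLs on the page given as $1, for review only
lynx -dump -listonly -image_links -nonumbers "$1" |
grep -Ei '\.(jpg|png|gif)$' |
tr '\n' '\000' |
xargs -0 -- sh -c 'printf "%s\n" "$@"' _

You could save it as, say, list-images.sh and run ./list-images.sh "https://example.com/page" (the script name and URL are just placeholders), check the output, and then switch the last line back to the wget form to actually download.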

dan
  • Thanks, this is exactly what I need. Just note that if you get paths with query parameters like `.../image.jpg?size=medium` you should remove the `$` from the `grep` portion – Sridhar Sarnobat Aug 17 '23 at 17:23
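As a rough sketch of that adjustment (the exact patterns below are my own guesses, not from the answer), the filter line could become one of:

# drop the trailing $ so URLs like .../image.jpg?size=medium still match
grep -Ei '\.(jpg|png|gif)'

# or keep some anchoring while allowing an optional query string
# (assumes GNU grep, which accepts $ inside an alternation)
grep -Ei '\.(jpg|png|gif)(\?|$)'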