
I am trying to download all of the images from this page, http://www.samsung.com/sg/consumer/mobile-devices/smartphones/, using the command below:

wget -e robots=off -nd -np --recursive -p --level=5 --accept jpg,jpeg,png,gif --convert-links -N --limit-rate=200k --wait 1.0 -U 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:14.0) Gecko/20100101 Firefox/14.0.1' -P testing_folder  www.samsung.com/sg/consumer/mobile-devices/smartphones

I would expect to see the images of the phones downloaded to my testing_folder. But all I see are some global images, like the logo. I don't seem to be able to get the phone images downloaded. The command above seems to work on some other websites, though.

I have gone through all the wget questions on this forum, but this particular issue doesn't seem to have an answer. Can someone help? I am sure there is an easy way out. What am I doing wrong?

UPDATE: It looks like it is an issue with the pages possibly being rendered by JavaScript, and hence it seems like the end of the road, since apparently wget can't handle JavaScript-rendered pages well. If anyone can still help, I will be delighted.

Vinu D
  • Looks like those images don't have any extensions like jpg, jpeg etc. Inspecting the page doesn't show direct links to those images, that's probably why your script isn't working. – ronakg Jun 17 '15 at 17:54
    I haven't looked at the page, but it's entirely possible that the images are populated by javascript, which means that the page when fetched with `wget` would not contain those `img` links. Fetch the page with `wget` and examine the HTML source. – larsks Jun 17 '15 at 17:55
  • ronakg, thanks. If I change the path to the one below, there is definitely an image there which I would like to scrape: http://www.samsung.com/sg/consumer/mobile-devices/smartphones/galaxy-s/SM-G920IZDAXSP However, this too doesn't seem to work – Vinu D Jun 17 '15 at 17:57
  • @larsks, thanks. If this is the case, what is the way around it? How do I scrape images from websites such as these? I have many such examples where the command above doesn't work. – Vinu D Jun 17 '15 at 18:10
    [This page](http://stackoverflow.com/questions/5793414/mechanize-and-javascript) has some useful discussion on the topic, but the tl;dr is "it's complicated". – larsks Jun 17 '15 at 18:46
  • The above website seems to be populating the images through Javascript. Wget's HTML parser has absolutely no support for parsing javascript and the developers have categorically stated multiple times that they do not intend to add support for it either. I don't know of any scraper which parses JS too. – darnir Jun 17 '15 at 19:14
    You'll probably have to use something more beefy like PhantomJS (headless webkit based browser that is scriptable) to pull down images that are populated via JS. – JNevill Jun 17 '15 at 19:25

1 Answer


Steps:

  1. configure a proxy server, for example Apache httpd with mod_proxy and mod_proxy_http

  2. visit the page with a web browser that supports JavaScript and is configured to use your proxy server

  3. harvest the URLs from the proxy server log file and put them in a file

Or:

  1. Start Firefox and open web page

  2. F10 - Tools - Page Info - Media - right click - select all - right click - copy

  3. Paste into file with your favourite editor

Then:

  4. optionally (if you don't want to find out how to get wget to read a list of URLs from a file), add minimal HTML tags (html, body and img) to the file

  5. use wget to download the images, specifying the file created in step 3 or 4 as the starting point
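Steps 4 and 5 can be sketched in shell. This is a minimal sketch under the assumption that the harvested URLs are in a file named urls.txt (a hypothetical name; the example URLs are placeholders); wget's `-i` option can read a URL list directly, so the HTML wrapper is only one of the two routes:

```shell
# Hypothetical input: urls.txt, one harvested image URL per line
# (the file produced in step 3 above).
printf 'http://example.com/a.jpg\nhttp://example.com/b.png\n' > urls.txt

# Option A: let wget read the list directly (no HTML wrapper needed):
#   wget -i urls.txt -P testing_folder --wait=1 --limit-rate=200k

# Option B: wrap the list in minimal HTML tags, as step 4 suggests:
{
  echo '<html><body>'
  sed 's|.*|<img src="&">|' urls.txt   # wrap each URL in an img tag
  echo '</body></html>'
} > images.html
#   then: wget --force-html -i images.html -P testing_folder
```

`--force-html` tells wget to treat the local input file as HTML and follow the `img` links it contains.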

Joachim Wagner
  • @Joachim, thanks, but steps 3, 4 and 5 are what I am capable of doing myself; points 1 and 2 are beyond my abilities since I am a beginner. – Vinu D Jun 18 '15 at 11:01
    How about these alternative steps? Do they capture all images? Looks ok to me but maybe not all images are loaded at this stage. – Joachim Wagner Jun 18 '15 at 11:35
  • Thanks for the alternative steps. I did exactly that with www.roca.in, but all I ended up getting were extra images, not the ones I require. Appreciate the effort. – Vinu D Jun 18 '15 at 12:13