
I'm trying to build a mechanism that will scan a website at a given URL and collect all of its images. Currently I'm using simple_html_dom, which is slow.

Scanning a website from localhost takes about 30 seconds to 1 minute.

What I need to do is:

  1. load a URL.
  2. scan for images (if possible, filtered by a specific size, e.g. a width threshold x).
  3. print them.

I'm looking for the fastest way to do this.

Wesley van Opdorp
  • This is a non-trivial task and you have pretty much the fastest way. Well, the fastest way available in PHP, in any case. – DaveRandom Jan 23 '12 at 12:31
  • file_get_contents and preg_match_all should do the trick (see the sketch after these comments) – Geert Jan 23 '12 at 12:39
  • parsing remote page for images: http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662 – Gordon Jan 23 '12 at 12:45
  • getting size of remote image: http://stackoverflow.com/questions/6272663/php-how-to-get-web-image-size-in-kb – Gordon Jan 23 '12 at 12:48
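
A minimal sketch of the approach the comments suggest (file_get_contents plus preg_match_all, with getimagesize for the dimensions). The URL, the width threshold, and the regex are illustrative assumptions, and relative image URLs would still need to be resolved against the page URL:

    <?php
    // Rough sketch only: $url and $minWidth are placeholder values.
    $url      = 'http://example.com/';    // page to scan
    $minWidth = 200;                       // width threshold in pixels

    // 1. Load the page (requires allow_url_fopen).
    $html = file_get_contents($url);

    // 2. Pull the src attribute out of every <img> tag with a regex.
    preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

    // 3. Check each image's dimensions and print the ones that are wide enough.
    //    Note: getimagesize() on a URL downloads the image in order to measure it.
    foreach ($matches[1] as $src) {
        $size = @getimagesize($src);
        if ($size !== false && $size[0] > $minWidth) {
            echo $src, ' (', $size[0], 'x', $size[1], ")\n";
        }
    }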

2 Answers


There is no fastest way. You cannot reduce network latency, and you cannot avoid fetching the image to detect its size. The rest of the operations are already a negligible part of the process.

Your Common Sense
  • After some research I think it is... What if you use JavaScript on the site you are scanning for images? – Krzysztof Aba Jan 23 '12 at 17:09
  • Based on the room for improvement in the larger problem domain, this answer is too straightforward, I think. – Elliott Apr 12 '13 at 00:07

The other answer is oversimplified because you can reduce the overall network traffic by sending HEAD requests to the server to get each image's size before downloading it, immediately saving you almost all of the bandwidth for images with size < x.
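
For illustration, a hedged sketch of that idea using cURL (CURLOPT_NOBODY turns the request into a HEAD). Note that this reports the Content-Length in bytes rather than the pixel width, and the helper name, URL, and threshold are assumptions, not code from the answer:

    <?php
    // Hypothetical helper: ask the server for an image's byte size via a HEAD request.
    function image_byte_size($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD: headers only, no body
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_exec($ch);
        // Content-Length from the response headers; -1 if the server omits it.
        $bytes = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
        curl_close($ch);
        return $bytes;
    }

    // Only fetch the full image when the reported size clears the threshold.
    if (image_byte_size('http://example.com/photo.jpg') > 10240) {
        // ... download the image here
    }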

Depending on the size of the pages involved, the choice of string operations used to extract the image URLs can matter as well. PHP is perfectly adequate for the needs it caters to, but it is still a moderately slow, interpreted language, and I find routines that move large substrings around appreciably laggy at times. For this task, fully parsing the page, even with a simple library, is overkill.
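
As an illustration of what I mean by cheap string operations, here is a rough strpos/substr scan over the raw HTML (held in a hypothetical $html variable) that only hands small tag-sized substrings to the regex engine; a sketch under those assumptions, not the exact code:

    <?php
    // Walk the raw HTML and collect src attributes without building a DOM.
    $srcs   = array();
    $offset = 0;
    while (($pos = stripos($html, '<img', $offset)) !== false) {
        $end = strpos($html, '>', $pos);          // end of this <img ...> tag
        if ($end === false) {
            break;                                // malformed tail; stop scanning
        }
        $tag = substr($html, $pos, $end - $pos);  // small substring: just the one tag
        if (preg_match('/src=["\']([^"\']+)["\']/i', $tag, $m)) {
            $srcs[] = $m[1];
        }
        $offset = $end;
    }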

The reason I would go to extreme lengths to download only the bare minimum of images is that some PHP methods for doing so are very slow. If I use copy() to download a file and then do the same thing using raw sockets or cURL, copy() sometimes takes at least twice as long.
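
A minimal sketch of the cURL alternative for the download itself (the URL and destination path are placeholders), streaming the response straight to disk instead of going through copy():

    <?php
    // Placeholder source URL and destination path.
    $src  = 'http://example.com/photo.jpg';
    $dest = fopen('/tmp/photo.jpg', 'wb');

    $ch = curl_init($src);
    curl_setopt($ch, CURLOPT_FILE, $dest);            // write the body straight to the file
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    curl_close($ch);
    fclose($dest);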

So both the choice of transfer method and the choice of parsing method have a noticeable effect.

Elliott