
I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/

It looks pretty straightforward, and before I did anything else, I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.

http://www.bvmjets.com/

This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.

Following the instructions, I got the XPath for one of the images.

/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img

The whole script looks like:

from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)

images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')

print(images)

But when I run this, the list is empty. I've looked at the XPath docs and I've tried various alterations to the XPath, but I get nothing each time.

Bob Wakefield

1 Answer


I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and I wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:

tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')

or

tree.xpath('//table//tr//td//div//img[@src]') 

or

tree.xpath('//img[@src]') # 68 images

The key to this is building up slowly. Find all the images, then find the images wrapped in the tag you are interested in, and so on, until you are confident you can find only the images you are interested in.
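That narrowing process can be sketched on a toy document (the markup below is made up for illustration, not taken from the real page) so you can see how each extra predicate shrinks the match set:

```python
from lxml import html

# Hypothetical markup standing in for the real page.
snippet = """
<html><body>
  <table><tr><td>
    <div><a target="_blank"><img src="/images/jet1.jpg"/></a></div>
    <img src="/images/logo.gif"/>
  </td></tr></table>
</body></html>
"""
tree = html.fromstring(snippet)

# Step 1: grab every image that has a src attribute.
all_imgs = tree.xpath('//img[@src]')
# Step 2: narrow to images wrapped in a link.
linked_imgs = tree.xpath('//a//img[@src]')
# Step 3: narrow further to links that open in a new tab.
cool_imgs = tree.xpath('//a[@target="_blank"]//img[@src]')

print(len(all_imgs), len(linked_imgs), len(cool_imgs))  # 2 1 1
```

Each query only adds one constraint over the previous one, so when a step suddenly returns nothing, you know exactly which condition eliminated your targets.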

Note that the [@src] predicate lets us access the source of each image. Using this post, we can download any/all images we want:

import shutil
from urllib.parse import urljoin

from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
cool_images = tree.xpath('//a[@target="_blank"]//img[@src]')
# urljoin handles relative src paths correctly, unlike plain string concatenation
source_url = urljoin(page.url, cool_images[5].attrib['src'])
path = 'cool_plane_image.jpg'  # path on disk

r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)

I would highly recommend looking at Beautiful Soup. For me, it has helped my amateur web scraping ventures. Have a look at this post for a relevant starting point.
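For comparison, here is roughly the same lookup with Beautiful Soup, again on made-up markup rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just to show the API shape.
snippet = ('<div><a target="_blank"><img src="/images/jet1.jpg"/></a>'
           '<img src="/images/logo.gif"/></div>')
soup = BeautifulSoup(snippet, 'html.parser')

# All image sources on the page.
all_srcs = [img['src'] for img in soup.find_all('img', src=True)]
# Only images inside links that open in a new tab.
cool_srcs = [img['src'] for a in soup.find_all('a', target='_blank')
             for img in a.find_all('img', src=True)]

print(all_srcs)   # ['/images/jet1.jpg', '/images/logo.gif']
print(cool_srcs)  # ['/images/jet1.jpg']
```

Beautiful Soup's keyword-argument filters (src=True, target='_blank') can feel friendlier than XPath predicates when you are still working out which tags actually wrap your images.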

This may not be the answer you are looking for, but hopefully it is a starting point / of some use to you - best of luck!

F. Elliot