I am trying to write Python code that searches a page's HTML for an image link. From the matching image tag I need to extract the http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg part, regardless of what the link actually says. Is there any way to do this, or should I look into a different method? I have access to the standard Python modules and BeautifulSoup.

  • So you need to find exactly that image on the web page (in the HTML)? No matter what the URL of the image will be? – 4d4c Jan 22 '14 at 13:34
  • Yeah, pretty much. Sorry if the wording is a bit weird. – user3223643 Jan 22 '14 at 13:36
  • To compare images from a web page you can download them and use [compare](http://www.imagemagick.org/script/compare.php). Or check this [question](http://stackoverflow.com/questions/1927660/compare-two-images-the-python-linux-way) – 4d4c Jan 22 '14 at 13:46
  • I'm not looking to download images and compare them; I don't have an image to compare with. I just need a way for Python to find the URL for me, and then I can use another program I've written to download the image. Thanks for the reply though :) – user3223643 Jan 22 '14 at 13:50

3 Answers

You can try using lxml (http://lxml.de/) and XPath (http://en.wikipedia.org/wiki/XPath).

For example, to find the images inside the HTML you can do this:

import lxml.html
import requests

html = requests.get('http://www.google.com/').text
doc = lxml.html.document_fromstring(html)
images = doc.xpath('//img')  # select every <img> element; narrow the XPath to target your image
if images:
    print(images[0].get('src'))  # src attribute of the first <img>
else:
    print("Images not found")
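A narrower XPath can pick out the wanted image directly instead of taking the first `<img>`. The following is only a sketch (Python 3), assuming the wanted image is the one whose `src` contains the `large_default` path segment seen in the question's URL; the sample HTML and the helper name `find_image_urls` are made up for illustration.

```python
import lxml.html

# Hypothetical page fragment standing in for the downloaded HTML.
SAMPLE_HTML = """
<html><body>
  <img src="/images/dhl_logo.png" alt="logo">
  <a href="/product">
    <img src="http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg">
  </a>
</body></html>
"""


def find_image_urls(html_text, marker):
    """Return the src of every <img> whose src contains `marker`."""
    doc = lxml.html.document_fromstring(html_text)
    # contains() is a plain substring test; $marker is an XPath variable
    # that lxml fills in from the keyword argument of the same name.
    return doc.xpath('//img[contains(@src, $marker)]/@src', marker=marker)


# "large_default" is an assumption about how the site names product images.
urls = find_image_urls(SAMPLE_HTML, "large_default")
print(urls)
```

Passing the marker as an XPath variable rather than formatting it into the expression avoids quoting problems if the marker ever contains quote characters.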

I hope this helps.

UPDATE: I fixed the else, which was missing its ":".

Francisco Lavin

The Beautiful Soup documentation has a nice "Quick Start" section: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen

url = "http://www.darlighting.co.uk/"
html = urlopen(url).read()
soup = Soup(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning

# find image tag with specific source
the_image_tag = soup.find("img", src='/images/dhl_logo.png')
print(type(the_image_tag), the_image_tag)
# >>> <class 'bs4.element.Tag'> <img src="/images/dhl_logo.png"/>

# find all image tags
img_tags = soup.find_all("img")
for img_tag in img_tags:
    print(img_tag.get('src'))  # .get() gives None instead of a KeyError when src is missing
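Since the exact URL is not known in advance, `find_all` can also be given a compiled regular expression for `src` instead of a fixed string. Here is a minimal sketch (Python 3), assuming the wanted images can be recognised by a `.jpg`/`.jpeg` extension; the sample HTML, `BASE_URL`, and the helper name `image_urls` are illustrative and not taken from the real page.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.darlighting.co.uk/"

# Hypothetical page fragment standing in for the downloaded HTML.
SAMPLE_HTML = """
<html><body>
  <img src="/images/dhl_logo.png"/>
  <img alt="image tag without a src attribute"/>
  <img src="/621-large_default/empire-double-wall-bracket-polished-chrome.jpg"/>
</body></html>
"""


def image_urls(html_text, base_url, pattern):
    """Return absolute URLs of <img> tags whose src matches `pattern`."""
    soup = BeautifulSoup(html_text, "html.parser")
    # A compiled regex as the attribute filter is applied with search();
    # tags that have no src attribute never match.
    tags = soup.find_all("img", src=pattern)
    # urljoin turns relative paths into absolute URLs for a downloader.
    return [urljoin(base_url, tag["src"]) for tag in tags]


# Assumption: the images of interest end in .jpg or .jpeg.
jpg_pattern = re.compile(r"\.jpe?g$", re.IGNORECASE)
found = image_urls(SAMPLE_HTML, BASE_URL, jpg_pattern)
print(found)
```

The resulting list of absolute URLs can be handed to a separate download program.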
iljau
import http.client
from lxml import html

# CONNECTION
url = "www.darlighting.co.uk"
path = "/"
conn = http.client.HTTPConnection(url)
conn.putrequest("GET", path)
# YOUR HEADERS GO HERE
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Cache-Control": "no-cache"}
for k, v in header.items():
    conn.putheader(k, v)
conn.endheaders()
res = conn.getresponse()

if res.status == 200:
    source = res.read()
else:
    # nothing to parse, so report the failure and stop here
    print(res.status)
    print(res.getheaders())
    raise SystemExit(1)

# EXTRACT
dochtml = html.fromstring(source)
for elem, att, link, pos in dochtml.iterlinks():
    if att == 'src':  # or 'href'
        print('elem: {0} || pos {1}: || attr: {2} || link: {3}'.format(elem, pos, att, link))
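If lxml is not available, the extraction step can be approximated with the standard library alone. This is a rough sketch (Python 3) built on `html.parser.HTMLParser`; the class name `ImageSrcCollector` and the sample HTML are hypothetical, and unlike `iterlinks()` it only looks at `<img>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical page fragment standing in for the downloaded HTML.
SAMPLE_HTML = """
<html><head><script src="/js/app.js"></script></head><body>
  <img src="/images/dhl_logo.png"/>
  <img src="http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg">
</body></html>
"""


class ImageSrcCollector(HTMLParser):
    """Collect the absolute URL of every <img src=...> fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page address.
            self.urls.append(urljoin(self.base_url, src))


collector = ImageSrcCollector("http://www.darlighting.co.uk/")
collector.feed(SAMPLE_HTML)
for image_url in collector.urls:
    print(image_url)
```

The `<script src=...>` entry is skipped because only `img` tags are inspected; widening the tag check would bring the behaviour closer to filtering `iterlinks()` on `src`.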
lmokto