How to search HTML for a link and print the link using Python?

Question

I am trying to write Python code that searches a page's HTML for an image link. The code I need to find is - . I need to extract the http://www.darlighting.co.uk/621-large_default/empire-double-wall-bracket-polished-chrome.jpg part regardless of what the link actually says. Is there any way to do this, or should I look into a different method? I have access to the standard Python modules and BeautifulSoup.

So you need to find exactly that image on the webpage(in the HTML)? No matter what the URL of the image will be? — 4d4c, Jan 22 '14 at 13:34
To compare images from web page you can download them and use [compare](http://www.imagemagick.org/script/compare.php). Or check this [question](http://stackoverflow.com/questions/1927660/compare-two-images-the-python-linux-way) — 4d4c, Jan 22 '14 at 13:46
I'm not looking to compare images and download them, I don't have an image to compare with, I just need a way for python to find the URL for me and then I can use another program I've written to download the image for me. Thanks for the reply though :) — user3223643, Jan 22 '14 at 13:50

Francisco Lavin · Answer 1 · 2014-01-22T14:41:59.750

0

You can try using lxml (http://lxml.de/) and XPath (http://en.wikipedia.org/wiki/XPath).

For example, to find the images in the HTML you can do this:

import lxml.html
import requests

html = requests.get('http://www.google.com/').text
doc = lxml.html.document_fromstring(html)
images = doc.xpath('//img')  # select every <img> element in the document
if images:
    print(images[0].get('src'))  # src attribute of the first image
else:
    print("Images not found")
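If lxml is not installed, roughly the same lookup can be sketched with only the standard library, which the question says is available. This is a minimal sketch for Python 3, not a drop-in replacement for the XPath query: the class name `ImageSrcCollector`, the helper `find_image_sources`, and the sample markup are all made up for illustration.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:  # skip <img> tags that carry no src at all
                self.sources.append(src)


def find_image_sources(html_text):
    """Return the src values of all <img> tags found in html_text."""
    parser = ImageSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


# Hypothetical markup for illustration; a real page will look different.
sample = (
    '<div><a href="/p/1"><img src="/img/a.jpg" alt="A"></a>'
    '<img alt="no src"><img src="/img/b.png"/></div>'
)
print(find_image_sources(sample))
```

Because the collector only looks at the tag name and its `src` attribute, the link text or `alt` text around the image does not matter.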

I hope this helps.

UPDATE: I fixed the else, which was missing its ":".

edited Jan 22 '14 at 14:41

answered Jan 22 '14 at 13:47

I'm trying it out now but keep getting "Images not found". I'll try to get it working, thanks for the help – user3223643 Jan 22 '14 at 14:07
I updated the answer to add the missing ":" after the else; testing it, I get "/images/srpr/logo9w.png" – Francisco Lavin Jan 22 '14 at 14:43

score 0 · Accepted Answer · answered Jan 22 '14 at 13:48

The Beautiful Soup documentation has a nice "Quick Start" section: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start

from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen

url = "http://www.darlighting.co.uk/"
html = urlopen(url).read()
soup = Soup(html, "html.parser")  # name the parser explicitly

# find image tag with specific source
the_image_tag = soup.find("img", src='/images/dhl_logo.png')
print(type(the_image_tag), the_image_tag)
# >>> <class 'bs4.element.Tag'> <img src="/images/dhl_logo.png"/>

# find all image tags and print each src (None if a tag has no src)
img_tags = soup.find_all("img")
for img_tag in img_tags:
    print(img_tag.get('src'))
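The src values printed by the loop may be relative paths such as /images/dhl_logo.png, while a downloader usually needs the full URL. A minimal Python 3 sketch of that last step follows, using only the standard library. The helper name `absolute_image_urls`, the sample `srcs` list, and the "large_default" pattern are assumptions made for illustration; the real page may use different paths.

```python
import re
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs, pattern=None):
    """Resolve each src against page_url; keep only regex matches if pattern is given."""
    regex = re.compile(pattern) if pattern else None
    urls = []
    for src in srcs:
        full = urljoin(page_url, src)  # absolute srcs pass through unchanged
        if regex is None or regex.search(full):
            urls.append(full)
    return urls


page = "http://www.darlighting.co.uk/"
# Hypothetical src values, standing in for [tag.get("src") for tag in img_tags].
srcs = [
    "/images/dhl_logo.png",
    "/621-large_default/empire-double-wall-bracket-polished-chrome.jpg",
]

# Assumption: the wanted product images contain "large_default" in their path.
for image_url in absolute_image_urls(page, srcs, pattern=r"large_default"):
    print(image_url)
```

Filtering on part of the path rather than the whole URL means the lookup keeps working when the product name in the link changes. With Beautiful Soup itself, a compiled regex can also be passed as the `src` filter to `find_all`.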

score 0 · Answer 3 · answered Jan 23 '14 at 19:18

import http.client  # Python 3 name; the module was called httplib on Python 2
from lxml import html

# CONNECTION
host = "www.darlighting.co.uk"
path = "/"
conn = http.client.HTTPConnection(host)
conn.putrequest("GET", path)
# Request headers go here
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)", "Cache-Control": "no-cache"}
for k, v in header.items():
    conn.putheader(k, v)
conn.endheaders()
res = conn.getresponse()

if res.status == 200:
    source = res.read()
else:
    # Nothing to parse: report the failure and stop
    print(res.status)
    print(res.getheaders())
    raise SystemExit(1)

# EXTRACT
dochtml = html.fromstring(source)
for elem, att, link, pos in dochtml.iterlinks():
    if att == 'src':  # use 'href' instead to list hyperlinks
        print('elem: {0} || pos {1}: || attr: {2} || link: {3}'.format(elem, pos, att, link))
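The connection step above can also be sketched with the higher-level urllib.request module from the Python 3 standard library, which sends the same two headers with less bookkeeping. This is only one possible arrangement: the helper names `build_request` and `fetch_html` and the 10-second timeout are assumptions, not part of the answer.

```python
from urllib.request import Request, urlopen

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Cache-Control": "no-cache",
}


def build_request(url, headers=HEADERS):
    """Create a GET request carrying the given headers; nothing is sent yet."""
    return Request(url, headers=headers)


def fetch_html(url, timeout=10):
    """Return the response body as bytes; urlopen raises HTTPError on 4xx/5xx."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request needs no network access, so it can be inspected first.
request = build_request("http://www.darlighting.co.uk/")
print(request.get_method(), request.full_url)
```

The bytes returned by `fetch_html` can be handed to `html.fromstring` in the same way as `source` in the extraction step.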
