
I'm trying to get all the links and images from a page using the HTML parser at http://easyhtmlparser.sourceforge.net/, but my code isn't working:

from ehp import *

fd = open('file.html', 'r')
data = fd.read()
fd.close()
html = Html()
dom = html.feed(data)
for ind in dom.sail():
    if ind.name == 'a':
        print ind.attr['ref']
tau

2 Answers


Well, I don't particularly want to read the docs for easyhtmlparser, but if you're willing to use Beautiful Soup:

from bs4 import BeautifulSoup

with open('file.html', 'r') as fd:
    data = fd.read()

soup = BeautifulSoup(data, 'html.parser')  # or another parser such as lxml
for link in soup.find_all('a'):
    print(link.get('href'))  # or do whatever with it

should work, but I haven't tested it. Good luck!

Edit: Now I have. It works.

Edit 2: To find images, search for all the img tags and read their src attributes. You can find how in the Beautiful Soup or easyhtmlparser docs.
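
With Beautiful Soup it could look something like this (an untested sketch; 'file.html' is the same file as above):

from bs4 import BeautifulSoup

with open('file.html', 'r') as fd:
    soup = BeautifulSoup(fd.read(), 'html.parser')

# Collect the src attribute of every img tag, skipping tags that have none.
image_urls = [img.get('src') for img in soup.find_all('img') if img.get('src')]
print(image_urls)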

To download an image and put it into a folder:

import urllib

# Python 2's urllib; in Python 3 this lives at urllib.request.urlretrieve.
urllib.urlretrieve(IMAGE_URL, 'path_to_folder/imagename')

Or you could just read from urllib and write the data out yourself, since in the end everything is just a string, and read is more straightforward than retrieve.
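
That version might look something like this (a rough Python 2 sketch; the URL and folder name are just placeholders):

import os
import urllib

image_url = 'http://example.com/picture.jpg'  # placeholder image URL
folder = 'downloaded_images'                  # placeholder output folder

if not os.path.isdir(folder):
    os.makedirs(folder)

# Fetch the raw bytes and write them to a file yourself instead of using urlretrieve.
data = urllib.urlopen(image_url).read()
with open(os.path.join(folder, os.path.basename(image_url)), 'wb') as out:
    out.write(data)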

vroomfondel
  • Well, I tried Beautiful Soup, but the easyhtmlparser docs seemed simpler. I don't particularly like Beautiful Soup; it doesn't seem to have methods to handle other things. Anyway, it's fine, I'll keep trying here. – tau Jul 03 '13 at 08:01
  • @barroieuoeiru Whatever works for you. It looks to me as though Beautiful Soup has more features, is more reliable, and is better documented, though. – vroomfondel Jul 03 '13 at 08:05
  • I think I know why my code wasn't working: I was using 'ref' instead of 'href'. Apparently I could also use the method dom.find('a') to iterate over all the links. – tau Jul 03 '13 at 08:05
  • But I still don't get how I could get all the links for the images; I would like to download them into a folder. – tau Jul 03 '13 at 08:07
  • Well, look for img tags instead. As far as downloading them into a folder goes, http://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python is helpful. – vroomfondel Jul 03 '13 at 08:12

I would do it like this.

from ehp import *

with open('file.html', 'r') as fd:
    data = fd.read()

html = Html()
dom = html.feed(data)

# Walk every element in the document and pick out link and image URLs.
for ind in dom.sail():
    if ind.name == 'a':
        print(ind.attr['href'])
    elif ind.name == 'img':
        print(ind.attr['src'])
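
If you also want to download the images into a folder, as asked in the comments above, you could combine this with urllib.urlretrieve. Here is a rough Python 2 sketch; 'images' is a placeholder folder name, it assumes attr behaves like a dict (as the code above suggests), and it assumes the src values are absolute URLs:

import os
import urllib
from ehp import *

with open('file.html', 'r') as fd:
    data = fd.read()

html = Html()
dom = html.feed(data)

folder = 'images'  # placeholder output folder
if not os.path.isdir(folder):
    os.makedirs(folder)

for ind in dom.sail():
    if ind.name == 'img' and 'src' in ind.attr:
        src = ind.attr['src']
        # Assumes src is an absolute URL; a relative path would need urlparse.urljoin first.
        urllib.urlretrieve(src, os.path.join(folder, os.path.basename(src)))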