
I'm trying to get all the links and images from a page using the HTML parser at http://easyhtmlparser.sourceforge.net/, but my code isn't working:

from ehp import *

fd = open('file.html', 'r')
data = fd.read()
fd.close()
html = Html()
dom = html.feed(data)
for ind in dom.sail():
    if ind.name == 'a':
        print ind.attr['ref']
tau

2 Answers


Well, I don't particularly want to read the docs for easyhtmlparser, but if you're willing to use Beautiful Soup:

from bs4 import BeautifulSoup

with open('file.html', 'r') as fd:
    data = fd.read()

soup = BeautifulSoup(data, 'html.parser')  # or another parser such as lxml
for link in soup.find_all('a'):
    print(link.get('href'))  # or do whatever with it

should work, but I haven't tested it. Good luck!

Edit: Now I have. It works.

Edit 2: To find images, search for all the img tags and read their src attributes. You can find how in the Beautiful Soup or easyhtmlparser docs.
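
With Beautiful Soup it could look something like this (an untested sketch; 'file.html' is the same file as above):

from bs4 import BeautifulSoup

with open('file.html', 'r') as fd:
    soup = BeautifulSoup(fd.read(), 'html.parser')

# Collect the src attribute of every img tag, skipping tags that have none.
image_urls = [img.get('src') for img in soup.find_all('img') if img.get('src')]
print(image_urls)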

To download an image and put it into a folder:

import urllib

# Python 2's urllib; in Python 3 this lives at urllib.request.urlretrieve.
urllib.urlretrieve(IMAGE_URL, 'path_to_folder/imagename')

Or you could just read from urllib and write the data out yourself, since in the end everything is just a string, and read is more straightforward than retrieve.
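
That version might look something like this (a rough Python 2 sketch; the URL and folder name are just placeholders):

import os
import urllib

image_url = 'http://example.com/picture.jpg'  # placeholder image URL
folder = 'downloaded_images'                  # placeholder output folder

if not os.path.isdir(folder):
    os.makedirs(folder)

# Fetch the raw bytes and write them to a file yourself instead of using urlretrieve.
data = urllib.urlopen(image_url).read()
with open(os.path.join(folder, os.path.basename(image_url)), 'wb') as out:
    out.write(data)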

vroomfondel
  • Well, I tried Beautiful Soup, but the easyhtmlparser docs seemed simpler. I don't particularly like Beautiful Soup; it doesn't seem to have methods to handle other things. Anyway, it's fine, I'll keep trying here. – tau Jul 03 '13 at 08:01
  • @barroieuoeiru Whatever works for you. It looks to me as though Beautiful Soup has more features, is more reliable, and is better documented, though. – vroomfondel Jul 03 '13 at 08:05
  • I think I know why my code wasn't working: I was using 'ref' instead of 'href'. Apparently I could also use the method dom.find('a') to iterate over all the links. – tau Jul 03 '13 at 08:05
  • But I still don't get how I could get all the links for the images; I would like to download them into a folder. – tau Jul 03 '13 at 08:07
  • Well, look for img tags instead. As far as downloading them into a folder goes, http://stackoverflow.com/questions/3042757/downloading-a-picture-via-urllib-and-python is helpful. – vroomfondel Jul 03 '13 at 08:12

I would do it like this.

from ehp import *

with open('file.html', 'r') as fd:
    data = fd.read()

html = Html()
dom = html.feed(data)

# Walk every element in the document and pick out link and image URLs.
for ind in dom.sail():
    if ind.name == 'a':
        print(ind.attr['href'])
    elif ind.name == 'img':
        print(ind.attr['src'])
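
If you also want to download the images into a folder, as asked in the comments above, you could combine this with urllib.urlretrieve. Here is a rough Python 2 sketch; 'images' is a placeholder folder name, it assumes attr behaves like a dict (as the code above suggests), and it assumes the src values are absolute URLs:

import os
import urllib
from ehp import *

with open('file.html', 'r') as fd:
    data = fd.read()

html = Html()
dom = html.feed(data)

folder = 'images'  # placeholder output folder
if not os.path.isdir(folder):
    os.makedirs(folder)

for ind in dom.sail():
    if ind.name == 'img' and 'src' in ind.attr:
        src = ind.attr['src']
        # Assumes src is an absolute URL; a relative path would need urlparse.urljoin first.
        urllib.urlretrieve(src, os.path.join(folder, os.path.basename(src)))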