How to scrape using Python a link from a html class

Question

I am attempting to grab a link from a website: the audio file for a word's pronunciation. The page is http://dictionary.reference.com/browse/would?s=t

I am using the following code to get the link, but it is coming up blank. This is odd, because I can use a similar setup to pull data for a stock. The idea is to build a program that plays the sound of a word and then asks for its spelling; it's mostly for my kids. I need to go through a list of words and collect the links in a dictionary, but I'm having trouble getting the link to print out. I'm using urllib and re; the code is below.

import urllib
import re
words = [ "would","your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t" #produces link
    htmlfile = urllib.urlopen(urll)
    htmltext = htmlfile.read()
    regex = '<a class="speaker" href =>(.+?)</a>' #puts tag together
    pattern = re.compile(regex)
    link = re.findall(pattern, htmltext)
    print "the link for the word", word, link #should print link

This is the expected output for the word "would": http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3

The class I need, no matter the word, will be "speaker". – Jonathan Holloway Jan 15 '16 at 23:05

score 2 · Accepted Answer · edited May 23 '17 at 12:15

You should fix your regular expression to grab everything inside the href attribute value:

<a class="speaker" href="(.*?)"
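To see why the original pattern comes back empty, here is a minimal sketch that runs both patterns against a single anchor tag. The `sample_html` string is a hypothetical stand-in for the page's markup (the real page may order or quote its attributes differently), so treat this as an illustration of the regex behaviour, not of the site itself. The original pattern looks for the literal text `href =>`, which does not occur in a tag written as `href="..."`.

```python
import re

# Hypothetical sample of the markup; the real page's HTML may differ.
sample_html = (
    '<a class="speaker" '
    'href="http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3">'
    'listen</a>'
)

# Original pattern from the question: requires the literal text 'href =>'.
broken = re.findall(r'<a class="speaker" href =>(.+?)</a>', sample_html)

# Corrected pattern: capture everything between the quotes of href (non-greedy).
pattern = re.compile(r'<a class="speaker" href="(.*?)"')
links = pattern.findall(sample_html)

print(broken)  # []
print(links)   # ['http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3']
```

The corrected pattern still depends on `class` coming immediately before `href` with exactly one space between them, which is one reason an HTML parser tends to be more robust.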

Note that you should really consider switching from regex to HTML parsers, like BeautifulSoup.

Here is how you can apply BeautifulSoup in this case:

import urllib

from bs4 import BeautifulSoup

words = ["would","your", "apple", "orange"]

for word in words:
    urll = "http://dictionary.reference.com/browse/" + word + "?s=t" #produces link
    htmlfile = urllib.urlopen(urll)

    soup = BeautifulSoup(htmlfile, "html.parser")
    links = [link["href"] for link in soup.select("a.speaker")]

    print(word, links)
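If installing BeautifulSoup is not an option, roughly the same extraction can be sketched with the standard library's `html.parser` module on Python 3. This is a swapped-in alternative, not the approach shown above: the `SpeakerLinkParser` class, the `extract_speaker_links` helper, and the `sample_html` string are hypothetical names and data made up for illustration. Note also that on Python 3 `urllib.urlopen` has moved to `urllib.request.urlopen`; the sketch parses an inline string so it runs without network access.

```python
from html.parser import HTMLParser


class SpeakerLinkParser(HTMLParser):
    """Collects href values from <a> tags whose class list contains 'speaker'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A class attribute can hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        if "speaker" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def extract_speaker_links(html_text):
    """Hypothetical helper: return every 'speaker' link found in html_text."""
    parser = SpeakerLinkParser()
    parser.feed(html_text)
    return parser.links


# Hypothetical markup standing in for the downloaded page.
sample_html = """
<div>
  <a href="http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3" class="speaker audio">listen</a>
  <a class="other" href="/somewhere">other</a>
</div>
"""

print(extract_speaker_links(sample_html))
# ['http://static.sfdict.com/staticrep/dictaudio/W02/W0245800.mp3']
```

Because the parser inspects attributes by name, it does not matter whether `class` or `href` comes first in the tag, which is the case the regex version cannot handle.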

answered Jan 15 '16 at 23:06

alecxe

so change regex to this regex = '(.*?)' – Jonathan Holloway Jan 15 '16 at 23:41
Okay it worked regex = ' – Jonathan Holloway Jan 16 '16 at 00:40
