Getting the text from links inside a td with BeautifulSoup in Python 2.7

Question

I'm trying to get BeautifulSoup to capture a list of all of the location names by scraping. I used to use the following:

locs = LOOPED.findAll("td", {"class": "max use"})

This used to work for the HTML:

<td class="max use" style="">London</td>

However, the HTML has changed to the following, and it no longer returns London:

<td class="max use" style="">
    <div class="notranslate">
        <span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>
    </div>
</td>

Edit: If I print locs, I get a list like:

[<td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>\n</div>\n</td>, <td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span> <span class="extra hidden">(DEFAULT)</span>\n</div>\n</td>, <td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/liverpool/">Liverpool</a></span> <span class="extra hidden">(NA)</span>\n</div>\n</td>]

As you can see, it contains 3 different locations. From the above I would expect to see a list of [London, Manchester, Liverpool].

I thought that I should be using something like:

locs = LOOPED.findAll("td", {"class": "max use"})
locs = locs.findAll('a')[1]
print locs.text

But this only returns:

AttributeError: 'ResultSet' object has no attribute 'findAll'

I can't work out how to get BeautifulSoup to search again, within those results, for the text of the a hyperlink...

Isn't it because your 'a' is not directly under 'td'? I guess you need to go through 'div' then 'span' first. — quemeraisc, May 13 '16 at 09:05
@AvinashRaj Yes, if I print `locs` after `locs = LOOPED.findAll("td", {"class": "max use"})` it prints the HTML that has the link under a `div` & `span`. — Ryflex, May 13 '16 at 09:09
Hey, the issue here is that `locs` is a `list`. If text from each location in `locs` is needed, you'll have to loop over `locs` and print the text in each of the locations. — kreddyio, May 13 '16 at 09:27
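
The point made in the comments above (findAll returns a list-like ResultSet, so each td has to be searched individually) can be sketched as follows. This is only a minimal sketch: it assumes the bs4 package, the HTML string is a hypothetical sample modelled on the markup in the question, and `soup` stands in for the question's `LOOPED` object. Note that find() searches all descendants of a tag, so the a does not need to be a direct child of the td.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the HTML shown in the question.
HTML = """
<table><tr>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>
</div></td>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span> <span class="extra hidden">(DEFAULT)</span>
</div></td>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/liverpool/">Liverpool</a></span> <span class="extra hidden">(NA)</span>
</div></td>
</tr></table>
"""

# "soup" plays the role of LOOPED from the question (an assumption).
soup = BeautifulSoup(HTML, "html.parser")

# find_all (spelled findAll in older code) returns a ResultSet: a list of tags.
cells = soup.find_all("td", {"class": "max use"})

names = []
for cell in cells:
    # find() looks through every descendant, so div/span nesting is no obstacle.
    link = cell.find("a")
    if link is not None:  # skip cells that contain no link
        names.append(link.get_text(strip=True))

print(names)
```

Calling findAll on `locs` itself fails because the method lives on each tag, not on the list that holds the tags.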

Ani Menon · Accepted Answer · 2016-05-13T13:56:02.050

2

Try this:

tag = LOOPED.findAll('td')  # all "td" tags, as a list
tag_a = tag[0].find('a')
print tag_a.text

edited May 13 '16 at 13:56

answered May 13 '16 at 09:07


That doesn't work for me; it needs to search for the `max use` class first and then look for the `a` – Ryflex May 13 '16 at 09:10
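
One possible way to meet the requirement in the comment above (match the `max use` cells first, then look for the a inside them) is a CSS selector. This is a sketch rather than the answerer's method: it assumes the bs4 package and its select() method, the HTML string is a hypothetical sample, and the extra `other` cell is invented purely to show that unrelated links are skipped.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too

from bs4 import BeautifulSoup

# Hypothetical sample markup: two "max use" cells and one unrelated cell.
HTML = """
<table><tr>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>
</div></td>
<td class="other"><a href="/help/">Help</a></td>
<td class="max use" style=""><div class="notranslate">
<span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span> <span class="extra hidden">(DEFAULT)</span>
</div></td>
</tr></table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "td.max.use a" means: any <a> nested at any depth inside a <td>
# that carries both the "max" and the "use" class.
links = soup.select("td.max.use a")

names = [link.get_text(strip=True) for link in links]
print(names)
```

The selector handles the class filter and the nested lookup in one step, so no explicit loop over the td elements is needed.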

score 1 · Answer 2 · edited May 23 '17 at 10:28

A method more robust to future HTML structure changes is to get all of the text inside each td element, as described in this answer:

locs = LOOPED.findAll("td", {"class": "max use"})
for loc in locs:
    print ''.join(loc.findAll(text=True))
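
Be aware that taking all of the text in a cell also picks up the surrounding whitespace and the "(DEFAULT)" label from the second span. The sketch below illustrates this, assuming the bs4 package and a hypothetical HTML sample copied from the question; it uses get_text(), which for this markup gives the same string as joining the text nodes.

```python
from __future__ import print_function  # keeps print() working on Python 2.7 too

from bs4 import BeautifulSoup

# Hypothetical sample: a single cell, laid out as in the question.
HTML = """<td class="max use" style="">
    <div class="notranslate">
        <span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>
    </div>
</td>"""

soup = BeautifulSoup(HTML, "html.parser")
cell = soup.find("td", {"class": "max use"})

# All text in the cell, including the newlines and indentation between tags.
raw = cell.get_text()

# Strip each text fragment, drop the empty ones, and join the rest with a space.
tidy = cell.get_text(" ", strip=True)

print(repr(raw))
print(tidy)
```

If only the location name is wanted, the text of the a element alone is the safer thing to extract; the whole-cell text is more tolerant of markup changes but needs cleaning afterwards.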


answered May 13 '16 at 09:14

taleinat

