
I'm trying to get BeautifulSoup to capture a list of all the location names from a scraped page. I used to use the following:

locs = LOOPED.findAll("td", {"class": "max use"})

This used to work for the following HTML:

<td class="max use" style="">London</td>

However, the HTML has changed to the following, and my code no longer returns London:

<td class="max use" style="">
    <div class="notranslate">
        <span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>
    </div>
</td>

Edit: If I print locs, I get a list like:

[<td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>\n</div>\n</td>, <td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span> <span class="extra hidden">(DEFAULT)</span>\n</div>\n</td>, <td class="max use" style="">\n<div class="notranslate">\n<span><a data-title="View Location" href="/location/uk/gb/liverpool/">Liverpool</a></span> <span class="extra hidden">(NA)</span>\n</div>\n</td>]

As you can see, this contains 3 different locations. From the above, I would expect to get the list [London, Manchester, Liverpool].

I thought that I should be using something like:

locs = LOOPED.findAll("td", {"class": "max use"})
locs = locs.findAll('a')[1]
print locs.text

But this only returns:

AttributeError: 'ResultSet' object has no attribute 'findAll'

I can't work out how to get BeautifulSoup to search again, within each result, for the text of the a hyperlink...

Ryflex
  • Is it not because your 'a' is not directly under 'td', I guess you need to go through 'div' then 'span' first. – quemeraisc May 13 '16 at 09:05
  • @AvinashRaj Yes, if I print `locs` after `locs = LOOPED.findAll("td", {"class": "max use"})` it prints the HTML that has the link under a `div` & `span`. – Ryflex May 13 '16 at 09:09
  • Hey, the issue here is that `locs` is a `list`. If text from each location in `locs` is needed, you'll have to loop over `locs` and print the text in each of the locations. – kreddyio May 13 '16 at 09:27

2 Answers


Try this:

tag = LOOPED.findAll('td')  # all <td> tags, returned as a list
tag_a = tag[0].find('a')    # first <a> anywhere inside the first <td>
print(tag_a.text)
Ani Menon
  • That doesn't work for me, it needs to search for the `max use` class first and then look for the `a` – Ryflex May 13 '16 at 09:10
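
Following up on the comment above: `findAll` returns a list-like `ResultSet`, not a single tag, so one way to search for the `max use` class first and then look for the `a` is to loop over the results and call `find` on each cell. This is a minimal sketch, assuming BeautifulSoup 4 (the `bs4` package) on Python 3. The sample `html` string and the names `soup` and `names` are illustrative and not from the original post.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the HTML in the question (two cells only).
html = """
<table><tr>
<td class="max use" style="">
    <div class="notranslate">
        <span><a data-title="View Location" href="/location/uk/gb/london/">London</a></span> <span class="extra hidden">(DEFAULT)</span>
    </div>
</td>
<td class="max use" style="">
    <div class="notranslate">
        <span><a data-title="View Location" href="/location/uk/gb/manchester/">Manchester</a></span> <span class="extra hidden">(DEFAULT)</span>
    </div>
</td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
# Each item in the result set is a single tag, so find() works on it.
for td in soup.find_all("td", {"class": "max use"}):
    link = td.find("a")  # searches all descendants, not just direct children
    if link is not None:  # skip cells that have no link
        names.append(link.get_text(strip=True))

print(names)
```

Because `find` searches all descendants, the intermediate `div` and `span` do not need to be walked explicitly.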

A method more robust to future HTML structure changes is to get all of the text inside each td element, as described in this answer:

locs = LOOPED.findAll("td", {"class": "max use"})
for loc in locs:
    # join every text node found inside this <td>
    print(''.join(loc.findAll(text=True)))
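
Note that taking all of the text inside the cell also picks up the `(DEFAULT)` / `(NA)` span and the surrounding newlines. The sketch below shows one possible way to tidy that up, assuming BeautifulSoup 4 (`bs4`) on Python 3. The names `full_text`, `hidden` and `name_only` are illustrative, and removing the `extra` span is an assumption about what the asker wants, not something stated in the original post.

```python
from bs4 import BeautifulSoup

# One cell copied from the printed output in the question.
html = (
    '<td class="max use" style="">\n<div class="notranslate">\n'
    '<span><a data-title="View Location" href="/location/uk/gb/liverpool/">'
    'Liverpool</a></span> <span class="extra hidden">(NA)</span>\n</div>\n</td>'
)
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", {"class": "max use"})

# All text in the cell, with whitespace-only strings dropped.
full_text = td.get_text(" ", strip=True)

# Optionally drop the trailing "extra" span before reading the text.
# extract() removes the tag from the parsed tree.
hidden = td.find("span", class_="extra")
if hidden is not None:
    hidden.extract()
name_only = td.get_text(" ", strip=True)

print(full_text)
print(name_only)
```

The trade-off is that this stays independent of the `div`/`span`/`a` nesting, but it relies on the extra span keeping its `extra` class.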
taleinat