I'm practicing web scraping on a real-estate website, and I want to scrape the addresses of all recent sales. The page is https://www.compass.com/agents/irene-vuong/, and the relevant part of its HTML looks like this:

<div class="profile-active-listings" role="tabpanel" id="active-listings-sales">
    <div class="card-content">
      <a class="card-title" href="/listing" data-tn="label-address"> 111 East 35th </a>
                                            ........
<div class="textIntent-headline1"> Recent Sales</div>
    <div class="card-content">
      <a class="card-title" href="/morelisting" data-tn="label-address"> East 4th </a>

I'm trying to get all the addresses using the code below:

for i in range(0, 30):
    h = soup.findAll('a', {'class':'card-title'})[i]
    print(h)

However, I get this error:

IndexError: list index out of range

I get the first few addresses, but only the ones that appear before "Recent Sales". The code picks up addresses from the first part of the page, not the whole page. How do I get all the addresses?

Sarah
  • Could you share the actual page you're trying to scrape? – Zachary Blackwood Mar 03 '20 at 21:51
  • @ZacharyBlackwood I have just added! – Sarah Mar 03 '20 at 21:55
  • It looks like you might be using the wrong `class`. There are currently 12 items on that page with the class `uc-listingCart-title`, not `card-title`. If you loop through those as suggested by @user2263572 (as opposed to hard-coding the `30`), that should give you all the items you're looking for. – Zachary Blackwood Mar 03 '20 at 22:03
  • @ZacharyBlackwood Hi, I tried the suggestion but it still only gets part of it and not all.... :-( – Sarah Mar 04 '20 at 15:11
  • Ah. Looks like the extra items are being added dynamically on the front-end. This answer might be helpful for getting the page contents after javascript has added the items. https://stackoverflow.com/a/26440563/5031672 – Zachary Blackwood Mar 05 '20 at 15:41

1 Answer

The findAll method returns a list of all elements that match your search criteria.

In your case, it returns a list of length 2.

You are then iterating over indexes 0-29 and looking each one up in that list of length 2. Any index past the end of the list raises the IndexError.

Your code should read something more like:

for x in soup.findAll('a', {'class':'card-title'}):
  print(x)
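
As a self-contained illustration of the loop above, here is a minimal sketch that parses a small HTML string modeled on the snippet in the question instead of fetching the live page. The `SAMPLE_HTML` string and the `extract_addresses` helper are hypothetical names introduced for this example, and the closing `</div>` tags are assumed since the question elides them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the HTML in the question (not the live page).
SAMPLE_HTML = """
<div class="profile-active-listings" role="tabpanel" id="active-listings-sales">
  <div class="card-content">
    <a class="card-title" href="/listing" data-tn="label-address"> 111 East 35th </a>
  </div>
</div>
<div class="textIntent-headline1"> Recent Sales</div>
<div class="card-content">
  <a class="card-title" href="/morelisting" data-tn="label-address"> East 4th </a>
</div>
"""


def extract_addresses(html):
    """Return the stripped text of every <a class="card-title"> in html."""
    soup = BeautifulSoup(html, "html.parser")
    # Iterate over whatever find_all returns; never assume a fixed count.
    return [a.get_text(strip=True) for a in soup.find_all("a", class_="card-title")]


addresses = extract_addresses(SAMPLE_HTML)
print(addresses)
```

Because the loop runs over the result list itself, it works whether the page yields 2 matches or 30. It can only find elements that are present in the HTML handed to BeautifulSoup, so anything the site adds later with JavaScript will still be missing.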
user2263572
  • Above response answers your original question. This answers your current questions. https://stackoverflow.com/questions/16322862/beautiful-soup-findall-doesnt-find-them-all – user2263572 Mar 04 '20 at 16:04
  • Hi I'm not sure how the post is relevant to my questions... I'm using 'html.parser' in my code. Can you please explain? It will be greatly appreciated. – Sarah Mar 04 '20 at 18:29
  • Your question is "Scraping data with multiple same class name using BeautifulSoup" and your issue is "findall not finding everything it should". I linked the question "Beautiful Soup findAll doesn't find them all". I think it's relevant. Either the link explains your issue, or you aren't using the correct CSS selectors. – user2263572 Mar 04 '20 at 18:56
  • Unfortunately, I don't think the post is helpful :-( I think my problem is different than the one in the post. – Sarah Mar 04 '20 at 18:58