
I want to get data from the table on this website: https://www.skyscrapercenter.com/quick-lists#q=&page=1&type=building&status=COM&status=UCT&min_year=0&max_year=9999&region=0&country=0&city=0 . When I try to read the HTML content of the table, I get an empty body, like this:

<thead>
<tr>
<th width="4%"> <div class="flex">#</div> </th>
<th width="15"> </th>
<th> <div class="flex">Building Name</div> </th>
<th width="15%"> <div class="flex">City</div> </th>
<th width="8%"> <div class="flex">Height m</div> </th>
<th width="8%"> <div class="flex">Floors</div> </th>
<th width="8%"> <div class="flex">Completion</div> </th>
<th width="10%"> <div class="flex">Material</div> </th>
<th width="15%"> <div class="flex">Use</div> </th>
</tr>
</thead>
<tbody>
</tbody>
</table>

Inspect element shows that there is data inside the body, but with my code I can only get information from thead. find_all('tr') only returns the row from thead, and find_all('td') returns nothing. This is my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.skyscrapercenter.com/quick-lists#q=&page=1&type=building&status=COM&status=UCT&min_year=0&max_year=9999&region=0&country=0&city=0'

page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table1 = soup.find('table', id='table-buildings')

headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)
mydata = pd.DataFrame(columns = headers)

# Create a for loop to fill mydata

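for j in table1.find_all('tr'):
    row_data = j.find_all('td')
    if not row_data:
        # The header row contains only <th> cells, so skip it
        continue
    row = [i.text for i in row_data]
    length = len(mydata)
    # Add the row at the next free index (assigning to mydata.append stores nothing)
    mydata.loc[length] = row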

mydata

I found this similar post, but the link they use is broken so I can't check it and honestly I don't quite know how to adapt the answer to my own situation, as I'm pretty new to scraping.

Another question: how can I access the rows on the following pages? I would like to scrape all 500 results, not just the first 50. Thanks in advance!

AxelllD

1 Answer

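This happens because the table rows are filled in by JavaScript after the page loads. The requests module only downloads the initial HTML and does not execute JavaScript, so the tbody you get back is empty.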
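A related detail worth checking: everything after the `#` in the URL is a fragment, which is never sent to the server and is only read by the page's own JavaScript. The short sketch below uses only the standard library to split the URL from the question and show which part a plain HTTP client actually requests. The variable names are just for illustration.

```python
from urllib.parse import urlsplit

url = ("https://www.skyscrapercenter.com/quick-lists"
       "#q=&page=1&type=building&status=COM&status=UCT"
       "&min_year=0&max_year=9999&region=0&country=0&city=0")

parts = urlsplit(url)

# The part that goes into the HTTP request: scheme, host, and path
requested = f"{parts.scheme}://{parts.netloc}{parts.path}"

# The part that stays on the client for the page's JavaScript to interpret
fragment = parts.fragment

print("requested:", requested)
print("fragment: ", fragment)
```

So the filters and the page number in the fragment have no effect on what requests receives, which is consistent with getting only the static thead back.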

Take a look at this question for a solution: "Using python Requests with javascript pages".

It suggests using the requests-html module to handle JavaScript. Hope it's helpful to you.

Primus
  • Hey, I found another way: using Selenium to open the page and then the time module to wait a while before extracting the data. The rest of my code worked; I just needed a different way to load the page. Selenium also solved my problem of clicking the next button. – AxelllD Jan 31 '22 at 21:22
  • That's great! Thanks for sharing! – Primus Feb 01 '22 at 05:56
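For readers who want to try the Selenium approach from the comment, here is a minimal sketch, not the commenter's actual code. It assumes Chrome and a matching driver are available. `scrape_first_page`, `extract_rows`, and the 5-second wait are hypothetical names and values chosen for illustration. The row extraction uses the standard-library HTML parser so the sketch has Selenium as its only third-party dependency; the BeautifulSoup loop from the question should work just as well on `driver.page_source`. Clicking the next button is left out because the button's selector is not given in the thread.

```python
import time
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Join the pieces and collapse runs of whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows hold only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return a list of rows, each a list of cell strings, for all tables in html."""
    parser = TableRows()
    parser.feed(html)
    return parser.rows


def scrape_first_page(url, wait_seconds=5):
    """Open url in a real browser, wait for the JavaScript to run, return table rows."""
    # Imported here so extract_rows() can be used without Selenium installed
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait, as described in the comment
        return extract_rows(driver.page_source)
    finally:
        driver.quit()
```

Calling `scrape_first_page(url)` with the URL from the question should return the rows of the first page once the table has rendered. A fixed sleep is simple but fragile, so the wait may need tuning for slower connections.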