I am scraping data from a website. In the page's source code, the table shows only a "loading" placeholder instead of the data. How can I collect that data using Python? The site appears to be a React.js web app.
-
Look under dev tools to see the XHR requests. If you share the URL, I may be able to show you what I mean. – chitown88 Dec 20 '19 at 12:32
-
https://www.ycombinator.com/companies/ – srinivas muralidharan Dec 20 '19 at 12:33
-
[Duplicated] See https://stackoverflow.com/q/8049520/12565014 – JxCode Dec 20 '19 at 12:34
2 Answers
I can't find the data as a request under XHR, so you could use Selenium, which lets the page render, and then grab the table with pandas:
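from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the local chromedriver binary (Selenium 4 takes it through a Service object)
driver = webdriver.Chrome(service=Service(executable_path='C:/chromedriver_win32/chromedriver.exe'))
url = 'https://www.ycombinator.com/companies/'
driver.get(url)  # the browser runs the page's JavaScript, which builds the table
# read_html returns a list of DataFrames; take the first table on the page
df = pd.read_html(StringIO(driver.page_source))[0]
driver.quit()  # close the browser and stop the driver process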
Output:
print (df)
[ 0 1 2
0 Actiondesk s2019 Google Sheets meets Zapier. Actiondesk lets no...
1 Alana s2019 Helping large companies in LATAM hire blue-col...
2 Apero Health s2019 Modern medical billing.
3 Apurata s2019 Small loans for the Latin American middle clas...
4 Arpeggio Bio s2019 Arpeggio builds technology to watch and learn ...
5 Asayer s2019 Asayer is a session replay tool for developers...
6 Asher Bio s2019 We build better immunotherapies
7 AudioFocus s2019 NaN
8 Axite Labs s2019 A modern IP licensing platform to accelerate t...
9 basis s2019 Software to automate construction workflows, s...
10 Beacons AI s2019 Helping creators monetize through short video ...
11 Binks s2019 Binks is a chain of trusted micro-boutiques th...
12 Blair s2019 Financing college education through Income Sha...
13 Boost Biomes s2019 NaN
14 Bouncer s2019 SDK for scanning and verifying credit cards an...
15 Brave Care s2019 Modern healthcare for kids. We do that with a ...
16 Breadfast s2019 Breadfast delivers fresh bread, milk and eggs ...
17 BuildStream s2019 A market network for industrial labor
18 Business Score s2019 Connecting startups with the things they need.
19 Canix s2019 Canix makes it easy to get and stay compliant ...
20 Carry s2019 Carry plans, books, and supports corporate tra...
21 Carve s2019 NaN
22 Cloosiv s2019 Cloosiv is an order-ahead app for independent ...
23 Coco s2019 The Venezuelan Instacart - allowing Venezuelan...
24 CoLab Software s2019 Jira for Mechanical Engineering Teams
25 Compound s2019 Compound helps people who work at startups und...
26 Courier s2019 Send your product's user notifications to the ...
27 Covela s2019 The digital insurance broker for SMEs in LATAM
28 Cuboh s2019 Cuboh helps restaurants use several delivery p...
29 Curri s2019 We provide on-demand material delivery for the...
... ... ...
2009 Zenter w2007 NaN
2010 Jamglue s2006 NaN
2011 Jumpchat s2006 NaN
2012 Likebetter s2006 NaN
2013 Omgpop s2006 NaN
2014 Pollground s2006 Online polls.
2015 Scribd s2006 World's largest online library.
2016 Shoutfit s2006 NaN
2017 Talkito s2006 NaN
2018 Thinkature s2006 NaN
2019 Xobni s2006 NaN
2020 Zanbazaar s2006 NaN
2021 Audiobeta w2006 NaN
2022 Clustrix w2006 NaN
2023 Flagr w2006 NaN
2024 Inkling w2006 NaN
2025 Project Wedding w2006 NaN
2026 Snipshot w2006 We sold Snipshot to Ansa in 2013.
2027 Wufoo w2006 Online form builder.
2028 Airtime s2005 NaN
2029 Clickfacts s2005 NaN
2030 Infogami s2005 NaN
2031 Kiko s2005 We're the best online calendar solution to eve...
2032 Loopt s2005 NaN
2033 Memamp s2005 NaN
2034 Parakey s2005 NaN
2035 Posthaven s2005 Blogging forever
2036 Reddit s2005 The frontpage of the internet.
2037 Simmery s2005 NaN
2038 TextPayMe s2005 NaN
[2039 rows x 3 columns]]

chitown88
-
@chitown88 There is an API; if you go to the All tab under Network you will find it. – KunduK Dec 20 '19 at 13:22
-
@KunduK, good find! Yup, that's what I was looking for, but I was only looking under XHR. I'll have to remember to check under the All tab in the future. Thanks for posting that! Srinivas, accept KunduK's solution. While Selenium will work, theirs is the better alternative. – chitown88 Dec 20 '19 at 13:38
-
To go more in depth: how does scraping differ from API calls? Is this scraping or an API call? – srinivas muralidharan Dec 20 '19 at 13:45
-
Going through Selenium, requests, or BeautifulSoup (and pandas' `pd.read_html()` itself parses the HTML with lxml or BeautifulSoup under the hood), you'd be scraping, meaning you are parsing the HTML source to extract the data. A request to an API gets the data directly. You aren't really scraping the data then; you're querying it directly from the source that supplies the data rendered into the HTML. – chitown88 Dec 20 '19 at 13:52
-
An API is always the better way to go if you can. The data is usually nicely structured, and often you can get additional metadata that isn't easily seen in the HTML. – chitown88 Dec 20 '19 at 13:53
-
@chitown88: When you said you found nothing under XHR, I had a doubt, so I checked the All tab and found that API link. However, I appreciate your effort. – KunduK Dec 20 '19 at 14:03
-
Thanks @KunduK. I appreciate your efforts towards the solution. I am still looking for a web scraping technique, though. – srinivas muralidharan Dec 20 '19 at 15:31
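To make the scraping-versus-API distinction from the comments above concrete, here is a minimal sketch using only the standard library. The HTML snippet and JSON payload are hand-written samples (an assumption, not real responses from the site), and `TableParser`, `scrape_rows`, and `api_rows` are hypothetical helper names. Scraping has to walk the markup to recover the cells, while the API payload is already structured.

```python
import json
from html.parser import HTMLParser

# Hand-written sample inputs (assumptions), not real responses from the site.
SAMPLE_HTML = """
<table>
  <tr><td>Reddit</td><td>s2005</td><td>The frontpage of the internet.</td></tr>
  <tr><td>Scribd</td><td>s2006</td><td>World's largest online library.</td></tr>
</table>
"""
SAMPLE_JSON = json.dumps([
    {"name": "Reddit", "batch": "s2005", "description": "The frontpage of the internet."},
    {"name": "Scribd", "batch": "s2006", "description": "World's largest online library."},
])


class TableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])       # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")   # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # append text to the current cell


def scrape_rows(html):
    """Scraping: parse the markup and rebuild the records from table cells."""
    parser = TableParser()
    parser.feed(html)
    keys = ("name", "batch", "description")
    return [dict(zip(keys, row)) for row in parser.rows]


def api_rows(payload):
    """API call: the payload is already structured, so decoding is enough."""
    return json.loads(payload)


scraped = scrape_rows(SAMPLE_HTML)
from_api = api_rows(SAMPLE_JSON)
print(scraped == from_api)
```

Both paths end with the same list of records here; the difference is how much parsing code sits between the response and the data.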
If you go to the Network tab you will find the API below, which returns the data in JSON format. You don't need selenium or beautifulsoup. Here is the code:
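import requests

# JSON endpoint found under the Network tab; it returns a list of company records
res = requests.get("https://api.ycombinator.com/companies/export.json?").json()

# (output label, JSON key) pairs to print for every company
fields = [('name', 'name'), ('URL', 'url'), ('batch', 'batch'), ('Description', 'description')]

for item in res:
    for label, key in fields:
        value = item.get(key)
        if value:  # skip only the field that is missing or empty, not the whole record
            print(label + ':' + str(value))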
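If you want to keep the records rather than print them, one option is to write them to CSV. The following is a minimal sketch, assuming each record is a dict with the `name`, `url`, `batch`, and `description` keys used above. `records_to_csv` is a hypothetical helper and the sample records are hand-written stand-ins for the API response, so the example runs without a network call.

```python
import csv
import io

# Keys read in the answer's loop; a record may lack some of them.
FIELDS = ("name", "url", "batch", "description")


def records_to_csv(records, fields=FIELDS):
    """Return CSV text with one row per record; missing or null fields become empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow({field: record.get(field) or "" for field in fields})
    return buffer.getvalue()


# Hand-written sample records (assumption): the second has no description.
sample = [
    {"name": "Wufoo", "batch": "w2006", "description": "Online form builder."},
    {"name": "AudioFocus", "batch": "s2019", "description": None},
]
csv_text = records_to_csv(sample)
print(csv_text)
```

With the real response you would pass `res` in place of `sample` and write the returned text to a file.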
[Image: snapshot of the API response]

KunduK