
I am trying to extract specific elements from a dynamically constructed site (namely https://www.kingstore.co.il/Food_Law/Main.aspx). While I am not fluent in web technology, looking at the page source it appears to be dynamically generated.

Therefore, I understood that a web driver such as Selenium would be required to scrape it instead of a simple GET request (which gets me the code generating the page instead of the actual result I view as a user).

However, it was unclear to me how exactly Selenium should be used in such a case. I saw that it allows searching by class, and that the smallest class containing what I required was "table table-bordered table-hover ". Under it there was a tbody element, containing a series of tr elements, each broken into td elements, and I can get the file name from the button onclick action in those.

I saw that the table I required had a tag with class="table table-bordered table-hover " (note the trailing space), so naturally I tried to look for it with:

driver.find_element(By.CLASS_NAME, 'table table-bordered table-hover ')

But I got an exception stating it was not found. Since it does exist, I assume I am misusing the find_element command. What is the right way to use it?

Note - I noticed that in this specific case, I can go to MainIO_Hok.aspx instead and get what I need, but I was wondering how I could have handled it directly from the page I was viewing.

yuvalm2
  • I discovered that in this specific case I was able to identify the buttons themselves, since each had a different ID (e.g. driver.find_element(By.ID, 'button0')), but I was hoping that my original approach could have worked as well. – yuvalm2 Jul 22 '23 at 14:36
  • Try using id to find the element, use the following command: `driver.find_element('id', 'myTable')`. It should work. – Ξένη Γήινος Jul 22 '23 at 17:02

3 Answers


The data you see in the table is fetched from an external URL via JavaScript, so you can simulate this request directly and get the data in JSON form:

import requests
import pandas as pd


# The endpoint the page queries via JavaScript (XHR):
data_url = 'https://www.kingstore.co.il/Food_Law/MainIO_Hok.aspx?_=1690053691921&WStore=&WDate=&WFileType=0'
data = requests.get(data_url).json()
df = pd.DataFrame(data)

df.pop('PathLogo')  # drop the logo-path column
# Build full download URLs from the file names:
df['url'] = 'https://www.kingstore.co.il/Food_Law/Download/' + df['FileNm']
print(df)

Prints:

                                   FileNm Company                                         Store TypeFile TypeExpFile          DateFile                                                                                   url
0  Price7290058108879-338-202307222101.gz       1  338 דוכאן חי אלוורוד                           מחירים          gz  21:01 22/07/2023  https://www.kingstore.co.il/Food_Law/Download/Price7290058108879-338-202307222101.gz
1  Price7290058108879-337-202307222101.gz       1  337 דוכאן אעבלין                               מחירים          gz  21:01 22/07/2023  https://www.kingstore.co.il/Food_Law/Download/Price7290058108879-337-202307222101.gz
2  Price7290058108879-336-202307222101.gz       1  336 דוכאן קלנסווה                              מחירים          gz  21:01 22/07/2023  https://www.kingstore.co.il/Food_Law/Download/Price7290058108879-336-202307222101.gz
3  Price7290058108879-335-202307222101.gz       1  335 דוכאן כפר ברא                              מחירים          gz  21:01 22/07/2023  https://www.kingstore.co.il/Food_Law/Download/Price7290058108879-335-202307222101.gz
4  Price7290058108879-334-202307222101.gz       1  334 דיר חנא זכיינות                            מחירים          gz  21:01 22/07/2023  https://www.kingstore.co.il/Food_Law/Download/Price7290058108879-334-202307222101.gz

...
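
If the end goal is the files themselves, the url column built above can be fed straight back into requests. A minimal sketch of a follow-up step (hypothetical continuation; assumes the df from the snippet above):

import requests

# Download the first listed file to disk, using the URL column built above:
row = df.iloc[0]
with open(row['FileNm'], 'wb') as f:
    f.write(requests.get(row['url']).content)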
Andrej Kesely

To scrape the table from the website, you need to induce WebDriverWait for visibility_of_element_located() and then load the table into a pandas DataFrame. You can use the following locator strategy:

Code Block:

from io import StringIO

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.kingstore.co.il/Food_Law/Main.aspx")
# Wait until the table is visible, then grab its HTML:
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.table.table-bordered.table-hover#myTable"))).get_attribute("outerHTML")
df = pd.read_html(StringIO(data))
print(df)
driver.quit()

Console Output:

[                                    שם קובץ                  סניף     סוג סיומת             תאריך      Unnamed: 5
0    Price7290058108879-338-202307230001.gz  338 דוכאן חי אלוורוד  מחירים    gz  00:01 23/07/2023  להורדה לחץ כאן
1    Price7290058108879-337-202307230001.gz      337 דוכאן אעבלין  מחירים    gz  00:01 23/07/2023  להורדה לחץ כאן
2    Price7290058108879-336-202307230001.gz     336 דוכאן קלנסווה  מחירים    gz  00:01 23/07/2023  להורדה לחץ כאן
3    Price7290058108879-335-202307230001.gz     335 דוכאן כפר ברא  מחירים    gz  00:01 23/07/2023  להורדה לחץ כאן
4    Price7290058108879-334-202307230001.gz   334 דיר חנא זכיינות  מחירים    gz  00:01 23/07/2023  להורדה לחץ כאן
..                                      ...                   ...     ...   ...               ...             ...
995  Price7290058108879-012-202307210401.gz               12 נצרת  מחירים    gz  04:01 21/07/2023  להורדה לחץ כאן
996  Price7290058108879-010-202307210401.gz      10 דליית אל כרמל  מחירים    gz  04:01 21/07/2023  להורדה לחץ כאן
997  Price7290058108879-008-202307210401.gz             8 באר שבע  מחירים    gz  04:01 21/07/2023  להורדה לחץ כאן
998  Price7290058108879-007-202307210401.gz               7 סכנין  מחירים    gz  04:01 21/07/2023  להורדה לחץ כאן
999  Price7290058108879-006-202307210401.gz               6 שפרעם  מחירים    gz  04:01 21/07/2023  להורדה לחץ כאן

[1000 rows x 6 columns]]
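
As a side note on the original attempt: By.CLASS_NAME accepts a single class name, while "table table-bordered table-hover " is three classes plus a trailing space, so find_element() can never match it. A minimal sketch of the two working alternatives, reusing the driver and By from the code block above:

# Match on any single one of the classes:
table = driver.find_element(By.CLASS_NAME, "table-hover")

# Or chain all of them in a CSS selector:
table = driver.find_element(By.CSS_SELECTOR, "table.table.table-bordered.table-hover")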

undetected Selenium

When a web page is generated dynamically, its content is loaded from somewhere. You can observe this process using your browser's developer tools, usually opened by pressing the F12 key. In the Network section you can see the requests the web page is making. Upon analyzing these requests, you will notice that one of them returns all the data needed to fill the table, meaning it queries an API. The URL for this request is as follows: https://www.kingstore.co.il/Food_Law/MainIO_Hok.aspx?_=1690053691921&WStore=&WDate=&WFileType=0.

Therefore, to obtain all the data, you can simply make a direct request to this API:

import requests

website = 'https://www.kingstore.co.il/Food_Law/MainIO_Hok.aspx?_=1690053691921&WStore=&WDate=&WFileType=0'
response = requests.get(website)

# Print the fields of each file entry returned by the API:
for item in response.json():
    print('FileNm: %s' % item['FileNm'])
    print('Company: %s' % item['Company'])
    print('Store: %s' % item['Store'])
    print('TypeFile: %s' % item['TypeFile'])
    print('TypeExpFile: %s' % item['TypeExpFile'])
    print('DateFile: %s' % item['DateFile'])
    print()

Output:

FileNm: Price7290058108879-200-202307221601.gz
Company: 1
Store: 200 ירושליים                                
TypeFile: מחירים
TypeExpFile: gz
DateFile: 16:01 22/07/2023

FileNm: Price7290058108879-019-202307221601.gz
Company: 1
Store: 19 רמלה                                    
TypeFile: מחירים
TypeExpFile: gz
DateFile: 16:01 22/07/2023

FileNm: Price7290058108879-018-202307221601.gz
Company: 1
Store: 18 יפת - יפו תל אביב                       
TypeFile: מחירים
TypeExpFile: gz
DateFile: 16:01 22/07/2023

FileNm: Price7290058108879-017-202307221601.gz
Company: 1
Store: 17 יפיע                                    
TypeFile: מחירים
TypeExpFile: gz
DateFile: 16:01 22/07/2023

...
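
Incidentally, the query string does not have to be hard-coded; requests can build it via the params argument. A minimal sketch, assuming the '_' parameter is a cache-busting timestamp and that WStore, WDate and WFileType mirror the page's filter fields (an assumption based on their names):

import time
import requests

# Rebuild the query string via params; '_' is assumed to be a
# cache-buster, and the W* fields the page's (empty) filters.
params = {
    '_': int(time.time() * 1000),
    'WStore': '',
    'WDate': '',
    'WFileType': 0,
}
response = requests.get('https://www.kingstore.co.il/Food_Law/MainIO_Hok.aspx', params=params)
print(len(response.json()), 'files listed')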
Martijn Pieters
erick