Hi. I'm trying to scrape infinite scrolling website. It stuck in 200th data

Question

I scrolled with selenium and grabbed all urls and used these urls in beautifulsoup.But there are so many duplicates in scraped data.I tried to left them with drop_duplicates but it stack in about 200th data .I cannot detect the problem. I add the code which i use. I want to grab all prices,areas,rooms et.c.



import requests

from lxml import html

from bs4 import BeautifulSoup as bs
import bs4
import pandas as pd


from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from lxml import html
import pandas as pd
import time

driver = webdriver.Chrome(r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe')
driver.get('https://tap.az/elanlar/dasinmaz-emlak/menziller')
time.sleep(1)

price = []
citi = []
elann = []
bina = []
arrea = []
adres = []
roome = []
baxhise = []
mulkayet = []
descript = []
urll = []
zefer = []
previous_height = driver.execute_script('return document.body.scrollHeight')


while True:
    
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    
    time.sleep(2)
    
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == previous_height:
        break 
    previous_height = new_height
    lnks=driver.find_elements(By.CSS_SELECTOR, '#content > div > div > div.categories-products.js-categories-products > div.js-endless-container.products.endless-products > div.products-i')
    for itema in lnks:
        urla=itema.find_element(By.TAG_NAME, 'a')  
        aae = (urla.get_attribute('href'))
        urel = aae.split('/bookmark')[0]
        result = requests.get(urel)
        soup = bs(result.text, 'html.parser')
        casee = soup.find_all("div",{"class":"lot-body l-center"})
        for ae in casee:        
            c =  ae.find_all('table', class_ = 'properties')
            pp = c[0].text
            city = pp.split('Şəhər')[-1].split('Elanın')[0].replace('ş' ,'sh').replace('ə' ,'e').replace('ü' ,'u').replace('ö' ,'o').replace('ı' ,'i').replace('ğ' ,'g').replace('ç' ,'ch').replace('Ç', 'ch').replace('Ş', 'sh').replace('Ə' ,'e').replace('Ü' ,'u').replace('Ö' ,'o').replace('İ', 'I')
            cxe = c[0].text
            elan_tipi = cxe.split('Elanın tipi')[-1].split('Binanın tipi')[0].replace(' verilir','')
            elane = elan_tipi.replace(' ', '_').replace('ş' ,'sh').replace('ə' ,'e').replace('ü' ,'u').replace('ö' ,'o').replace('ı' ,'i').replace('ğ' ,'g').replace('ç' ,'ch').replace('Ç', 'ch').replace('Ş', 'sh').replace('Ə' ,'e').replace('Ü' ,'u').replace('Ö' ,'o').replace('İ', 'I')
            cx = c[0].text
            bina_tipi = cx.split('Binanın tipi')[-1].split('Sahə')[0].replace(' ', '_').replace('ş' ,'sh').replace('ə' ,'e').replace('ü' ,'u').replace('ö' ,'o').replace('ı' ,'i').replace('ğ' ,'g').replace('ç' ,'ch').replace('Ç', 'ch').replace('Ş', 'sh').replace('Ə' ,'e').replace('Ü' ,'u').replace('Ö' ,'o').replace('İ', 'I')
            cx = c[0].text
            area = cx.split('tikiliSahə,')[-1].split('Otaq')[0].replace('m²', '')                         
            cx = c[0].text
            room = cx.split('Otaq sayı')[-1].split('Yerləşmə yeri')[0]
            cx = c[0].text
            addresss = cx.split('Yerləşmə yeri')[-1].replace('ş' ,'sh').replace('ə' ,'e').replace('ü' ,'u').replace('ö' ,'o').replace('ı' ,'i').replace('ğ' ,'g').replace('ç' ,'ch').replace('Ç', 'ch').replace('Ş', 'sh').replace('Ə' ,'e').replace('Ü' ,'u').replace('Ö' ,'o').replace('İ', 'I')
            d = ae.find_all('p')
            elan_kod = (d[0].text.replace('Elanın nömrəsi:', ''))
            d = ae.find_all('p')
            baxhis = d[1].text.replace('Baxışların sayı: ', '')
            d = ae.find_all('p')
            description = (d[3].text.replace('Baxışların sayı: ', '').replace('ş' ,'sh').replace('ə' ,'e').replace('ü' ,'u').replace('ö' ,'o').replace('ı' ,'i').replace('ğ' ,'g').replace('ç' ,'ch').replace('Ç', 'ch').replace('Ş', 'sh').replace('Ə' ,'e').replace('Ü' ,'u').replace('Ö' ,'o').replace('İ', 'I').replace("\n", ''))
            kim =  ae.find_all('div', class_ = 'author')
            kime = kim[0].text
            if 'bütün' in kime:
                mulkiyet = int(0)
            else:
                mulkiyet = int(1)
        caseee = soup.find_all("div",{"class":"middle"})
        for aecex in caseee:        
            pricxxe =  aecex.find_all('span', class_ = 'price-val') 
            pricef = pricxxe[0].text.replace(' ' , '')
            price.append(pricef)
            zefer.append(elane)
            elann.append(elan_kod)
            citi.append(city)
            bina.append(bina_tipi)
            arrea.append(area)
            adres.append(addresss)
            roome.append(room)
            baxhise.append(baxhis)
            mulkayet.append(mulkiyet)
            descript.append(description)
            ae = pd.DataFrame({'URL': urel,'Unique_id': elann,'Price': price,'Room': roome,'Area': arrea,'Seher': citi,'Elan_tipi': zefer,'Description': descript,'Address': adres,'Category': bina,'Mulkiyyet': mulkayet})
            aere = ae.drop_duplicates()
            aere.to_csv('dde.csv', index=False, encoding='utf-8' )

You should use `from webdriver_manager.chrome import ChromeDriverManager` in place of `webdriver.Chrome(path_name_to_webdriver.exe)`. — D.L, Nov 28 '22 at 11:19
what do you mean by it's stuck? is there an error? or does the program just hang? — Driftr95, Nov 28 '22 at 21:30
@D.L in place of webdriver.Chrome(path_name_to_webdriver.exe) is an error could you please give more explanation that. thanks — Rasim Dilbani, Nov 30 '22 at 07:21
@ElxanCabbarli : this might not solve, but it is a cleaner way to access the latest version of chromedriver which reduces error rate. https://stackoverflow.com/questions/71603374/webdriverexception-unknown-error-cannot-find-chrome-binary-error-when-trying-t/71626868#71626868 — D.L, Nov 30 '22 at 07:29

Driftr95 · Accepted Answer · 2022-12-01T12:29:04.790

A cause of duplicates is that every time you get lnks, you're getting the products you scraped before scrolling as well. You can probably skip duplicate scrapes by initiating scrapedUrls = [] somewhere at the beginning of your code (OUTSIDE of all loops), and then checking urel against it, as well as adding to it

        if urel in scrapedUrls: continue ## add this line
        result = requests.get(urel) ## from your code
        scrapedUrls.append(urel) ## add this line

but I'm not sure it'll solve your issue.

I don't know why it's happening, but when I try to scrape the links with selenium's find_elements, I get the same url over and over; so I wrote a fuction [getUniqLinks] that you can use to get a unique list of links (prodUrls) by scrolling up to a certain number of times and then parsing page_source to BeautifulSoup. Below are two lines from the printed output of prodUrls = getUniqLinks(fullUrl, rootUrl, max_scrolls=250, tmo=1):

WITH SELENIUM found 10957 product links [1 unique] 
 
PARSED PAGE_SOURCE  --->  found   12583 product links   [12576 unique]

(The full function and printed output are at https://pastebin.com/b3gwUAJZ.)

Some notes:

If you increase tmo, you can increase max_scrolls too, but it starts getting quite slow after 100 scrolls.
I used selenium to get links as well, just to print and show the difference, but you can remove all lines that end with # remove to get rid of those unnecessary parts.
I used selenium's WebDriverWait instead of time.sleep because it stops waiting after the relevant elements have loaded - it raises an error if it doesn't load it the allowed time (tmo), so I found it more convenient and readable to use in a try...except block instead of using driver.implicitly_wait
I don't know if this is related to whatever is causing your program to hang [since mine is probably just because of the number of elements being too many], but mine also hangs if I try to use selenium to get all the links after scrolling instead of adding to prodLinks in chunks inside the loop.

Now, you can loop through prodUrls and get the data you want, but I think it's better to build a list with a separate dictionary for each link [i.e., having a dictionary for each row rather than having a separate list for each column].

If you use these two functions, then you just have to prepare a reference dictionary of selectors like

refDict = {
    'title': 'h1.js-lot-title',
    'price_text': 'div.price-container',
    'price_amt': 'div.price-container > .price span.price-val',
    'price_cur': 'div.price-container > .price span.price-cur',
    '.lot-text tr.property': {'k':'td.property-name', 'v':'td.property-value'},
    'contact_name': '.author > div.name',
    'contact_phone': '.author > a.phone',
    'lot_warning': 'div.lot-warning',
    'div.lot-info': {'sel': 'p', 'sep': ':'},
    'description': '.lot-text p'
}

that can be passed to fillDict_fromTag like in the code below:

## FIRST PASTE FUNTION DEFINITIONS FROM https://pastebin.com/hKXYetmj

productDetails = []
puLen = len(prodUrls)
for pi, pUrl in enumerate(prodUrls[:500]):
    print('', end=f'\rScraping [for {pi+1} of {puLen}] {pUrl}')
    pDets = {'prodId': [w for w in pUrl.split('/') if w][-1]}

    resp = requests.get(pUrl)
    if resp.status_code != 200:
        pDets['Error_Message'] = f'{resp.raise_for_status()}'
        pDets['sourceUrl'] = pUrl
        productDetails.append(pDets) 
        continue
    
    pSoup = BeautifulSoup(resp.content, 'html.parser')
    pDets = fillDict_fromTag(pSoup, refDict, pDets, rootUrl)

    pDets['sourceUrl'] = pUrl
    productDetails.append(pDets)
print()
prodDf = pd.DataFrame(productDetails).set_index('prodId')
prodDf.to_csv('ProductDetails.csv')

I have uploaded both 'prodLinks.csv' and 'ProductDetails.csv' here, although there are only the first 500 scrapes' results since I manually interrupted after around 20 minutes; I'm also pasting the first 3 rows here (printed with print(prodDf.loc[prodDf.index[:3]].to_markdown()))

|   prodId | title                                                    | price_text   | price_amt   | price_cur   | Şəhər   | Elanın tipi    | Elanın tipi [link]                                              | Binanın tipi   | Binanın tipi [link]                                             |   Sahə, m² |   Otaq sayı | Yerləşmə yeri   | contact_name   | contact_phone   | lot_warning                                                                  |   Elanın nömrəsi |   Baxışların sayı | Yeniləndi      | description                                                                                                                                                                                                                                                                                                                                                                    | sourceUrl                                                |
|---------:|:---------------------------------------------------------|:-------------|:------------|:------------|:--------|:---------------|:----------------------------------------------------------------|:---------------|:----------------------------------------------------------------|-----------:|------------:|:----------------|:---------------|:----------------|:-----------------------------------------------------------------------------|-----------------:|------------------:|:---------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------|
| 35828514 | 2-otaqlı yeni tikili kirayə verilir, 20 Yanvar m., 45 m² | 600 AZN      | 600         | AZN         | Bakı    | Kirayə verilir | https://tap.az/elanlar/dasinmaz-emlak/menziller?p%5B740%5D=3724 | Yeni tikili    | https://tap.az/elanlar/dasinmaz-emlak/menziller?p%5B747%5D=3849 |         45 |           2 | 20 Yanvar m.    | Elşad Bəy      | (055) 568-12-13 | Diqqət! Beh göndərməmişdən öncə sövdələşmənin təhlükəsiz olduğuna əmin olun! |         35828514 |               105 | 22 Noyabr 2022 | 20 Yanvar metrosuna və Inşatcılar metrosuna 8 - 11 dəiqqə arası olan ərazidə, yeni tikili binada 1 otaq 2 otaq təmirli şəraitiynən mənzil kirayə 600 manata, ailiyə və iş adamına verilir. Qabaqçadan 2 ay ödəniş olsa kamendant pulu daxil, ayı 600 manat olaçaq, mənzili götūrən şəxs 1 ayın 20 % vasitəciyə ödəniş etməlidir. Xahìş olunur, rial olmuyan şəxs zəng etməsin. | https://tap.az/elanlar/dasinmaz-emlak/menziller/35828514 |
| 35833080 | 1-otaqlı yeni tikili kirayə verilir, Quba r., 60 m²      | 40 AZN       | 40          | AZN         | Quba    | Kirayə verilir | https://tap.az/elanlar/dasinmaz-emlak/menziller?p%5B740%5D=3724 | Yeni tikili    | https://tap.az/elanlar/dasinmaz-emlak/menziller?p%5B747%5D=3849 |         60 |           1 | Quba r.         | Orxan          | (050) 604-27-60 | Diqqət! Beh göndərməmişdən öncə sövdələşmənin təhlükəsiz olduğuna əmin olun! |         35833080 |               114 | 22 Noyabr 2022 | Quba merkezde her weraiti olan GUNLUK KIRAYE EV.Daimi isti soyuq su hamam metbex wifi.iwciler ve aile ucun elveriwlidir Təmirli                                                                                                                                                                                                                                                | https://tap.az/elanlar/dasinmaz-emlak/menziller/35833080 |
| 35898353 | 4-otaqlı mənzil, Nizami r., 100 m²                       | 153 000 AZN  | 153 000     | AZN         | Bakı    | Satılır        | https://tap.az/elanlar/dasinmaz-emlak/menziller?p%5B740%5D=3722 | Köhnə tikili   | https://tap.az/elanlar/dasinmaz-emlak/menziller?p%5B747%5D=3850 |        100 |           4 | Nizami r.       | Araz M         | (070) 723-54-50 | Diqqət! Beh göndərməmişdən öncə sövdələşmənin təhlükəsiz olduğuna əmin olun! |         35898353 |                71 | 27 Noyabr 2022 | X.Dostluğu metrosuna 2 deq mesafede Leninqrad lahiyeli 9 mərtəbəli binanın 5-ci mərtəbəsində 4 otaqlı yaxsi temirli mənzil satılır.Əmlak ofisinə ödəniş alıcı tərəfindən məbləğin 1%-ni təşkil edir.                                                                                                                                                                           | https://tap.az/elanlar/dasinmaz-emlak/menziller/35898353 |

@Drift95 Firstly Thank you so much. How did you get Product detail 'ProductDetails.csv' just have links and code, which is in https://pastebin.com/b3gwUAJZ just gave links. Could you please share code which's output is product details. — Rasim Dilbani, Dec 01 '22 at 05:15
I want to get details always but every time I had to get detail by detail. Is there any way to do it just having all details in that way.If you advice me some videos or readings about that I would be happy. @Driftr95 — Rasim Dilbani, Dec 01 '22 at 05:18
So sorry, I did not see you pasted it actually. So thanks again. — Rasim Dilbani, Dec 01 '22 at 05:55

Hi. I'm trying to scrape infinite scrolling website. It stuck in 200th data

1 Answers1

Linked