
I'm trying to scrape some information on cars from leboncoin.

I used a Jupyter notebook to get past Datadome. Here's my first cell:

import pandas as pd
import numpy as np
import time
import random
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait     
from selenium.webdriver.common.by import By     
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys


PATH = "chromedriver.exe"

options = webdriver.ChromeOptions() 

options.add_argument("--disable-gpu")
options.add_argument('enable-logging')
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

driver = webdriver.Chrome(options=options, executable_path=PATH)

driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})

url = 'https://www.leboncoin.fr/voitures/offres'

driver.get(url)

Here I switch manually to bypass the robot test, and then I run this:

# the consent banner may not appear, so guard the lookup as well as the click
try:
    cookie = driver.find_element_by_xpath('//*[@id="didomi-notice-agree-button"]')
    cookie.click()
except:
    pass

time.sleep(2)


car = driver.find_element_by_xpath('//input[@autocomplete="search-keyword-suggestions"]')
car.click()
car.send_keys('Peugeot')
car.send_keys(Keys.ENTER)
time.sleep(3)

and after that, I run this:

for x in range(2):
    time.sleep(3)

    links = driver.find_elements_by_class_name("styles_adCard__2YFTi")
    for l in links:
        data = l.text
        print(data)
        print()

    # the DOM is rebuilt after navigation, so re-find the "next page" link
    # each time; "next_page" also avoids shadowing the built-in next()
    next_page = driver.find_element_by_xpath('//a[@title="Page suivante"]')
    next_page.click()
    time.sleep(3)

Unfortunately I cannot figure out how to create a proper DataFrame, because the HTML structure is the same for all the objects I want, so I cannot separate them distinctly.

I obtain something like this:

5
Peugeot 2008 1,6l Blue-HDI 92cv à 7800e (Ann2015/Toit panoramique)
PRO
7 500 €
Année
2015
Kilométrage
100000 km
Carburant
Diesel
Boîte de vitesse
Manuelle
Baie-Mahault 97122

5
PEUGEOT 206 1.4 i 75 CV XR PRESENCE
PRO
3 990 €
Année
2002
Kilométrage
104152 km
Carburant
Essence
Boîte de vitesse
Manuelle
Châtellerault 86100

And I would like some sort of DataFrame: one row per ad, with columns like title, price, Année, Kilométrage, Carburant, Boîte de vitesse and location.

Usually I can solve this because the HTML structure distinguishes each element; here it's all the same, so I'm kind of lost.

1 Answer


Looking at the page source of "https://www.leboncoin.fr/voitures/offres", I see that the data is contained in HTML classes that have the same names. I understand this is the issue that you are referring to.

Both 'Year' and 'Kilometrage', for instance, are contained in classes of the same title:

> <span class="_137P- P4PEa _3j0OU">2020</span>

> <span class="_137P- P4PEa _3j0OU">40000 km</span>

The elements you are after all appear in a specific, constant order on the webpage.

In my experience, in tricky situations I have picked up each element from webpages separately using Beautiful Soup, navigating with `.parent`, `.next_sibling` / `.previous_sibling` and `.children`.
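A minimal sketch of that idea, run here on a stand-in snippet for one ad card (the class names are the ones quoted in this thread and may have changed on the live site):

```python
from bs4 import BeautifulSoup

# stand-in HTML for one ad card, using the class names from the question
html = """
<div class="styles_adCard__2YFTi">
  <p><span class="_137P- P4PEa _3j0OU">2015</span></p>
  <p><span class="_137P- P4PEa _3j0OU">100000 km</span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="styles_adCard__2YFTi")

# the spans carry identical class attributes, but their order inside the
# card is fixed, so positional access disambiguates them
spans = card.find_all("span")
year, mileage = spans[0].get_text(), spans[1].get_text()
print(year, mileage)  # 2015 100000 km
```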

I believe you should still be able to pick them up using Selenium, with an xpath stating which element in the list of siblings you are after. For example:

//*[@id="container"]/main/div/div[1]/div[6]/div/div[5]/div[1]/div[1]/div[1]/div[2]/a/div/div[2]/div[1]/div[4]/div/p[1]/span

The final p[n] in the above xpath string is the element you are after:

p[1] = 'Year' 
p[2] = 'Kilometrage'
p[3] = etc.
  • I tried `driver.find_elements_by_xpath('[@id="container"]/main/div/div[1]/div[6]/div/div[5]/div[1]/div[1]/div[1]/div[2]/a/div/div[2]/div[1]/div[4]/div/p[1]/span')` but it doesn't work :/ – RandallCloud Aug 05 '21 at 14:13
  • Add //* to the start of the string. – Andrey Popov Aug 05 '21 at 14:24
  • I tried this `links = driver.find_elements_by_xpath('//*[@id="container"]/main/div/div[1]/div[6]/div/div[5]/div[1]/div[1]/div[1]/div[2]/a/div/div[2]/div[1]/div[4]/div/p[1]/span') for l in links: data = l.text print(data)` but it doesn't print anything. – RandallCloud Aug 05 '21 at 14:47
  • I think you should be finding a single element (not elements), hence driver.find_element_by_xpath('xpath').text = 'Year' – Andrey Popov Aug 05 '21 at 14:53
  • `links = driver.find_element_by_xpath('//*[@id="container"]/main/div/div[1]/div[6]/div/div[5]/div[1]/div[1]/div[1]/div[2]/a/div/div[2]/div[1]/div[4]/div/p[1]/span').text print(links)` But again doesn't print anything – RandallCloud Aug 05 '21 at 14:56
  • I copy the xpath from the website and it worked but I just have the first, how can I have all the data from the page ? – RandallCloud Aug 05 '21 at 15:00
  • I was able to get the year value with the above string in question. Therefore my input was: driver.find_element_by_xpath('//*[@id="container"]/main/div/div[1]/div[6]/div/div[5]/div[1]/div[1]/div[1]/div[2]/a/div/div[2]/div[1]/div[4]/div/p[1]/span').text which have the output of '2011'. It is possible your page is different to mine - use Google Chrome => Right click on the element you want => Inspect => Hoover over HTML snippet => Copy => XPath. – Andrey Popov Aug 05 '21 at 15:08
  • I would construct xpath string for each element, e.g: for j in range(4): item = str(j+1) xpath_string = '//*[@id="container"]/div/div/div/div/div/p['+item+']/span' print(xpath_string) You would need to copy xpaths of elements you are after, and then construct path for each one you are after on the page. – Andrey Popov Aug 05 '21 at 15:13
  • I use your solution updated with my code and I have something! It needs work but I think it'll work – RandallCloud Aug 10 '21 at 10:58
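The xpath-string construction suggested in the comments can be sketched like this. The long path is the one quoted in the answer and is only illustrative, and the assumption that p[3] and p[4] map to fuel and gearbox follows the field order printed in the question:

```python
# build one xpath per field by varying the p[n] index, as suggested above
base = ('//*[@id="container"]/main/div/div[1]/div[6]/div/div[5]/div[1]'
        '/div[1]/div[1]/div[2]/a/div/div[2]/div[1]/div[4]/div/p[{}]/span')

fields = ["year", "mileage", "fuel", "gearbox"]  # p[1] .. p[4], assumed order
xpaths = {field: base.format(i + 1) for i, field in enumerate(fields)}

for field, xp in xpaths.items():
    print(field, "->", xp)
    # in Selenium: value = driver.find_element_by_xpath(xp).text
```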