
I am trying to build an interactive dashboard with analysis based on car listings. I would like the user to be able to pick a car brand, for example BMW, Audi, etc., and based on that choice be able to pick only BMW/Audi etc. models. My problem is that after selecting a brand, I am not able to scrape the models that belong to it. The page I am scraping from:
main page --> https://www.otomoto.pl/osobowe/
example brand sub-page --> https://www.otomoto.pl/osobowe/audi/

I have tried to scrape every option element, so that later on I can clean the data to keep only the models.

code:

library(rvest)

otomoto_models <- paste0("https://www.otomoto.pl/osobowe/", "audi/")
models <- read_html(otomoto_models) %>%
   html_nodes("option") %>%
   html_text()

But it is just scraping the brands together with the other options available on the page (engine type etc.). Yet after inspecting the element I can clearly see the model names.

otomoto <- "https://www.otomoto.pl/osobowe/"


brands <- read_html(otomoto) %>%
  html_nodes("option") %>%
  html_text() 

brands <- data.frame(brands)

# find the row just after the "Marka pojazdu" placeholder (the first real brand)
for (i in 1:nrow(brands)){
  no_marka_pojazdu <- i
    if(brands[i,1] == "Marka pojazdu"){
      break
    }
}
no_marka_pojazdu <- no_marka_pojazdu + 1

# find the row of the last brand, "Żuk"
for (i in 1:nrow(brands)){
  zuk <- i
  if(substr(brands[i,1],1,3) == "Żuk"){
    break
  }
}

library(tm)  # for removeNumbers()

Modele_pojazdow <- as.character(brands[no_marka_pojazdu:zuk,1])
Modele_pojazdow <- removeNumbers(Modele_pojazdow)  # strip the listing counts
Modele_pojazdow <- substr(Modele_pojazdow,1,nchar(Modele_pojazdow)-2)  # drop leftover trailing characters
Modele_pojazdow <- data.frame(Modele_pojazdow)

The code above is only there to pick the supported car brands from the webpage and store them in a data frame. With that I am able to create the html link and direct everything to one selected brand.
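For example, a minimal sketch of how I build the link (audi stands in for whatever brand was picked; the scraped display name may need lower-casing to match the URL slug):

brand <- tolower("Audi")  # display name taken from the brands data frame
brand_url <- paste0("https://www.otomoto.pl/osobowe/", brand, "/")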

I would like to have an object similar to Modele_pojazdow, but with the models limited to the previously selected car brand.

The dropdown list with models appears as a white box with the text "Model pojazdu", next to the "Audi" box on the right side.

1 Answer

Some may frown on the solution language being Python, but the aim of this was to give some pointers (the high-level process). I haven't written R in a long time, so Python was quicker.

EDIT: R script now added

General outline:

The first dropdown options can be grabbed from the value attribute of each node returned by using a css selector of #param571 option. This uses an id selector (#) to target the parent dropdown select element, and then an option type selector in descendant combination, to specify the option tag elements within. The html to apply this selector combination to can be retrieved with an xhr request to the url you initially provided. You want a nodeList returned to iterate over; akin to applying the selector with JavaScript's document.querySelectorAll.
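A hedged R sketch of this step (rvest assumed; the #param571 id comes from inspecting the page):

library(rvest)

page <- read_html("https://www.otomoto.pl/osobowe/")
makes <- page %>%
  html_nodes("#param571 option") %>%  # nodeList of option elements
  html_attr("value")                  # value attribute of each node
makes <- makes[makes != ""]           # drop the empty placeholder option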

The page uses ajax POST requests to update the second dropdown based on your first dropdown choice. Your first dropdown choice determines the value of a parameter search[filter_enum_make], which is used in the POST request to the server. The subsequent response contains a list of the available options (it includes some case alternatives which can be trimmed out).
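A minimal R sketch of that POST, assuming httr and the same form-encoded parameters shown in the Python demo further down:

library(httr)

resp <- POST(
  "https://www.otomoto.pl/ajax/search/list/",
  body = list(
    `search[filter_enum_make]` = "alfa-romeo",
    `search[dist]` = "5",
    `search[category_id]` = "29"
  ),
  encode = "form",                           # form-encoded, as the page sends it
  add_headers(`User-Agent` = "Mozilla/5.0")  # browser-like UA; the bare default may be blocked
)
txt <- content(resp, "text", encoding = "UTF-8")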

I captured the POST request by using fiddler. This showed me the request headers and params in the request body. Screenshot sample shown at end.

The simplest way to extract the options from the response text, IMO, is to regex the appropriate string out (I wouldn't normally recommend regex for working with html but in this case it serves us nicely). If you don't want to use regex, you can grab the relevant info from the data-facets attribute of the element with id body-container. For the non-regex version you need to handle unquoted nulls, and retrieve the inner dictionary whose key is filter_enum_model. I show a function re-write, at the end, to handle this.
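In R, a sketch of the regex route might look like this (stringr assumed; txt is the response body from the POST sketch above):

library(stringr)

# capture everything between "filter_enum_model": and ,"new_used"
pattern <- regex('"filter_enum_model":(.*?),"new_used"', dotall = TRUE)
model_json <- str_match(txt, pattern)[, 2]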

The retrieved string is a string representation of a dictionary. This needs converting to an actual dictionary object, which you can then extract the option values from. Edit: as R doesn't have a dictionary object, a similar structure needs to be found. I will look at this when converting.
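In R the natural stand-in is a named list/vector, which jsonlite gives you directly (a sketch, continuing from model_json above):

library(jsonlite)

model_counts <- fromJSON(model_json)            # named: model -> listing count
models <- names(model_counts)
models <- models[!duplicated(tolower(models))]  # trim the case alternatives
models <- models[models != "other"]             # 'other' doesn't appear in the dropdown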

I create a user-defined function, getOptions(), to return the options for each make. Each car make value comes from the list of possible items in the first dropdown. I loop over those possible values, use the function to return a list of options for that make, and add those lists as values to a dictionary, results, whose keys are the make of car. Again, for R an object with similar functionality to a Python dictionary (a named list, for example) needs to be found.
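Putting the sketches above together, a hedged R version of getOptions() with a named list standing in for the dictionary (makes comes from the rvest sketch earlier):

library(httr)
library(stringr)
library(jsonlite)

getOptions <- function(make) {
  resp <- POST(
    "https://www.otomoto.pl/ajax/search/list/",
    body = list(
      `search[filter_enum_make]` = make,
      `search[dist]` = "5",
      `search[category_id]` = "29"
    ),
    encode = "form",
    add_headers(`User-Agent` = "Mozilla/5.0")
  )
  txt <- content(resp, "text", encoding = "UTF-8")
  m <- str_match(txt, regex('"filter_enum_model":(.*?),"new_used"', dotall = TRUE))[, 2]
  if (is.na(m)) return(character(0))              # no models for this make
  models <- names(fromJSON(m))
  models <- models[!duplicated(tolower(models))]  # trim case alternatives
  models[models != "other"]
}

results <- setNames(lapply(makes, getOptions), makes)  # named list: make -> models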

That dictionary of lists needs converting to a dataframe which includes a transpose operation to make a tidy output of headers, which are the car makes, and columns underneath each header, which contain the associated models.

The whole thing can be written to csv at the end.
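In R the same tidy layout can be had without an explicit transpose: pad the ragged model vectors to equal length, bind them into a data frame with one column per make, and write the csv (a sketch, using the results named list from above):

max_len <- max(lengths(results))
padded <- lapply(results, function(x) c(x, rep("", max_len - length(x))))
df <- as.data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)

write.csv(df, "Data.csv", row.names = FALSE, fileEncoding = "UTF-8")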

So, hopefully that gives you an idea of one way to achieve what you want. Perhaps someone else can use this to help write you a solution.

Python demonstration of this below:

import requests
from bs4 import BeautifulSoup as bs
import re
import ast
import pandas as pd

headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
}


def getOptions(make):  #function to return options based on make
    data = {
             'search[filter_enum_make]': make,
             'search[dist]' : '5',
             'search[category_id]' : '29'
            }

    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)   
    try:
        # verify the regex here: https://regex101.com/r/emvqXs/1
        data = re.search(r'"filter_enum_model":(.*),"new_used"', r.text ,flags=re.DOTALL).group(1) #regex to extract the string containing the models associated with the car make filter 
        aDict = ast.literal_eval(data) #convert string representation of dictionary to python dictionary
        d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
        dirtyList = list(aDict)[:d] #trim to unique values
        cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    except Exception:
        cleanedList = [] # sometimes there are no associated values in 2nd dropdown
    return cleanedList

r = requests.get('https://www.otomoto.pl/osobowe/')
soup = bs(r.content, 'lxml')
values = [item['value'] for item in soup.select('#param571 option') if item['value'] != '']

results = {}
# build a dictionary of lists to hold options for each make
for value in values:
    results[value] = getOptions(value) #function call to return options based on make

# turn into a dataframe and transpose so each column header is the make and the options are listed below
df = pd.DataFrame.from_dict(results,orient='index').transpose()

#write to csv
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8-sig',index = False )

Sample of csv output: (screenshot omitted)


Example of regex match for alfa-romeo:

{"145":1,"146":1,"147":218,"155":1,"156":118,"159":559,"164":2,"166":39,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":89,"GTV":7,"Giulia":251,"Giulietta":378,"Mito":224,"Spider":24,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":378,"gt":89,"gtv":7,"mito":224,"spider":24,"sportwagon":2,"stelvio":242}

Example of the filter option list returned from function call with make parameter value alfa-romeo:

['145', '146', '147', '155', '156', '159', '164', '166', '33', 'Alfasud', 'Brera', 'Crosswagon', 'GT', 'GTV', 'Giulia', 'Giulietta', 'Mito', 'Spider', 'Sportwagon', 'Stelvio']

Sample of fiddler request: (screenshot omitted)


Sample of ajax response html containing options:

<section id="body-container" class="om-offers-list"
        data-facets='{"offer_seek":{"offer":2198},"private_business":{"business":1326,"private":872,"all":2198},"categories":{"29":2198,"161":953,"163":953},"categoriesParent":[],"filter_enum_model":{"145":1,"146":1,"147":219,"155":1,"156":116,"159":561,"164":2,"166":37,"33":1,"Alfasud":2,"Brera":34,"Crosswagon":2,"GT":88,"GTV":7,"Giulia":251,"Giulietta":380,"Mito":226,"Spider":25,"Sportwagon":2,"Stelvio":242,"alfasud":2,"brera":34,"crosswagon":2,"giulia":251,"giulietta":380,"gt":88,"gtv":7,"mito":226,"spider":25,"sportwagon":2,"stelvio":242},"new_used":{"new":371,"used":1827,"all":2198},"sellout":null}'
        data-showfacets=""
        data-pagetitle="Alfa Romeo samochody osobowe - otomoto.pl"
        data-ajaxurl="https://www.otomoto.pl/osobowe/alfa-romeo/?search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
        data-searchid=""
        data-keys=''
        data-vars=""

Alternative version of function without regex:

import requests
import ast
from bs4 import BeautifulSoup as bs

def getOptions(make):  #function to return options based on make; headers dict as defined earlier
    data = {
             'search[filter_enum_make]': make,
             'search[dist]' : '5',
             'search[category_id]' : '29'
            }

    r = requests.post('https://www.otomoto.pl/ajax/search/list/', data = data, headers = headers)   
    soup = bs(r.content, 'lxml')
    data = soup.select_one('#body-container')['data-facets'].replace('null','"null"')
    aDict = ast.literal_eval(data)['filter_enum_model'] #convert string representation of dictionary to python dictionary
    d = len({k.lower(): v for k, v in aDict.items()}.keys()) #find length of unique keys when accounting for case
    dirtyList = list(aDict)[:d] #trim to unique values
    cleanedList = [item for item in dirtyList if item != 'other' ] #remove 'other' as doesn't appear in dropdown
    return cleanedList

print(getOptions('alfa-romeo'))
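A hedged R analogue of this non-regex route: jsonlite parses the unquoted nulls natively, so the '"null"' substitution the Python version needs is unnecessary (txt is the POST response body from the R sketch further up):

library(rvest)
library(jsonlite)

page <- read_html(txt)
facets <- page %>%
  html_node("#body-container") %>%
  html_attr("data-facets")        # attribute holding the inner dictionary

models <- names(fromJSON(facets)$filter_enum_model)
models <- models[!duplicated(tolower(models))]
models <- models[models != "other"]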

R conversion and improved python:

Whilst converting to R I found a better way of extracting the parameters from a js file on the server. If you open dev tools you can see the file listed in the sources tab.

R (To be improved):

library(httr)
library(jsonlite)

url <- 'https://www.otomoto.pl/ajax/jsdata/params/'
r <- GET(url)
contents <- content(r, "text")

# cut the JSON literal assigned to searchConditions out of the js response
data <- strsplit(contents, "var searchConditions = ")[[1]][2]
data <- strsplit(as.character(data), ";var searchCondition")[[1]][1]

source <- fromJSON(data)$values$'573'$'571'   # makes live under values -> 573 -> 571
makes <- names(source)

for(make in makes){
  print(make)
  print(source[make][[1]]$value)
  #break
 }
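A hedged completion of the script, mirroring the Python version below: each element of source is assumed to be a data frame with a value column (as the loop above suggests), so the models per make can be collected, padded, and written to csv:

results <- lapply(source, function(x) x$value)   # named list: make -> models
max_len <- max(lengths(results))
padded <- lapply(results, function(x) c(x, rep("", max_len - length(x))))
df <- as.data.frame(padded, check.names = FALSE, stringsAsFactors = FALSE)
write.csv(df, "Data.csv", row.names = FALSE, fileEncoding = "UTF-8")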

Python:

import requests
import json
import pandas as pd

r = requests.get('https://www.otomoto.pl/ajax/jsdata/params/')
data = r.text.split('var searchConditions = ')[1]
data = data.split(';var searchCondition')[0]
items = json.loads(data)
source = items['values']['573']['571']
makes = [item for item in source]

results = {}

for make in makes:
    df = pd.DataFrame(source[make]) ## build a dictionary of lists to hold options for each make
    results[make]  = list(df['value'])

dfFinal = pd.DataFrame.from_dict(results,orient='index').transpose()  # turn into a dataframe and transpose so each column header is the make and the options are listed below

mask = dfFinal.applymap(lambda x: x is None) #tidy up None values to empty strings https://stackoverflow.com/a/31295814/6241235
cols = dfFinal.columns[(mask).any()]

for col in dfFinal[cols]:
    dfFinal.loc[mask[col], col] = ''
print(dfFinal)
QHarr
  • Hey, very interesting topic. I tried to use this code in Python but I have an error at: `soup = bs(r.content, 'lxml')`. How can I resolve this issue? – DarkousPl Mar 28 '19 at 12:20
  • Am part way through the R conversion but for some reason the post request is hitting a 503 with R so I guess I am doing something wrong with the passing of params in the body. Works fine with Python though. – QHarr Mar 28 '19 at 12:25
  • You need to ensure that the import statement is at the top of all the code: `from bs4 import BeautifulSoup as bs` – QHarr Mar 28 '19 at 12:26
  • First: I changed the Chrome version to my own, correct? The first error occurs at: `line 29, in bs(r.content, 'lxml')`. The second error: `line 196, in __init__ % ",".join(features)) bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?` – DarkousPl Mar 28 '19 at 12:27
  • chrome? You could try using a different html parser e.g. soup = bs(r.content, 'html.parser') also try installing lxml – QHarr Mar 28 '19 at 12:28
  • Thanks, it works with the new parser, but one, I hope, last question: the code returns an empty file. Is 'make' the variable where I should declare the brand, or another string? I use your first Python code – DarkousPl Mar 28 '19 at 12:34
  • Sorry. I don't understand the last question. Did you update the output filepath? – QHarr Mar 28 '19 at 12:44
  • Yes, of course, I updated it. The code works without errors but returns 0 records, an empty file. – DarkousPl Mar 28 '19 at 12:47
  • Update: I used the debugger and I noticed one thing: at `soup = bs(r.content, 'html.parser')` the debugger returns the log: access denied. Do you have the same? (I use PyCharm) – DarkousPl Mar 28 '19 at 12:52
  • No. Runs fine for me. I am running from Jupyter anaconda – QHarr Mar 28 '19 at 12:54
  • Ok, I will try again at home - maybe our company filtered connection. And I also will use Jupyter. Big thanks for help! – DarkousPl Mar 28 '19 at 13:02
  • No worries. Let me know how it goes. You might be able to help me solve the R part! – QHarr Mar 28 '19 at 13:07
  • Unfortunately, I do not know R, I'm still learning Python (hence so many questions). `r = requests.get()` returns Response[403]. Strange, I can connect from PyCharm to the internet, and the otomoto website works in Chrome, but the debugger still says access denied. – DarkousPl Mar 28 '19 at 13:49
  • I'm getting the same result as you today. I will need to debug this. – QHarr Mar 28 '19 at 13:54
  • What a pity! I'm surprised that they can detect those requests. – DarkousPl Mar 28 '19 at 14:21
  • Seems to be running again today. @DarkousPl So not sure if short term block or website issue – QHarr Mar 29 '19 at 12:07
  • I had a problem when the code got all makes, but I narrowed the range to audi. And I have the next case: when you open "https://www.otomoto.pl/osobowe/audi/", how can I decode the website and download all elements like price, production year and others? – DarkousPl Mar 29 '19 at 13:02
  • I noticed currently that sometimes there can be no associated values for the second dropdown, so I have updated the code (top version) to handle this – QHarr Mar 29 '19 at 14:05
  • OK, I took it. Thanks QHarr. How do you decode the structure of a website? I would be happy to train myself on that solution – DarkousPl Mar 29 '19 at 14:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/190920/discussion-between-qharr-and-darkouspl). – QHarr Mar 29 '19 at 15:28