
I started writing a scraper to collect data on cars from the site. It turned out that the data structure varies from page to page: sellers do not fill in all the fields, so the set of fields differs between ads, and as a result the values end up in the wrong columns of the CSV file.

Example pages:

https://www.olx.ua/obyavlenie/prodam-voikswagen-touran-2011-goda-IDBzxYq.html#87fcf09cbd

https://www.olx.ua/obyavlenie/fiat-500-1-4-IDBjdOc.html#87fcf09cbd

Data example (image)

One approach was to check the field name with text() = "Category name", but I'm not sure how to write each result to the correct cell.

I also used Chrome's built-in developer tools: the command document.getElementsByClassName('margintop5')[0].innerText printed the whole contents of the table, but the result is not structured.

So would getting the output in JSON format solve my problem?

innerText result

In addition, when I studied the page source, I came across a JavaScript block in which all the necessary data is already structured, but I do not know how to extract it.

                 <script type="text/javascript">
                var GPT = GPT || {};
                GPT.targeting = {"cat_l0":"transport","cat_l1":"legkovye-avtomobili","cat_l2":"volkswagen","cat_l0_id":"1532","cat_l1_id":"108","cat_l2_id":"1109","ad_title":"volkswagen-jetta","ad_img":"https:\/\/img01-olxua.akamaized.net\/img-olxua\/676103437_1_644x461_volkswagen-jetta-kiev.jpg","offer_seek":"offer","private_business":"private","region":"ko","subregion":"kiev","city":"kiev","model":["jetta"],"modification":[],"motor_year":[2006],"car_body":["sedan"],"color":["6"],"fuel_type":["543"],"motor_engine_size":["1751-2000"],"transmission_type":["546"],"motor_mileage":["175001-200000"],"condition":["first-owner"],"car_option":["air_con","climate-control","cruise-control","electric_windows","heated-seats","leather-interior","light-sensor","luke","on-board-computer","park_assist","power-steering","rain-sensor"],"multimedia":["acoustics","aux","cd"],"safety":["abs","airbag","central-locking","esp","immobilizer","servorul"],"other":["glass-tinting"],"cleared_customs":["no"],"price":["3001-5000"],"ad_price":"4500","currency":"USD","safedealads":"","premium_ad":"0","imported":"0","importer_code":"","ad_type_view":"normal","dfp_user_id":"e3db0bed-c3c9-98e5-2476-1492de8f5969-ver2","segment":[],"dfp_segment_test":"76","dfp_segment_test_v2":"46","dfp_segment_test_v3":"46","dfp_segment_test_v4":"32","adx":["bda2p24","bda1p24","bdl2p24","bdl1p24"],"comp":["o12"],"lister_lifecycle":"0","last_pv_imps":"2","user-ad-fq":"2","ses_pv_seq":"1","user-ad-dens":"2","listingview_test":"1","env":"production","url_action":"ad","lang":"ru","con_inf":"transportxxlegkovye-avtomobilixx46"};

data in json dict

How can I get the data from these pages using Python and Scrapy?

Rahul Agarwal
Nick

1 Answer

You can do it by extracting the JS code from the <script> block, using a regex to capture only the JS object that holds the data, and then loading it with the json module:

import json

# Inside the spider callback, where `response` is the page response
query = 'script:contains("GPT.targeting = ")::text'
js_code = response.css(query).re_first('targeting = ({.*});')
data = json.loads(js_code)

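This way, data is a Python dict containing the data from the JS object. It works because the object literal on this page happens to be valid JSON (double-quoted keys and strings).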
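To try the pattern without running a spider, here is a minimal sketch that uses only the standard library. The html string is a hypothetical, shortened stand-in for the real page source, and the braces in the pattern are escaped for clarity; otherwise it is the same regex.

```python
import json
import re

# Hypothetical, shortened page source; a real page carries many more keys.
html = '''
<script type="text/javascript">
var GPT = GPT || {};
GPT.targeting = {"ad_title":"volkswagen-jetta","model":["jetta"],"motor_year":[2006],"ad_price":"4500","currency":"USD"};
</script>
'''

# The greedy .* stops at the last "};" on the line, because "." does not
# match newlines by default.
match = re.search(r'targeting = (\{.*\});', html)
data = json.loads(match.group(1)) if match else {}

print(data["ad_title"], data["motor_year"][0], data["ad_price"])
# prints: volkswagen-jetta 2006 4500
```

Note that many values (model, motor_year, and so on) are lists, so they need to be indexed or joined before being written to a flat file.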

More about the re_first method here: https://doc.scrapy.org/en/latest/topics/selectors.html#using-selectors-with-regular-expressions
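As for the values landing in the wrong CSV columns: once each ad is a dict, one possible fix is to write rows by key with a fixed list of columns, so a missing field becomes an empty cell instead of shifting the others. This is only a sketch; the FIELDNAMES list, the to_row helper, and the two sample ads are hypothetical.

```python
import csv
import io

# Hypothetical column set; choose the keys you actually need.
FIELDNAMES = ["ad_title", "model", "motor_year", "car_body", "ad_price", "currency"]


def to_row(data):
    """Keep only the chosen keys and join list values into one cell."""
    row = {}
    for key in FIELDNAMES:
        value = data.get(key, "")  # missing field -> empty cell
        if isinstance(value, list):
            value = "|".join(str(v) for v in value)
        row[key] = value
    return row


# Two hypothetical ads; the second one lacks "model" and "car_body".
ads = [
    {"ad_title": "volkswagen-jetta", "model": ["jetta"], "motor_year": [2006],
     "car_body": ["sedan"], "ad_price": "4500", "currency": "USD"},
    {"ad_title": "fiat-500", "motor_year": [2008], "ad_price": "5200",
     "currency": "USD"},
]

buffer = io.StringIO()  # stands in for an open CSV file
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
for ad in ads:
    writer.writerow(to_row(ad))

print(buffer.getvalue())
```

In a Scrapy spider the same idea should apply if you yield the dict returned by to_row as the item and fix the column order with the FEED_EXPORT_FIELDS setting.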

Valdir Stumm Junior