
I'm aiming to scrape this URL.

Each item in the list links to a page with more information about it, and I aim to scrape all of the roughly 17,000 linked pages. Only 10 results are shown at a time; the "View more" button makes a request that appends, via JSON, 10 more results to the list. I've tried modifying the request by changing batchsize, the parameter that defines the number of results returned, but that didn't work. I've also tried to use this code (from a tutorial), but couldn't adapt it to my specific task:

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes?page=%s'
    start_urls = [quotes_base_url % 1]
    download_delay = 1.5

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            yield {
                'text': item.get('text'),
                'author': item.get('author', {}).get('name'),
                'tags': item.get('tags'),
            }
        if data['has_next']:
            next_page = data['page'] + 1
            yield scrapy.Request(self.quotes_base_url % next_page)

I've looked at several similar examples. However, after two days of trying, I still cannot figure out how to solve this, because the URL request on the site I wish to scrape differs from all of the examples, and it seems they've made it more difficult to scrape...

The request made by hitting view more is the following:

Request URL: https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname&cs=false&location=&p=2&q=&s=name&sortorder=name&st=4af2ed43-1154-4363-ae6b-718f9b84d23a

The p= parameter increases incrementally each time View more is clicked.

The returned JSON has the following format:

{"Heading":"17952 träffar på Alla mottagningar","Query":"","Region":null,"NextPage":3,"Page":2,"BatchSize":10,"BatchText":"Visa 10 till","TotalHits":17952,"SortOrder":"name","Latitude":0.0,"Longitude":0.0,"Bounds":null,"SearchHits":[{"HsaId":"SE162321000255-O23228","FriendlyUrl":"/hitta-vard/kontaktkort/A5-Psykoterapi-Katia-Karlsson-Carli-AB-Lund/","DisplayName":"A5 Psykoterapi Katia Karlsson Carli AB, Lund","Address":"Stortorget 1, Lund","PhoneNumber":"073-046 26 68","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":55.703161529482479,"Longitude":13.193039057187006},{"HsaId":"SE162321000255-O22542","FriendlyUrl":"/hitta-vard/kontaktkort/A5Psykoterapi-Gunilla-Lundqvist-Lund/","DisplayName":"A5Psykoterapi - Gunilla Lundqvist, Lund","Address":"Stortorget 1 5:e vån, Lund","PhoneNumber":"070-624 13 97","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":55.703161529482479,"Longitude":13.193039057187006},{"HsaId":"SE2321000057-6SV4","FriendlyUrl":"/hitta-vard/kontaktkort/A6-Ogonklinik-AB/","DisplayName":"A6 Ögonklinik AB","Address":"Batterigatan 9 NB, Jönköping","PhoneNumber":"036-860 20 30","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":57.768032303027383,"Longitude":14.202798620555548},{"HsaId":"SE162321000024-0059892","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Evelina-Linder-KBT/","DisplayName":"AB Evelina Linder KBT","Address":"Drottninggatan 1A, Uppsala","PhoneNumber":"073-593 00 73","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.858328320441558,"Longitude":17.638292776307694},{"HsaId":"SE162321000024-0052597","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Forsberg-KBT-konsult/","DisplayName":"AB Forsberg KBT-konsult","Address":"Trädgårdsgatan 5A, Uppsala","PhoneNumber":"070-818 17 11","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.856845411620185,"Longitude":17.635819529969204},{"HsaId":"SE2321000016-C7H4","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Lyhord-Ostermalmstorg/","DisplayName":"AB Lyhörd - Östermalmstorg","Address":"Östermalmstorg 1,STOCKHOLM","PhoneNumber":"08-425 004 00","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.336237708592563,"Longitude":18.079317099784653},{"HsaId":"SE2321000016-BH0B","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Suavis-horsel-Solna-Business-park/","DisplayName":"AB Suavis hörsel, Solna Business park","Address":"Svetsarvägen 15,2 tr,SOLNA","PhoneNumber":"010-207 11 77","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.35928477168008,"Longitude":17.980058512140353},{"HsaId":"SE2321000016-56DM","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Vackra-Tander-Annette-Goransson/","DisplayName":"AB Vackra Tänder Annette Göransson","Address":"Drottninggatan 71A,STOCKHOLM","PhoneNumber":"08-21 52 62","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33592153903674,"Longitude":18.059258535271329},{"HsaId":"SE5564844115-106Q","FriendlyUrl":"/hitta-vard/kontaktkort/AB-Vackra-Tander-Norrmalm/","DisplayName":"AB Vackra Tänder, Norrmalm","Address":"Drottninggatan 71 A, 3 tr,","PhoneNumber":"08-21 52 
62","HasMvkServices":false,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33592396728109,"Longitude":18.059118082991937},{"HsaId":"SE2321000016-97P2","FriendlyUrl":"/hitta-vard/kontaktkort/ABA-Ogonklinik-i-Alvik/","DisplayName":"ABA Ögonklinik i Alvik","Address":"Tranebergsplan 3,,BROMMA","PhoneNumber":"08-124 440 10","HasMvkServices":true,"VaccinatesForFlu":false,"VaccinatesForHpv":false,"Distance":0.0,"Latitude":59.33516807973394,"Longitude":17.978288641135208}],"HasZeroHits":false}

I'd be grateful for some initial lines of code that would get me going.

Adam Robinsson
  • In order to avoid that people need to go and do the same "digging through the details" that you did in order to gain the knowledge necessary to understand your question, please share all the technical details you have found out: what URLs, how they are connected, and what data you want to scrape exactly. Everything you know so far. – Tomalak Feb 09 '20 at 10:02
  • Technical details. Write out the URLs. Write out the JSON examples, show HTML snippets. Everything that is necessary to understand the problem. No screenshots, show everything in plain text in your question. Don't only link external content, if they change anything then your question and any answers won't make any sense anymore. – Tomalak Feb 09 '20 at 10:07
  • Thanks Tomalak, you're absolutely right. I've added the info to the best of my ability. – Adam Robinsson Feb 09 '20 at 10:19
  • That's a lot better indeed. – Tomalak Feb 09 '20 at 10:29
  • I've looked some more into this, and your task seems to be a good candidate for using the [1177.se API](https://www.1177.se/om-1177-vardguiden/1177-vardguiden-pa-webben/intresseanmalan-for-en-api-nyckel/) instead of for HTML scraping. Before you sink more time into that, consider getting an API key and using the officially supported way of interacting with their database. It will also be much easier to write code against the API. – Tomalak Feb 10 '20 at 13:26

2 Answers


This code may or may not work, but this is the approach I would take given the problem you're facing. You can put {} into the start URL and fill the page number in with .format(). Also, once you json.loads() the response body you're dealing with plain Python dicts rather than Scrapy selectors, so there is no need for selector-style .get() calls, and the keys have to match the actual response (SearchHits, NextPage and so on) rather than the tutorial's quotes fields.

import json
import scrapy


class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    start_urls = ['https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname&cs=false&location=&p={}&q=&s=name&sortorder=name&st=4af2ed43-1154-4363-ae6b-718f9b84d23a']

    def start_requests(self):
        # You may also need to replicate the headers used in the requests made to this URL.
        yield scrapy.Request(self.start_urls[0].format(1))

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['SearchHits']:
            # plain dict access: these are parsed JSON objects, not Scrapy selectors
            yield {
                'name': item['DisplayName'],
                'address': item['Address'],
                'url': response.urljoin(item['FriendlyUrl']),
            }
        if data['NextPage']:
            # NextPage already contains the number of the next page (null on the last page)
            yield scrapy.Request(self.start_urls[0].format(data['NextPage']), callback=self.parse)
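
Since the goal is ultimately the linked detail pages rather than the list itself, you could also follow each FriendlyUrl from the JSON into a second callback. A rough sketch of that variant, replacing the parse method above (the h1 selector in parse_detail is a placeholder, since the detail-page HTML isn't shown in the question):

    def parse(self, response):
        data = json.loads(response.body)
        for item in data['SearchHits']:
            # follow the relative link to each clinic's own page
            yield scrapy.Request(response.urljoin(item['FriendlyUrl']),
                                 callback=self.parse_detail,
                                 cb_kwargs={'listing': item})
        if data['NextPage']:
            yield scrapy.Request(self.start_urls[0].format(data['NextPage']), callback=self.parse)

    def parse_detail(self, response, listing):
        # keep the fields from the listing and add whatever the detail page provides
        yield {
            'name': listing['DisplayName'],
            'address': listing['Address'],
            'url': response.url,
            'heading': response.css('h1::text').get(),  # placeholder selector
        }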
ThePyGuy

This should do the trick:

import json
import scrapy

Headerz = {
    'accept': 'text/html, */*; q=0.01',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'pragma': 'no-cache',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}

class SpidyQuotesSpider(scrapy.Spider):
    name = 'spidyquotes'
    start_urls = ['https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname&cs=false&location=&p={}&q=&s=name&sortorder=name&st=4af2ed43-1154-4363-ae6b-718f9b84d23a']

    def start_requests(self):
        # You may also need to replicate the headers used in the requests made to this URL.
        yield scrapy.Request(self.start_urls[0].format('1'), headers=Headerz)

    def parse(self, response):
        data = json.loads(response.body)
        # the parsed JSON is now in `data`; extract whatever you need from it here
        try:
            # paginate: NextPage holds the next page number, or null on the last batch
            if data['NextPage'] is not None:
                nextpage_number = data['NextPage']
                nexturl = self.start_urls[0].format(nextpage_number)
                yield scrapy.Request(nexturl, headers=Headerz)
        except KeyError:
            pass

The trick here is to use proper headers!
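
If you want to confirm that the headers really are what makes the difference, a quick standalone check with the requests library might look like this (just a sanity test, not part of the spider; the user-agent string is the one from the Headerz dict above):

import requests

URL = ('https://www.1177.se/api/hjv/search?batchsize=10&caretype=&componentname'
       '&cs=false&location=&p=1&q=&s=name&sortorder=name'
       '&st=4af2ed43-1154-4363-ae6b-718f9b84d23a')

browser_headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
}

# Compare the response for default python-requests headers against browser-like headers.
print('default headers:     ', requests.get(URL).status_code)
print('browser-like headers:', requests.get(URL, headers=browser_headers).status_code)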

Janib Soomro