I am fairly new to Python and I am trying to get some data from a website. When I run the code below, the values from the page are printed with single quotes (apostrophes), which is not valid JSON.

The output looks something like this:

[{'companyId': 1,
                                              'companyPhotoId': 9120,
                                              'description': 'Pracovní '
                                                             'prostory',
                                              'fileId': '4ec99adf-f89b-481d-8f6d-3d2f49b1f1f1',
                                              'isThumbHorizontal': False,
                                              'order': 1,
                                              'thumbnailFileId': 'e00c3c9c-55d3-4bad-bd5a-d485bfab2986'},
                                             {'companyId': 1,
                                              'companyPhotoId': 9121,
                                              'description': 'mDevcamp 2018',
                                              'fileId': '089dfef5-5c89-4e56-ad49-c6458d258a3f',
                                              'isThumbHorizontal': False,
                                              'order': 2,
                                              'thumbnailFileId': '411cbd66-dbb4-4385-8ae9-cc89f8787346'},
                                             {'companyId': 1,
                                              'companyPhotoId': 9122,
                                              'description': 'Kancl 2018',
                                              'fileId': 'fcdadaeb-3960-45be-b575-0a0be34a73bc',
                                              'isThumbHorizontal': True,
                                              'order': 3,
                                              'thumbnailFileId': '7cd162e9-1d18-4629-b685-9b4246637fef'}]

This is the code I run:

import scrapy
from pprint import pprint
import json

class Project1SpiderSpider(scrapy.Spider):
    name = 'project1-spider'
    allowed_domains = ['somewebsite']
    start_urls = ['somewebsite'.format(i + 1) for i in range(2000)]

    def parse(self, response):
        results = json.loads(response.body)
        pprint(results)

I need to get the output with double quotes instead, like this:

[{"companyId": 1,
                                              "companyPhotoId": 9120,
                                              "description": "Pracovní "
                                                             "prostory",
                                              "fileId": "4ec99adf-f89b-481d-8f6d-3d2f49b1f1f1",
                                              "isThumbHorizontal": False,
                                              "order": 1,
                                              "thumbnailFileId": "e00c3c9c-55d3-4bad-bd5a-d485bfab2986"},
                                             {"companyId": 1,
                                              "companyPhotoId": 9121,
                                              "description": "mDevcamp 2018",
                                              "fileId": "089dfef5-5c89-4e56-ad49-c6458d258a3f",
                                              "isThumbHorizontal": False,
                                              "order": 2,
                                              "thumbnailFileId": "411cbd66-dbb4-4385-8ae9-cc89f8787346"},
                                             {"companyId": 1,
                                              "companyPhotoId": 9122,
                                              "description": "Kancl 2018",
                                              "fileId": "fcdadaeb-3960-45be-b575-0a0be34a73bc",
                                              "isThumbHorizontal": True,
                                              "order": 3,
                                              "thumbnailFileId": "7cd162e9-1d18-4629-b685-9b4246637fef"}]

Could you please show me how the code should look instead?

Thank you very much

Hartemajz
    I do not quite understand your problem. Where is the difference in the two json examples? Does your code produce an exception of some kind? How do you know what the expected answer is? – Lydia van Dyke Apr 05 '20 at 13:08

1 Answer

json.loads(response.body) parses the JSON string into a Python object (here, a list of dicts). You see single quotes because pprint prints the Python representation of that object, not JSON.

To get the output you want, either print the original JSON text with print(response.text) (in Scrapy, response.body is bytes, so printing it directly shows a b'...' literal), or, if you want it pretty-printed, convert the Python object back into a JSON string with indentation: print(json.dumps(results, indent=2)).

    def parse(self, response):
        # Get a python object
        results = json.loads(response.body)
        # Pretty print the json
        print(json.dumps(results, indent=2))
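
To see the difference outside of Scrapy, here is a minimal standalone sketch. The `raw` string is made-up sample data modelled on the output in the question, not the real response; the variable names are only for illustration.

```python
import json
from pprint import pformat

# Made-up sample standing in for the response text (an assumption, not real data)
raw = '[{"companyId": 1, "description": "Kancl 2018", "isThumbHorizontal": true}]'

# json.loads turns the JSON text into Python objects (a list of dicts)
results = json.loads(raw)

# pformat/pprint show the Python repr: single quotes, True/False
as_repr = pformat(results)

# json.dumps serializes back to JSON text: double quotes, true/false
as_json = json.dumps(results, indent=2)

print(as_repr)
print(as_json)
```

The first print shows the Python form with single quotes; the second is valid JSON that another JSON parser can read back.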
Yosua
  • Thank you so much!!! It works. Could you please also help me get the output as readable Unicode and write the data to a JSON file? The values in the body look like par\u0165\u00e1ka – Hartemajz Apr 05 '20 at 13:20
  • For unicode maybe you can refer to https://stackoverflow.com/questions/18337407/saving-utf-8-texts-in-json-dumps-as-utf8-not-as-u-escape-sequence – Yosua Apr 05 '20 at 13:53
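
For the follow-up about \u escapes and saving to a file, the linked question comes down to passing ensure_ascii=False to json.dumps / json.dump. A minimal sketch, assuming sample data in place of the real response and a hypothetical output path in the temp directory:

```python
import json
import os
import tempfile

# Sample data standing in for the parsed response (an assumption)
results = [{"description": "par\u0165\u00e1ka"}]

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(results)

# ensure_ascii=False keeps the characters themselves in the output
readable = json.dumps(results, ensure_ascii=False, indent=2)
print(readable)

# Hypothetical output path; open the file as UTF-8 so the characters are stored as-is
out_path = os.path.join(tempfile.gettempdir(), "results.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```

Both forms are valid JSON and load back to the same data; ensure_ascii=False only changes how the text looks. Note that the spider in the question requests many URLs, so writing one fixed file inside parse would overwrite it on every response; collecting the results first, or yielding items and using Scrapy's feed exports, may fit better.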