I am trying to extract structured data from a JSON-like object embedded in a script tag of an HTML page. I retrieved the HTML, selected the script text via XPath, and passed it to json.loads:

json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first())

The data starts like this:

response.xpath('//*[@id="product"]/script[2]/text()').extract_first()
"\r\ndataLayer.push({\r\n\t'event': 'EECproductDetailView',\r\n\t'ecommerce': {\r\n\t\t'detail': {\r\n\r\n\t\t\t'products': [{\r\n\t\t\t\t'id': '14171171',\r\n\t\t\t\t'name': 'Gingium 120mg',\r\n\t\t\t\t'price': '27.9',\r\n\r\n\t\t\t\t'brand': 'Hexal AG',\r\n\r\n\r\n\t\t\t\t'variant': 'Filmtabletten, 60 Stück, N2',\r\n\r\n\r\n\t\t\t\t'category': 'gedaechtnis-konzentration'\r\n\t\t\t}]\r\n\t\t}\r\n\t}\r\n});\r\n"

The same kind of script, formatted for readability:

<script>
dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {

            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});
</script>

The error message is:

>>> json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first())
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)

I also tried to decode the string first:

>>> json.loads(response.xpath('//*[@id="product"]/script[2]/text()').extract_first().decode("utf-8"))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
>>>

How can I pull the product data into a Python dictionary?

merlin
  • Take a look at this [post](https://stackoverflow.com/questions/19483351/converting-json-string-to-dictionary-not-list) – pheeper Apr 16 '20 at 20:14
  • Thank you and sorry I believe I did not ask the right question. Edited question to make the problem more clear. – merlin Apr 16 '20 at 20:43

1 Answer

There are several issues with your approach, which I discuss below. You want to parse the argument passed to the push function as JSON, and this is your input:

dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {

            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});

Issues:

  1. The string is JavaScript source, not JSON, so you shouldn't pass it directly to json.loads. First extract the {'event' ... } object from the string with a regex or plain string slicing. For example, if the format is always like this and the script contains no other {} blocks, take the index of the first { and the index of the last } and slice out the substring between them.
  2. The data uses ' as the string delimiter, but the JSON standard requires double quotes ("). You need to replace those as well.

After resolving these issues, you can use json.loads to parse the input.
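The two steps can be sketched roughly as follows. This is a minimal illustration, not a robust parser: the helper name extract_pushed_object is made up for this example, the input is the sample script from the question, and it assumes that the script contains only one object literal and that no value contains a quote character.

```python
import json

# Sample input copied from the question: the text of the <script> element.
raw = """
dataLayer.push({
    'event': 'EECproductDetailView',
    'ecommerce': {
        'detail': {

            'products': [{
                'id': '14122171',
                'name': 'test',
                'price': '27.9'
            }]
        }
    }
});
"""


def extract_pushed_object(script_text):
    """Hypothetical helper: slice out the object literal and parse it as JSON."""
    # Step 1: keep only the text from the first "{" to the last "}".
    start = script_text.index("{")
    end = script_text.rindex("}")
    obj_text = script_text[start:end + 1]

    # Step 2: JSON requires double-quoted strings.
    # Assumption: no key or value contains an apostrophe or a double quote.
    obj_text = obj_text.replace("'", '"')

    return json.loads(obj_text)


data = extract_pushed_object(raw)
product = data["ecommerce"]["detail"]["products"][0]
print(product["id"], product["price"])
```

The blanket quote replacement is the fragile part: a product name containing an apostrophe would break it. Because this particular object also happens to be a valid Python literal, ast.literal_eval on the sliced substring may be a possible alternative that avoids the replacement step, but that will not hold for JavaScript objects in general (for example ones using true, false, or null).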

Amin Rezaei
  • Thank you. That helped me to get the value. I am using: re.findall(r"(?<='id': ')\d{6,8}", info) I hope that this will stay robust. – merlin Apr 16 '20 at 21:40