I have been trying to scrape a value out of a JavaScript block embedded in an HTML page. There is a lot of JavaScript in the page, but I only want to print this one:

var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153",

});

I started with code that looks like this:

test = bs4.find_all('script', {'type': 'text/javascript'})
print(test)

The output is too large to post in full, but one of the script tags contains the JavaScript shown at the top, and I want to print only the var spConfig = new Product.Config part.

How can I print only the var spConfig = new Product.Config(...) part, so that I can later pass it to json.loads and work with the data more easily?

If anything is unclear, please ask in the comments. I would also appreciate any feedback on how I can improve my questions here on Stack Overflow! :)

EDIT:

More examples of what bs4 prints for the script tags:

<script type="text/javascript">var optionsPrice = new Product.Options({
  "priceFormat": {
    "pattern": "%s\u00a0\u20ac",
    "precision": 2,
    "requiredPrecision": 2,
    "decimalSymbol": ",",
    "groupSymbol": "\u00a0",
    "groupLength": 3,
    "integerRequired": 1
  },
  "showBoths": false,
  "idSuffix": "_clone",
  "skipCalculate": 1,
  "defaultTax": 20,
  "currentTax": 20,
  "tierPrices": [

  ],
  "tierPricesInclTax": [

  ],
  "swatchPrices": null
});</script>,
<script type="text/javascript">var spConfig = new Product.Config({
  "attributes": {
    "531": {
      "id": "531",
      "options": [
        {
          "id": "18",
          "hunter": "0",
          "products": [
            "128709"
          ]
        },
        {
          "label": "40 1\/2",
          "hunter": "0",
          "products": [
            "120151"
          ]
        },
        {
          "id": "33",
          "hunter": "0",
          "products": [
            "120152"
          ]
        },
        {
          "id": "36",
          "hunter": "0",
          "products": [
            "128710"
          ]
        },
        {
          "id": "42",
          "hunter": "0",
          "products": [
            "125490"
          ]
        }
      ]
    }
  },

  "Id": "120153"
});</script>,
<script type="text/javascript">document.observe('dom:loaded',
function(){
  var swatchesConfig = new Product.ConfigurableSwatches(spConfig);
});</script>

EDIT update 2:

try:
    product_li_tags = bs4.find_all('script', {'type': 'text/javascript'})
except Exception:
    product_li_tags = []


for product_li_tag in product_li_tags:
   try:
        pat = "product.Config\((.+)\);"
        json_str = re.search(pat, product_li_tag, flags=re.DOTALL).group(1)
        print(json_str)
   except:
       pass

#json.loads(json_str)
print("Nothing")
sys.exit()
Hellosiroverthere
  • Possible duplicate of [BeautifulSoup - extract json from JS](https://stackoverflow.com/questions/43852187/beautifulsoup-extract-json-from-js) – l'L'l Dec 17 '18 at 21:54

2 Answers

You can use the .text attribute to get the content of each tag. Then, if you know that the code you want starts with "var optionsPrice", you can filter for that:

from bs4 import BeautifulSoup

# myhtml is assumed to hold the page's HTML source as a string
soup = BeautifulSoup(myhtml, 'lxml')

script_blocks = soup.find_all('script', {'type': 'text/javascript'})
special_code = ''
for s in script_blocks:
    # keep the first script whose text starts with the variable declaration
    if s.text.strip().startswith('var optionsPrice'):
        special_code = s.text
        break

print(special_code)

EDIT: To answer your question in the comments, there are a couple of different ways to extract the part of the text that holds the JSON. You could use a regular expression to grab everything between the first left parenthesis and the ); at the end. If you want to avoid regular expressions completely, you could do something like:

json_stuff = special_code[special_code.find('(')+1:special_code.rfind(')')]

Then to make a usable dictionary out of it:

import json
j = json.loads(json_stuff)
print(j['defaultTax'])  # This should return a value of 20
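
The same slicing idea can be applied to the spConfig script from the question. What follows is only a minimal sketch under assumptions: extract_json_argument is a hypothetical helper name, and sample_script is a shortened stand-in for the text of the real script tag. It relies on the script holding a single call whose argument is valid JSON.

```python
import json


def extract_json_argument(script_text):
    """Parse the JSON between the first '(' and the last ')' of script_text."""
    start = script_text.find('(')
    end = script_text.rfind(')')
    if start == -1 or end <= start:
        raise ValueError('no parenthesised argument found')
    return json.loads(script_text[start + 1:end])


# Hypothetical, shortened stand-in for the text of the spConfig script tag.
sample_script = """var spConfig = new Product.Config({
  "attributes": {
    "531": {"id": "531", "options": [{"id": "18", "products": ["128709"]}]}
  },
  "Id": "120153"
});"""

config = extract_json_argument(sample_script)
print(config['Id'])
for option in config['attributes']['531']['options']:
    print(option['id'], option['products'])
```

If the script text contains more than one call, or any other parentheses outside the JSON, the slice would capture too much and json.loads would raise an error, so this is best suited to scripts shaped like the ones in the question.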
Bill M.

I can think of three possible options. Which one you use might depend on the size of the project and how flexible you need it to be:

  • Use a regex to extract the objects from the script (fastest, least flexible)

  • Use ANTLR or similar (e.g. pyjsparser) to parse the JavaScript grammar

  • Use Selenium or another headless browser that can interpret the JS for you. With this option, you can have the browser evaluate the script and then ask it for the value of the variable

Regex Example (#1)

>>> import json, re
>>> script_body = """
    var x=product.Config({
        "key": {"a":1}
});
"""
>>> pat = r"product\.Config\((.+)\);"
>>> json_str = re.search(pat, script_body, flags=re.DOTALL).group(1)
>>> json.loads(json_str)
{'key': {'a': 1}}
>>> json.loads(json_str)['key']['a']
1
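
Two things are worth checking when applying the pattern to real script tags. re.search expects a string, so a BeautifulSoup tag has to be converted first (for example through its .string attribute). The pattern is also case-sensitive, so product.Config will not match Product.Config. In addition, with re.DOTALL the greedy (.+) runs up to the last ); in the text, which captures too much when a script holds several statements. As an alternative to the regex (a swapped-in technique, not the approach above), the sketch below uses json.JSONDecoder().raw_decode from the standard library to parse only the object that follows a marker. The helper name parse_object_after and the sample text are hypothetical.

```python
import json


def parse_object_after(script_text, marker):
    """Return the JSON object starting at the first '{' after marker, or None."""
    marker_pos = script_text.find(marker)
    if marker_pos == -1:
        return None
    brace_pos = script_text.find('{', marker_pos)
    if brace_pos == -1:
        return None
    # raw_decode parses one JSON value from the start of the string and
    # returns it together with the index where that value ended.
    obj, _end = json.JSONDecoder().raw_decode(script_text[brace_pos:])
    return obj


# Hypothetical script text with two statements in one script.
script_text = """
var spConfig = new Product.Config({"Id": "120153", "attributes": {}});
var other = new Product.Options({"defaultTax": 20});
"""

print(parse_object_after(script_text, 'Product.Config('))
print(parse_object_after(script_text, 'Product.Options('))
print(parse_object_after(script_text, 'Missing('))
```

Because raw_decode stops at the end of the first complete JSON value, whatever follows the object is ignored. If the text after the brace is not valid JSON, it raises json.JSONDecodeError.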
gtalarico