0

I am trying to scrape some attributes from an e-commerce website, but the data is not stored in the HTML itself; it is stored in a JavaScript `<script>` tag.

I am trying to get `productId`, `product`, and `brand` from the script tag.

import requests
from bs4 import BeautifulSoup

base_url = "https://www.myntra.com/men-formal-shirts?f=Collar%3AButton-Down%20Collar"

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

scripts = soup.find_all('script')[8]
scripts
james joyce
  • What is the issue, exactly? Have you tried anything, done any research? You haven't even shared what the data actually looks like. – AMC Mar 05 '20 at 16:18

2 Answers

3
import requests
from bs4 import BeautifulSoup
import json
import pyjsparser

r = requests.get(
    "https://www.myntra.com/men-formal-shirts?f=Collar%3AButton-Down%20Collar&p=1")

soup = BeautifulSoup(r.text, 'html.parser')

script = soup.findAll("script")[8].text

tree = pyjsparser.parse(script)

print(tree.keys())
  • @chitown88 Yeah, it's new to both of us :D We were working on it yesterday: https://stackoverflow.com/questions/60520118/how-to-scrape-phone-no-using-python-when-it-show-after-clicked This is our second day using it. – αԋɱҽԃ αмєяιcαη Mar 05 '20 at 16:24
  • @chitown88 @αԋɱҽԃαмєяιcαη `pyjsparser` is new to me too :D I've already installed it to see what it can do. BTW: today I also saw [js2xml](https://stackoverflow.com/a/39306382/1832058), which parses JavaScript into an XML ETree so you can use `xpath` to search for elements. – furas Mar 06 '20 at 07:41
  • @furas Oh, glad to see you. Your blog solved a bigger problem for me yesterday when I reviewed it: [decode issue](https://blog.furas.pl/python-why-requests-incorrectly-decodes-text-instead-of-utf8-gb.html). Well done. I also reported that to the requests team on GitHub 10 days ago. – αԋɱҽԃ αмєяιcαη Mar 06 '20 at 07:43
  • Glad to hear the blog is sometimes useful. After [a question about this problem](https://stackoverflow.com/questions/60505000/python-lxml-cant-parse-japanese-in-some-case/) I was thinking of reporting it too, but in the end I gave up. I have run into this problem only once or twice, so it wasn't that common for me. – furas Mar 06 '20 at 08:11
  • @furas It usually happens, especially for Windows users. – αԋɱҽԃ αмєяιcαη Mar 06 '20 at 08:13
2

Get the script's content as text and remove `window.__myx = ` from the beginning. What remains is valid JSON, which you can convert to a Python dictionary with the standard `json` module.

Then you can use the dictionary keys and a `for` loop to get the information:

import requests
from bs4 import BeautifulSoup
import json

base_url = "https://www.myntra.com/men-formal-shirts?f=Collar%3AButton-Down%20Collar"

r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'html.parser')

# get .text
scripts = soup.find_all('script')[8].text

# remove window.__myx = 
script = scripts.split('=', 1)[1]

# convert to dictionary
data = json.loads(script)

for item in data['searchData']['results']['products']:
    print('product:', item['product'])
    print('productId:', item['productId'])
    print('brand:', item['brand'])
    print('---')

Result:

product: Louis Philippe Men White & Blue Slim Fit Checked Formal Shirt
productId: 11390900
brand: Louis Philippe
---
product: Hancock Men White Slim Fit Solid Formal Shirt
productId: 7460073
brand: Hancock
---
product: INVICTUS Men Navy Slim Fit Printed Semiformal Shirt
productId: 6970620
brand: INVICTUS
---
product: next Men White Slim Fit Solid Formal Shirt
productId: 11067410
brand: next
---
product: INVICTUS Men White & Green Slim Fit Printed Semiformal Shirt
productId: 2314014
brand: INVICTUS
---
product: Dazzio Men Black Modern Slim Fit Solid Formal Shirt
productId: 3009355
brand: Dazzio
---

etc.
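The fixed index `[8]` only works while the page keeps that script in the ninth position. As a hedged alternative, here is a minimal sketch that searches the raw page text for the `window.__myx` marker instead and lets `json.JSONDecoder.raw_decode` stop at the end of the JSON object. The function name `extract_products` and the `sample_html` string are made up for illustration; the sample only stands in for `r.text`, and the key path `searchData -> results -> products` is assumed to match the page as it was when this answer was written.

```python
import json

MARKER = "window.__myx"  # variable name used by the page in the code above


def extract_products(html, marker=MARKER):
    """Return the product list embedded in the page's inline JavaScript.

    Hypothetical helper: it assumes the JSON object starts at the first
    "{" after the marker.
    """
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"{marker!r} not found in page")
    # Skip to the first "{" after the marker, where the JSON object begins.
    brace = html.find("{", start)
    if brace == -1:
        raise ValueError("no JSON object found after the marker")
    # raw_decode parses exactly one JSON value starting at `brace` and
    # ignores whatever follows it (a trailing ";" or "</script>").
    data, _ = json.JSONDecoder().raw_decode(html, brace)
    return data["searchData"]["results"]["products"]


# Small stand-in for r.text so the sketch runs without network access.
sample_html = (
    '<html><script>var x = 1;</script>'
    '<script>window.__myx = {"searchData": {"results": {"products": ['
    '{"productId": 11390900, "brand": "Louis Philippe", '
    '"product": "Louis Philippe Men White & Blue Slim Fit Checked Formal Shirt"}'
    ']}}};</script></html>'
)

products = extract_products(sample_html)
for item in products:
    print("product:", item["product"])
    print("productId:", item["productId"])
    print("brand:", item["brand"])
    print("---")
```

With the real page you would call `extract_products(r.text)`. If the site renames the variable or changes the JSON layout, the function raises `ValueError` or `KeyError` instead of silently reading the wrong script.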
furas