
I want to extract the reviewCount values from the script tag using Beautiful Soup. Tried different approach but didn't succeed.

<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
martineau
free_123
  • _Tried different approach but didn't succeed._ Can you share those attempts? From the tag you shared, it seems that all you need to do is get the contents of the tag and parse the result. If you're struggling with extracting the content from the element, this is a duplicate of [Extract content within a tag with BeautifulSoup](https://stackoverflow.com/questions/5999407/extract-content-within-a-tag-with-beautifulsoup). If the issue is parsing the JSON, this is a duplicate of [How to parse JSON in Python?](https://stackoverflow.com/questions/7771011/how-to-parse-json-in-python). – AMC Apr 14 '20 at 22:19

3 Answers


This should work, though I am sure there is a more elegant approach:

import json
from bs4 import BeautifulSoup

html = '''
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>
'''

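soup = BeautifulSoup(html, 'html.parser')
# Grab the first <script> tag; its only child is the JSON text.
res = soup.find('script')
json_object = json.loads(res.contents[0])

# Print each language's display name alongside its review count.
for language in json_object['languages']:
    print('{}: {}'.format(language['displayName'], language['reviewCount']))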

Output:

Toutes les langues: 573
français: 567
English: 6
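
The reviewCount values in this JSON are strings, so they need converting before any arithmetic or comparison. As a minimal sketch, assuming the script tag's text has already been extracted, the helper below (the name `counts_by_language` is made up for illustration) builds a dictionary keyed by `isoCode` with integer counts:

```python
import json


def counts_by_language(json_text):
    """Map each language's isoCode to its reviewCount as an int."""
    data = json.loads(json_text)
    return {lang["isoCode"]: int(lang["reviewCount"]) for lang in data["languages"]}


# Same JSON payload as in the question, as it would come out of the script tag.
json_text = '''
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
'''

counts = counts_by_language(json_text)
print(counts["all"])  # total across all languages
```

Looking up `counts["fr"]` or `counts["en"]` then gives a single language's count without looping.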
James Powis

Import json, parse the script tag's text with json.loads, and then iterate over the languages to get each reviewCount.

import json
from bs4 import BeautifulSoup

html='''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

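soup = BeautifulSoup(html, "html.parser")
# Select the script tag by its data-initial-state attribute and take its text.
script_text = soup.select_one('script[data-initial-state="review-filter"]').text
jsondata = json.loads(script_text)
for language in jsondata['languages']:
    print(language['reviewCount'])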

Output:

573
567
6
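
On a real page the tag may be missing or empty, in which case `select_one` returns None and the attribute access fails. A defensive sketch, assuming the same `data-initial-state="review-filter"` selector (the function name `extract_review_filter` and the shortened sample are hypothetical), might look like this:

```python
import json

from bs4 import BeautifulSoup


def extract_review_filter(html):
    """Return the parsed review-filter JSON, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one('script[data-initial-state="review-filter"]')
    # select_one returns None when nothing matches; .string is None for an empty tag.
    if tag is None or not tag.string:
        return None
    try:
        return json.loads(tag.string)
    except json.JSONDecodeError:
        return None


# Shortened sample for illustration only.
sample = (
    '<script type="application/json" data-initial-state="review-filter">'
    '{"languages":[{"isoCode":"en","reviewCount":"6"}]}'
    '</script>'
)

data = extract_review_filter(sample)
print(data["languages"][0]["reviewCount"])
print(extract_review_filter("<p>no script here</p>"))
```

Returning None rather than raising is a design choice for this sketch; raising a descriptive exception would work just as well if a missing tag should stop the scrape.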
KunduK
import re

html = '''<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''


# Capture the value that follows each "reviewCount" key in the raw HTML.
matches = re.findall(r'reviewCount":"(.+?)"', html)

print(matches)

Output:

['573', '567', '6']
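
A pattern run over the whole page will also pick up any other `reviewCount` that happens to appear elsewhere in the HTML. One possible refinement, sketched here with only the standard library, is to have the regex isolate the review-filter script block first and then parse its body as JSON. The extra `data-initial-state="other"` script tag below is a made-up example to show that it is skipped:

```python
import json
import re

html = '''<script type="application/json" data-initial-state="other">
{"reviewCount":"999"}
</script>
<script type="application/json" data-initial-state="review-filter">
{"languages":[{"isoCode":"all","displayName":"Toutes les langues","reviewCount":"573"},{"isoCode":"fr","displayName":"français","reviewCount":"567"},{"isoCode":"en","displayName":"English","reviewCount":"6"}],"selectedLanguages":["all"],"selectedStars":null,"selectedLocationId":null}
</script>'''

# Match only the script tag whose data-initial-state is "review-filter"
# and capture everything up to its closing tag.
pattern = re.compile(
    r'<script[^>]*data-initial-state="review-filter"[^>]*>(.*?)</script>',
    re.DOTALL,
)

counts = []
found = pattern.search(html)
if found:
    data = json.loads(found.group(1))
    counts = [lang["reviewCount"] for lang in data["languages"]]

print(counts)
```

This still relies on the attribute being written exactly as shown; if the markup varies, an HTML parser is the more robust choice.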