Extracting a value of an object key from inside the script element in the HTML

Question

<div class="heading-dom view">
    <script type="application/javascript">
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
        <!--window.agency = < %- JSON.stringify({}) % >-->
     </script>
</div>

My desired output is 4243456. How can I extract it using lxml or BeautifulSoup?

score 7 · Accepted Answer · answered Dec 15 '17 at 14:42

This is an interesting problem overall: extracting a value from JavaScript code that is itself embedded in HTML.

It means you first need to parse the HTML: locate the desired script element and get its text. The second step is to extract the desired number from inside the realty object.

If you go for a regular expression approach, you can reuse a single regular expression both to locate the desired script element and to extract the desired value (BeautifulSoup allows regular expression patterns to be used to find/filter elements):

import re
from bs4 import BeautifulSoup


html = """
 <div class="heading-dom view">
     <script type="application/javascript">
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
        <!--window.agency = < %- JSON.stringify({}) % >-->
     </script>
</div>"""

pattern = re.compile(r'\{"user_id"\s*:\s*(\d+)\}')
soup = BeautifulSoup(html, "html.parser")
# "string" is the current name of the argument formerly called "text"
script = soup.find("script", string=pattern)

print(pattern.search(script.text).group(1))
# prints 4243456

Let's break down \{"user_id"\s*:\s*(\d+)\}:

the backslashes before the curly braces escape characters that have a special meaning in regular expression syntax
\s* means "zero or more whitespace characters" (it is there in case there are extra spaces around the : in the object definition)
\d+ means "one or more digits"
parentheses define a capturing group, which lets us extract a specific part of the matched string and access it via .group(1)

Note that the simple \d+ expression suggested by @Evyatar is too broad and you may easily get false positives.
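To illustrate the false-positive risk, here is a minimal sketch that compares the two patterns. The window.build line is a made-up addition (it is not in the question's HTML) and is there only to show what happens once the script contains a second number:

```python
import re

# Script text based on the question, plus a hypothetical extra assignment
# (window.build is an assumption, added only to trigger the false positive).
script_text = """
    window.build = 20171215;
    window.realty = {"user_id":4243456};
"""

broad = re.compile(r'\d+')
specific = re.compile(r'\{"user_id"\s*:\s*(\d+)\}')

# The broad pattern returns every run of digits it can find.
broad_matches = broad.findall(script_text)

# The specific pattern only matches digits that follow the "user_id" key.
match = specific.search(script_text)
user_id = match.group(1) if match else None

print(broad_matches)
print(user_id)
```

With the broad pattern you would have to guess which of the returned numbers is the user id, while the anchored pattern yields only the one tied to the key.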

Here is a similar topic that covers some other options as well:

Extracting text from script tag using BeautifulSoup in Python
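Since the value assigned to window.realty happens to be valid JSON, one more option is to capture the whole object literal and pass it to json.loads instead of matching the number directly. This is only a sketch under a few assumptions: script_text stands in for the text you would get from the script element, extract_realty and the assignment pattern are hypothetical names, the object is assumed to end at the first "};" after the assignment, and the literal must use JSON-compatible syntax (double-quoted keys).

```python
import json
import re

# Stand-in for the text of the script element (assumed to be already
# extracted from the HTML, for example via BeautifulSoup).
script_text = """
    window.realty = {"user_id":4243456};
    <!--window.agency = < %- JSON.stringify(agency) % >;-->
"""

# Capture the object literal assigned to window.realty. The non-greedy
# match stops at the first "}" that is followed by a semicolon.
assignment = re.compile(r'window\.realty\s*=\s*(\{.*?\})\s*;', re.DOTALL)


def extract_realty(text):
    """Return the window.realty object as a dict, or None if not found."""
    match = assignment.search(text)
    if match is None:
        return None
    return json.loads(match.group(1))


realty = extract_realty(script_text)
print(realty["user_id"])
```

The upside is that you get the value with its proper type (an int here) and access to any other keys the object may contain. The downside is that it breaks as soon as the literal is JavaScript but not JSON, for example with unquoted keys.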

score 4 · Answer 2 · answered Dec 15 '17 at 11:52

You can extract the script tag's text using BeautifulSoup, but to get the user_id you need to use a regex:

import re
from bs4 import BeautifulSoup

# page is assumed to hold the HTML source as a string
# Assuming it's the only number in the script's text
pattern = re.compile(r'\d+')
soup = BeautifulSoup(page, 'lxml')
for i in soup.select('script'):
    print(pattern.findall(i.text))

Output:

['4243456']

SIM · Answer 3 · 2017-12-15T21:19:23.487

String manipulation can be an option if you want to avoid using regex:

from bs4 import BeautifulSoup

content='''
<div class="heading-dom view">
    <script type="application/javascript">
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
        <!--window.agency = < %- JSON.stringify({}) % >-->
     </script>
</div>
'''
soup = BeautifulSoup(content, 'lxml')
# Take everything after 'user_id":' and cut it off at the closing brace
item = soup.select('script')[0].text.split('user_id":')[1].split("}")[0]
print(item)

Output:

4243456
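One caveat with chained split calls is that indexing with [1] raises an IndexError when the key is missing. A slightly more defensive variant of the same string-manipulation idea can be sketched with str.partition. The helper name value_after and its parameters are hypothetical, and the script text is inlined here so the snippet runs without parsing any HTML:

```python
def value_after(text, marker='"user_id":', terminator="}"):
    """Return the text between marker and terminator, or None if marker is absent."""
    # partition returns (before, separator, after); separator is empty
    # when the marker does not occur in the text.
    _, found, rest = text.partition(marker)
    if not found:
        return None
    return rest.partition(terminator)[0].strip()


script_text = 'window.realty = {"user_id":4243456};'

print(value_after(script_text))
print(value_after("window.realty = {};"))
```

The first call prints the id as a string and the second prints None instead of raising.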