This is an interesting problem overall: extracting a value from JavaScript code that is embedded in HTML.
It means you first need to parse the HTML, locate the desired script element, and get its text. The second step is then to extract the desired number from inside the realty object.
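As a rough sketch of that second step on its own, assuming the script text has already been pulled out of the HTML and that the realty object is a flat literal that happens to be valid JSON (the script_text variable below is just a stand-in for that extracted text):

```python
import json
import re

# Assumed input: the text of the <script> element, already extracted from the HTML.
script_text = """
window.realty = {"user_id":4243456};
<!--window.agency = < %- JSON.stringify(agency) % >;-->
"""

# Capture the object literal assigned to window.realty.
# The non-greedy .*? stops at the first "}", so this only handles flat objects.
match = re.search(r"window\.realty\s*=\s*(\{.*?\})\s*;", script_text, re.DOTALL)

# Parse the captured literal as JSON and read the field we need.
realty = json.loads(match.group(1))
user_id = realty["user_id"]
print(user_id)
```

Parsing the object with json.loads gives you the number as an int rather than a string, but it would break on nested objects or on JavaScript that is not strict JSON.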
If you go for a regular expression approach, you can actually reuse the same regular expression both to locate the desired script element and to extract the desired variable (BeautifulSoup allows applying regular expression patterns to find/filter elements):
import re
from bs4 import BeautifulSoup
html = """
<div class="heading-dom view">
<script type="application/javascript">
window.realty = {"user_id":4243456};
<!--window.agency = < %- JSON.stringify(agency) % >;-->
<!--window.agency = < %- JSON.stringify({}) % >-->
</script>
</div>"""
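# Matches {"user_id": <digits>} and captures the digits
pattern = re.compile(r'\{"user_id"\s*:\s*(\d+)\}')
soup = BeautifulSoup(html, "html.parser")
# Find the script element whose text matches the pattern
# ("string" is the current name of the older "text" argument)
script = soup.find("script", string=pattern)
print(pattern.search(script.string).group(1))
# prints 4243456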
Let's break down \{"user_id"\s*:\s*(\d+)\} here:
- backslashes escape characters that have a special meaning in regular expression syntax (here, the curly braces)
- \s* means "zero or more whitespace characters" (it is there just in case there are extra spaces around the : in the object definition)
- \d+ means "one or more digits"
- parentheses define a capturing group, which lets us extract a specific part of the match and then access it via .group(1)
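To see those pieces in action without any HTML involved, here is a small sketch using only the re module. The sample strings are made up for illustration; they are spacing variants of the object from the question:

```python
import re

pattern = re.compile(r'\{"user_id"\s*:\s*(\d+)\}')

# Hypothetical variants of the object with different whitespace around the colon.
samples = [
    '{"user_id":4243456}',
    '{"user_id" : 4243456}',
    '{"user_id":\t4243456}',
]

# \s* absorbs the optional whitespace, so every variant yields the same digits.
extracted = [pattern.search(s).group(1) for s in samples]
print(extracted)

# group(0) is the whole match, group(1) is only the captured digits.
m = pattern.search(samples[0])
whole_match = m.group(0)
digits_only = m.group(1)
```

Note that .group(1) returns a string, so wrap it in int() if you need a number.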
Note that the simple \d+ expression suggested by @Evyatar is too broad, and you may easily get false positives.
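A quick illustration of the false-positive risk, using a made-up script text (not the one from the question) in which another number happens to appear before the user id:

```python
import re

# Hypothetical script text: an unrelated number comes first.
script_text = 'window.version = 2; window.realty = {"user_id":4243456};'

# The broad pattern grabs the first run of digits it finds.
broad = re.search(r"\d+", script_text).group(0)

# The specific pattern only matches digits inside the {"user_id": ...} object.
specific = re.search(r'\{"user_id"\s*:\s*(\d+)\}', script_text).group(1)

print(broad, specific)
```

Here the broad pattern returns the unrelated 2, while the more specific pattern still finds the user id.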
Here are some of the similar topics that contain some other options as well: