<div class="heading-dom view">
    <script type="application/javascript">
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
        <!--window.agency = < %- JSON.stringify({}) % >-->
     </script>
</div>

My desired output is 4243456. How can I extract it using lxml or BeautifulSoup?

alecxe
B.Misha

3 Answers


This is an interesting problem overall: extracting a value from JavaScript code that is embedded in HTML.

It takes two steps. First, parse the HTML: locate the desired script element and get its text. Second, extract the desired number from the realty object inside that text.
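As a rough illustration of the second step only, here is a minimal sketch that treats the object assigned to window.realty as JSON instead of matching the number directly. It assumes the script text has already been pulled out of the HTML, that the object literal is valid JSON, and that it contains no nested braces; the variable names are made up for the example.

```python
import json
import re

# Text of the <script> element, as it would come back from the HTML parsing step.
script_text = """
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
"""

# Capture the object literal assigned to window.realty.
# Assumption: the literal is valid JSON and has no nested braces.
match = re.search(r'window\.realty\s*=\s*(\{.*?\})\s*;', script_text, re.DOTALL)
realty = json.loads(match.group(1))

user_id = realty["user_id"]
print(user_id)
```

Going through json.loads gives you an int rather than a string, and the other keys of the object come along for free if it ever grows.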

If you go for a regular expression approach, you can reuse a single regular expression both to locate the desired script element and to extract the desired variable (BeautifulSoup lets you pass regular expression patterns to find/filter elements):

import re
from bs4 import BeautifulSoup


html = """
 <div class="heading-dom view">
     <script type="application/javascript">
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
        <!--window.agency = < %- JSON.stringify({}) % >-->
     </script>
</div>"""

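# The same pattern both locates the right <script> element and extracts the number.
pattern = re.compile(r'\{"user_id"\s*:\s*(\d+)\}')
soup = BeautifulSoup(html, "html.parser")
# `string=` is the current name of this argument; bs4 releases before 4.4.0 call it `text=`.
script = soup.find("script", string=pattern)

print(pattern.search(script.text).group(1))
# prints 4243456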

Let's break down \{"user_id"\s*:\s*(\d+)\} here:

  • backslashes escape characters that have a special meaning in regular expression syntax (here, the curly braces)
  • \s* means "zero or more whitespace characters" (it is there in case you have extra spaces around the : in the object definition)
  • \d+ means "one or more digits"
  • parentheses define a capturing group, which lets us extract a specific part of the match and then access it via .group(1)
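To see those pieces in action, here is a small sketch that runs the pattern against two made-up variants of the assignment, one with spaces around the colon. The sample strings are assumptions for the demo, not taken from a real page.

```python
import re

pattern = re.compile(r'\{"user_id"\s*:\s*(\d+)\}')

# Two hypothetical spellings of the same assignment.
samples = [
    'window.realty = {"user_id":4243456};',
    'window.realty = {"user_id" : 4243456};',
]

results = [pattern.search(s) for s in samples]
for m in results:
    # group(0) is the whole match, group(1) is just the captured digits.
    print(m.group(0), "->", m.group(1))
```

Both variants match thanks to \s*, and .group(1) yields only the digits in each case.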

Note that the simple \d+ expression suggested by @Evyatar is too broad and you may easily get false positives.
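A quick sketch of what "false positive" means here, assuming a hypothetical script that also sets some other numeric value (the window.page assignment is invented for the demo):

```python
import re

# Hypothetical script text with a second, unrelated number in it.
script_text = 'window.page = {"limit":20}; window.realty = {"user_id":4243456};'

broad = re.findall(r'\d+', script_text)
specific = re.findall(r'\{"user_id"\s*:\s*(\d+)\}', script_text)

print(broad)     # every run of digits in the text
print(specific)  # only the digits that follow "user_id"
```

The broad pattern returns both numbers and leaves you guessing which one is the user id, while the anchored pattern returns only the one you asked for.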


alecxe

You can extract the script tag's text using BeautifulSoup, but to get the user_id out of that text you can use a regex:

import re
from bs4 import BeautifulSoup

# Assuming `page` holds the HTML and this is the only number in the script's text
pattern = re.compile(r'\d+')
soup = BeautifulSoup(page, 'lxml')
for i in soup.select('script'):
    print(re.findall(pattern, i.text))

Output:

['4243456']

Evya

String manipulation can be an option if you want to avoid using regex:

from bs4 import BeautifulSoup

content='''
<div class="heading-dom view">
    <script type="application/javascript">
        window.realty = {"user_id":4243456};
        <!--window.agency = < %- JSON.stringify(agency) % >;-->
        <!--window.agency = < %- JSON.stringify({}) % >-->
     </script>
</div>
'''
soup = BeautifulSoup(content, 'lxml')
# Take whatever follows 'user_id":' and cut it off at the closing brace.
item = soup.select('script')[0].text.split('user_id":')[1].split("}")[0]
print(item)

Output:

4243456
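One caveat with the chained split calls is that they raise an IndexError when the marker is missing. A possible variation, sketched here with str.partition and a hypothetical helper named extract_after, returns None in that case instead. It works on the script text alone, so the BeautifulSoup step is left out.

```python
def extract_after(text, marker, terminator="}"):
    """Return the text between marker and terminator, or None if marker is absent."""
    _, found, rest = text.partition(marker)
    if not found:
        return None
    # Everything up to the first terminator (or the whole rest if there is none).
    return rest.partition(terminator)[0].strip()


script_text = 'window.realty = {"user_id":4243456};'

print(extract_after(script_text, 'user_id":'))
print(extract_after(script_text, 'agency_id":'))
```

The first call prints the id as a string; the second prints None because that key does not appear in the text.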
SIM