4

I'm trying to extract the value of a JavaScript variable from HTML source code using BeautifulSoup.

For example, I have:

<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>

I want to get the value of the variable "my" in Python.

How can I achieve that?

Likak
L. K.

4 Answers

5

The simplest approach is to use a regular expression pattern to both locate the element via BeautifulSoup and extract the desired substring:

import re

from bs4 import BeautifulSoup

data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""

soup = BeautifulSoup(data, "html.parser")

pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
# `string=` is the current name of the argument formerly called `text=` (bs4 >= 4.4.0)
script = soup.find("script", string=pattern)

print(pattern.search(script.text).group(1))

Prints hello.
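If the quoting style or spacing of the declaration varies, the same idea can be wrapped in a small helper. This is only a sketch: the function name extract_js_string_var is hypothetical, it works on the script text alone (for example the text of the script tag found above), and it assumes the value is a simple string literal with no escaped quotes inside it.

```python
import re


def extract_js_string_var(script_text, name):
    """Return the string literal assigned to `var <name>`, or None if absent."""
    pattern = re.compile(
        # (['\"]) captures the opening quote, \1 requires the same closing quote
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(['\"])(.*?)\1\s*;"
    )
    match = pattern.search(script_text)
    return match.group(2) if match else None


script_text = """
var my = 'hello';
var name = "hi";
var is = 'halo';
"""

print(extract_js_string_var(script_text, "my"))
print(extract_js_string_var(script_text, "name"))
print(extract_js_string_var(script_text, "missing"))
```

Because re.DOTALL is not set here, the captured value cannot run past the end of the line, and a missing variable yields None instead of raising an AttributeError.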

alecxe
3

Another idea would be to use a JavaScript parser: locate the variable declaration node, check that its identifier has the desired name, and extract the initializer. Example using the slimit parser:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<script>
var my = 'hello';
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

script = soup.find("script", string=lambda text: text and "var my" in text)

# parse js
parser = Parser()
tree = parser.parse(script.text)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        print(node.initializer.value)

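Prints 'hello' (slimit keeps the surrounding quotes in the string node's value, so strip them if you need the bare text).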

alecxe
0

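The pattern in the answer above, re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL), can return the wrong match. With re.DOTALL the dot also matches newlines, so if the declaration is not immediately followed by a line end, the lazy group keeps expanding across lines until it reaches a later '; that does sit at a line end. When re.MULTILINE and re.DOTALL are set at the same time, remove the end-of-line anchor $.

Tested with Python 3.6.4.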
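A small reproduction of the problem, as a sketch: the input string below is made up for illustration (a trailing comment after the declaration), and the two pattern names are hypothetical.

```python
import re

# The declaration is followed by a comment, so "';" is not at the end of the line.
js = "var my = 'hello'; // greeting\nvar name = 'hi';\n"

anchored = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
unanchored = re.compile(r"var my = '(.*?)';")

# With DOTALL and $, the lazy group runs on to the next "';" that ends a line.
print(repr(anchored.search(js).group(1)))

# Without the anchor (and without DOTALL), the match stops at the first "';".
print(repr(unanchored.search(js).group(1)))
```

The first call captures text from two declarations, while the second returns only hello.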

J.Z
0

Building on @alecxe's answer, here is a more complex case: an array of dictionaries (that is, an array of flat JSON objects):

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<script>
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

script = soup.find("script", string=lambda text: text and "var my" in text)

# parse js
parser = Parser()
tree = parser.parse(script.text)
array_items = []
for node in nodevisitor.visit(tree):
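    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        for item in node.initializer.items:
            # each key/value pair of an object literal is an ast.Assign node;
            # strip the quotes slimit keeps around string values (numbers are left intact)
            parsed_dict = {getattr(n.left, 'value', '').strip('\'"'): getattr(n.right, 'value', '').strip('\'"')
                for n in nodevisitor.visit(item)
                if isinstance(n, ast.Assign)}
            # append inside the loop so every object is collected, not just the last one
            array_items.append(parsed_dict)
print(array_items)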
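When the array literal also happens to be a valid Python literal, as it is in this example, a lighter alternative may be enough. This is a different technique from the slimit approach above, sketched under strong assumptions: the array contains no nested arrays, and it uses only quoted keys, strings and numbers (no true, false or null). Here ast is Python's standard library module, not slimit.ast.

```python
import ast
import re

script_text = """
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
"""

# Capture everything from the opening "[" to the first "]" that is followed by ";".
match = re.search(r"\bvar\s+my\s*=\s*(\[.*?\])\s*;", script_text, re.DOTALL)

# literal_eval only evaluates literals, so it does not execute arbitrary code.
array_items = ast.literal_eval(match.group(1)) if match else []
print(array_items)
```

Unlike the parser-based version, this keeps the numeric values as integers, but it will fail on JavaScript syntax that Python does not share.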
Joey Baruch