2
<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
</head>
<body>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
<div><a href="http://www.google.com">f*** js</a></div>
</body>
</html>

I want to use XPath to select all the `<a>` elements in the HTML page above...

In [1]: import lxml.html as H

In [2]: f = open("test.html","r")

In [3]: c = f.read()

In [4]: doc = H.document_fromstring(c)

In [5]: doc.xpath('//a')
Out[5]: [<Element a at a01d17c>]

In [6]: a = doc.xpath('//a')[0]

In [7]: a.getparent()
Out[7]: <Element div at a01d41c>

I only get the one that isn't generated by JavaScript, but the Firefox XPath Checker add-on finds all of them!?

https://i.stack.imgur.com/0hSug.png

How can I do that? Thanks!

<html>
<head>
</head>
<body>
<script language="javascript">
function over(){
a.innerHTML="mouse me"
}
function out(){
a.innerHTML="<a href='http://www.google.com'>google</a>"
}
</script>
<li id="a" onmouseover="over()" onmouseout="out()">mouse me</li>
</body>
</html>
Dimitre Novatchev
k99

3 Answers

1

I don't know of a JavaScript-aware parser for Python, but you can use ANTLR to do the job. The idea is not mine, so I'm leaving you the link.

It's actually quite cool, because you can tune your parser to choose selectively which instructions need to be parsed (and executed).
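To illustrate the "selective" idea without a full JavaScript grammar, here is a minimal sketch that uses only the Python standard library (`re` and `html.parser`) rather than ANTLR or lxml. It assumes the only instruction worth handling is `document.write` called with a single plain string literal, with no escaped quotes, concatenation, or variables. The names `WRITE_CALL`, `LinkCollector`, and `links_including_written` are made up for this example.

```python
import re
from html.parser import HTMLParser

# Matches document.write('...') or document.write("...") with one string literal.
# Assumption: no escaped quotes, concatenation, or variables inside the call.
WRITE_CALL = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""", re.S)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


def links_including_written(html):
    """Return hrefs from the static markup plus markup passed to document.write."""
    collector = LinkCollector()
    # Static pass: the parser treats <script> bodies as opaque text,
    # so only links present in the markup itself are found here.
    collector.feed(html)
    # Selective pass: parse just the string handed to each document.write call.
    for match in WRITE_CALL.finditer(html):
        collector.feed(match.group(2))
    collector.close()
    return collector.hrefs
```

The result lists the static links first and the written ones afterwards, so it does not preserve document order. Anything beyond literal `document.write` calls (event handlers, `innerHTML` assignments, computed strings) is outside what this sketch can see.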

dierre
  • Nice! And from the same question you linked, http://pypi.python.org/pypi/python-spidermonkey/ seems worth considering as well. – redShadow Dec 28 '10 at 02:10
0

In Java there is Cobra. I don't know of any JavaScript-aware HTML parser for Python.

Paulo Scardine
0

Searching Google for "javascript standalone runtime", I found jslibs: a "standalone JavaScript development runtime environment for using JavaScript as a general-purpose scripting language", based on "SpiderMonkey library that is Gecko's JavaScript engine".

Sounds great! I haven't tested it yet, but it seems like this would allow you to run the JavaScript code you find in the page. I don't know how tricky that will be, though.
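Before any engine can run the page's JavaScript, the script sources have to be pulled out of the HTML. Here is a rough sketch of that first step only, using the standard library's `html.parser`. The names `ScriptCollector` and `inline_scripts` are invented for this example, and handing the result to jslibs or another engine is not shown.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Gathers the text of each inline <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")  # start a new script body

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text can arrive in several chunks, so append to the current body.
        if self._in_script:
            self.scripts[-1] += data


def inline_scripts(html):
    """Return the source of every inline script, in document order."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [source.strip() for source in collector.scripts]
```

Note that code such as `document.write(...)` expects a `document` object, which a bare JavaScript engine does not provide on its own, so the extracted sources would still need some stand-in for it.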

redShadow
  • Not quite... it's just the language bindings and doesn't have the DOM API. Most real-world JavaScript still won't work in it. By the time you add all the parts you need, you will have... a browser. Or, the closest thing I know of is [HtmlUnit](http://htmlunit.sourceforge.net/). – Keith Dec 28 '10 at 05:15