When I write scrapers, I always use the excellent XPath query language to extract data from HTML or XML.

I often work with dynamic HTML and need to extract some variables from JavaScript code, so I am forced to write ugly regexps to do that.

I am looking for a better way to do this, without involving a heavyweight JavaScript interpreter like PhantomJS.

I know that there are a lot of tools that parse syntax into XML or JSON files, and I am looking for something like that which can parse JS syntax.

uhbif19

1 Answer

You are right that "ugly regexps" can't really be used to process arbitrary JS (or any other standard programming language, for that matter). You need a full-fledged parser.

There aren't "lots of tools" that parse (language) syntax to XML. Most real language tools have parsers which build an internal AST data structure designed for efficient access, which the tool then uses to achieve its purpose (analysis, transformation, execution). You say "translate to its tree" as if that tree were unique; it isn't. The ASTs built are a function of the parsing technology, the grammar used, and what the designer thought was important to access, so no two language tools agree on what the AST should look like. Tree shapes are thus tool-dependent.

If you get your hands on the source code for any such tool, you can throw away its post-parsing machinery and add code to walk the AST and dump XML; this is not particularly hard (although getting all the output character escaping/encoding right is a royal PITA). The XML you get will be shaped according to the original tool's AST, of course. That means any tool you build to process the XML must implicitly understand the tree shape produced by the particular parser that you started with.
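To make the "walk the AST and dump XML" step concrete, here is a minimal sketch. It is only an illustration under loud assumptions: it uses Python's standard `ast` module as a stand-in parser (so it parses Python source, not JavaScript), and the helper names `to_xml` and `dump_xml` are made up for this example. It also leans on `xml.etree.ElementTree` for escaping and does nothing about characters XML cannot represent. A JavaScript version would need a real JS parser in place of `ast.parse`, and its XML would have a different shape.

```python
import ast
import xml.etree.ElementTree as ET


def to_xml(node):
    """Recursively convert an ast.AST node into an ElementTree element.

    Hypothetical helper: node types become tags, child nodes become
    nested elements, and plain values become attributes.
    """
    elem = ET.Element(type(node).__name__)
    for name, value in ast.iter_fields(node):
        if isinstance(value, ast.AST):
            # Single child node: wrap it in an element named after the field.
            ET.SubElement(elem, name).append(to_xml(value))
        elif isinstance(value, list):
            # List of children: one wrapper element holding every item.
            wrapper = ET.SubElement(elem, name)
            for item in value:
                if isinstance(item, ast.AST):
                    wrapper.append(to_xml(item))
                else:
                    ET.SubElement(wrapper, "item").text = repr(item)
        elif value is not None:
            # Leaf value (identifier, constant, ...): store as an attribute.
            elem.set(name, value if isinstance(value, str) else repr(value))
    return elem


def dump_xml(source):
    """Parse source text (Python, in this sketch) and return the XML root."""
    return to_xml(ast.parse(source))


root = dump_xml("config = {'user_id': 42}")
# ElementTree supports a limited XPath subset, enough for simple queries.
names = [n.get("id") for n in root.findall(".//Assign/targets/Name")]
print(names)  # prints ['config']
print(ET.tostring(root, encoding="unicode"))
```

Note that the query path `.//Assign/targets/Name` only works because it mirrors the field names of this particular parser's AST, which is exactly the tool dependence described above.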

I happen to build general-purpose program transformation machinery (see bio), which has parsers for many languages including JavaScript. We get the "I wish I had XML" request often enough that our particular tool will produce XML at the flip of a command-line switch, using exactly the means described above. Here's a link to an SO question showing the XML output for Java, and one for C++. If you want to see one for JavaScript, I could produce that and attach it here with only a bit of effort.

Ira Baxter