10

I have a new project that involves fetching a web page (using PHP and cURL), parsing the HTML and JavaScript out of it, and then handling the data in the results.

Basically, I hit a brick wall when the site uses JavaScript to fetch its data via AJAX. In that case, the initial data does not appear in the fetched page unless the JavaScript is run in a browser.
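To make the problem concrete, here is a minimal sketch in Python (the sample markup, the `results` id, and the `/data.json` URL are all made up for illustration). A static parser working on the fetched markup sees only an empty placeholder, because the script that would fill it never runs:

```python
from html.parser import HTMLParser

# Hypothetical page as returned by a plain HTTP fetch (e.g. cURL).
# The container is empty because the script that fills it has not run.
SAMPLE_PAGE = """
<html><body>
  <div id="results"></div>
  <script>
    fetch("/data.json").then(r => r.json()).then(show);
  </script>
</body></html>
"""

class TextInElement(HTMLParser):
    """Collect the text found inside the element with a given id.

    Simplification: assumes every nested tag has a matching end tag.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def text_of(html, target_id):
    """Return the stripped text content of the element with target_id."""
    parser = TextInElement(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()

# The placeholder has no text: the data only exists after the script runs.
print(repr(text_of(SAMPLE_PAGE, "results")))
```

The same thing happens with PHP's DOM parsing on a cURL response: the parser is working correctly, the data simply is not in the markup yet.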

Are there any PHP libraries for this? (I suspect not, but I could be wrong.)

I would really rather build this as a server-based solution; otherwise I am forced to build an application for this using Mozilla and/or IE runtime libraries, which kind of defeats the purpose.

Talvi Watia
  • Update on the project: My server is a LAMP machine. My test server is WAMP on my laptop, which mirrors it. This needs to be *SERVER BASED*, not browser based, so running Java or JavaScript in-browser is not an option. (jQuery also seems to be browser based.) In other words, a cron job would call the PHP file, which in turn would cURL a web page. The page would be parsed for HTML, and any JavaScript would need to be interpreted into a DOM model. Rhino looks promising, but Java is not part of the shell build on the server. V8/SquirrelFish is C++ code I would need to convert to PHP. – Talvi Watia Nov 20 '09 at 08:56
  • 2
    don't comment on your answer, just edit it – hasen Nov 20 '09 at 14:11
  • Update x2: There is a solution using .NET and IE in a root shell. I personally won't touch this with a ten-foot pole! It gives me a headache to imagine all the problems with JavaScript rendered for M$ while the rest of the known world uses everything STANDARD. Of course, this uses a dedicated host. Of course, this is *NOT* web based. And yes, now you might be wondering how you could do this with IIS instead of WAMP. – Talvi Watia Jan 29 '10 at 08:57

8 Answers

17

You will need:

  • one JavaScript interpreter
  • one DOM Level 2 Core and HTML implementation
  • 500g of non-standard but commonly-used DOM extensions
  • a pinch of DOM Level 2 Style (which might mean also a CSS interpreter and layout engine)
  • yoghurt pots, round-ended scissors and sticky-back plastic

Once you have assembled your components (remember to get a grown-up to help you with the sandboxing), you'll find what you have is essentially indistinguishable from a web browser.

"Java is not part of the shell build on the server. V8/SquirrelFish is C++ code I would need to convert to PHP."

Porting a JS engine to PHP would be a huge task, and the resulting performance would likely be horrible. You can't even really get away with a near-solution for JavaScript any more, since so many pages use hideously complex libraries like jQuery to do everything, which requires in-depth JS support.

I don't think you're going to be able to do this purely in PHP. You'll have to hook up Java/Rhino/HtmlUnit or a proper web browser like Mozilla. If your hosting environment doesn't give you the flexibility you need to compile and deploy that sort of thing, you'd have to move to a better hosting setup with a shell (preferably a VPS).

If you can avoid this unpleasantness some other way, by special-casing known pages' AJAX access, do that.
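As a rough illustration of that special-casing, here is a minimal Python sketch. Everything site-specific in it is an assumption: the regular expression, the `/api/items.json` path, and the shape of the response are made up, and a real page needs its own hand-written rule. The idea is to find the endpoint the page's script calls and request it directly, skipping JavaScript execution entirely:

```python
import json
import re
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# Hypothetical pattern: matches the first quoted URL ending in ".json"
# inside the page source. Each known page needs its own rule.
AJAX_URL_PATTERN = re.compile(r"""["']([^"']+\.json)["']""")

def find_ajax_url(page_html, page_url):
    """Return the absolute URL of the data endpoint, or None if not found."""
    match = AJAX_URL_PATTERN.search(page_html)
    if match is None:
        return None
    return urljoin(page_url, match.group(1))

def fetch_json(url, timeout=10):
    """Request the endpoint directly, as the page's own script would."""
    # Some endpoints check this header; whether a given site does is
    # site-specific and has to be verified by hand.
    request = Request(url, headers={"X-Requested-With": "XMLHttpRequest"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Offline demonstration with made-up markup and a made-up response body.
page = '<div id="results"></div><script>load("/api/items.json");</script>'
endpoint = find_ajax_url(page, "http://example.com/list.html")
canned_body = '{"items": [{"name": "first"}, {"name": "second"}]}'
names = [item["name"] for item in json.loads(canned_body)["items"]]
print(endpoint, names)
```

In the asker's setup the same two steps would be a second cURL request followed by `json_decode`. The cost is that each supported site needs its own rule, and the rule breaks whenever the site changes its endpoint.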

bobince
4

You can run a JavaScript engine such as Rhino on a server.

Here are a few alternatives:

  • Rhino (Java based)
  • V8 (Used by Chrome, C++)
  • SquirrelFish (C++)

While these can run JS, I'm not sure that what you're doing is the best approach. However, since you haven't specified the purpose of your program, I can't offer any suggestions in that regard.

Jani Hartikainen
  • Not sure about the others, but Rhino won't be able to run most client-side JavaScript on its own, because it doesn't implement the DOM. – Ben Dunlap Nov 20 '09 at 07:04
4

You'll have to go one step further than Rhino if you want to execute real live web pages, because the JavaScript on those pages will expect to be able to use objects that are native to a browser environment. A server-side JavaScript engine like Rhino won't have those objects.

John Resig (creator of jQuery) started a project called Env.js a couple of years ago; it might be what you're looking for, but I suspect you'll have a tough time getting consistent results from a wide variety of web pages. Here's his initial blog post about it:

http://ejohn.org/blog/bringing-the-browser-to-the-server/

Some similar projects are named in that post's comments.

Ben Dunlap
3

Previously asked here: headless internet browser?

At Mozilla we get this question a lot. There's no good answer. What you want is a software library that implements pretty much everything a browser needs to do (at least as far as networking, JavaScript, HTML parsing, and the DOM), but with no display.

The closest thing I know of is HtmlUnit (in Java).

Jason Orendorff
1

I know you have said no Java, but for reference you might be interested in Qt Jambi. It includes an implementation of WebKit which you can run in headless mode.

Joel
1

All these answers seem to presume that there is no possibility of JavaScript emulation in PHP, but there is a near-fully-compliant, open-source JavaScript emulator for PHP here:

http://www.sitepoint.com/blogs/2006/01/19/j4p5-javascript-for-php5/

Combined with Env.js, you could get pretty close to a full server-side JS execution solution.

Nick Lockwood
0

You could take a look at Rhino. It uses Java; I have never heard of a PHP port.

Are you obligated to run the actual javascript?

RageZ
0

To be honest, you will have a hard time using just a JS engine, as you also have to recreate the browser's scripting environment, such as the DOM and window objects. If you are running on a Windows server, you could fairly easily use the IE COM objects to load and execute the web page, accessing the DOM programmatically and pulling the contents back out. As for a server running Linux and/or Mozilla, I unfortunately have no experience.

But really what are you trying to do?

tyranid