2

The requirement is to keep a server-side copy of each complete web page, exactly as it was rendered in the client's browser, as a historical record. These records are revisited later.

We currently store the HTML of the rendered web page. When a record is revisited, that HTML is rendered using the JavaScript, CSS, and image resources present on the server. These resources keep changing, so old records no longer render correctly.

Is there another way to solve this? We are also considering converting the page to PDF using the iText or Apache FOP APIs, but neither takes the effect of JavaScript on the page into account during conversion. Are there any Java APIs that can achieve this?

So far, no approach has worked perfectly. Please suggest one.

Edit: In summary, the requirement is to create an exact copy of the rendered web page on the server side, in order to record user activity on that page.

Abhishek Jain
  • Are you trying to capture just the information in the page, or the exact appearance of the page? – Dave Jan 17 '12 at 20:07

5 Answers

1

wkhtmltopdf should do this quite nicely for you. It takes a URL and produces a PDF.

code.google.com/p/wkhtmltopdf

Example:

wkhtmltopdf http://www.google.com google.pdf
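Since the question asks for something usable from Java, one way to run the command above is through `ProcessBuilder`. This is only a sketch under a few assumptions: the `wkhtmltopdf` binary is installed and on the `PATH`, and the class name `PageArchiver`, its method names, and the timeout parameter are invented for illustration. The tool's output is discarded so the child process cannot block on a full pipe, and the wait is bounded by a timeout.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; class and method names are illustrative only. */
final class PageArchiver {

    /** Builds the same command line as the shell example: wkhtmltopdf <url> <pdf>. */
    static List<String> buildCommand(String url, Path pdfFile) {
        return List.of("wkhtmltopdf", url, pdfFile.toString());
    }

    /**
     * Runs wkhtmltopdf (assumed to be on the PATH) and reports whether it
     * exited with status 0 before the timeout elapsed.
     */
    static boolean archive(String url, Path pdfFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(url, pdfFile))
                .redirectErrorStream(true)                        // merge stderr into stdout
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // never let the pipe fill up
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();                            // give up on a hung conversion
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

A call such as `PageArchiver.archive("http://www.google.com", Path.of("google.pdf"), 60)` would then correspond to the command shown above.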
Hoppy
  • 1
    I found some useful links in support of this: http://stackoverflow.com/questions/5688585/how-to-use-wkhtmltopdf-in-java-web-application http://stackoverflow.com/questions/5506275/launching-wkhtmltopdf-from-runtime-getruntime-exec-never-terminates Let me give it a try. Thanks for your help. – Abhishek Jain Jan 19 '12 at 17:54
1

Depending on just how sophisticated your javascript is, and depending on how faithfully you want to capture what the client saw, you may be undertaking an impossible task.

At a high level, you have the following options:

  1. Keep a copy of everything you send to the client
  2. Get the client to return back exactly whatever it has rendered
  3. Build your system in such a way that you can actually fetch all historical versions of the constituent resources if/when you need to reproduce a browser's view.

You can do #1 with servlet filters and similar mechanisms, but that does not address cases such as JavaScript fetching dynamic HTML content while the page renders on the client.

Getting the client to return what they are seeing (#2) is tricky, and bandwidth intensive.

So I would opt for #3. To make a website that renders dynamic content versioned, you have to do several things. First, all data sources need to be versioned too, so every query has to specify the version. The "version" can be a timestamp or a generation counter that you maintain. With this approach, you also need to ensure that any JavaScript you send to the client does not fetch external resources directly. Instead, it should request every resource from your system, which in turn fetches the external content (or reuses it from a cache).
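To make the timestamp-as-version idea concrete, here is a rough in-memory sketch. Everything in it is an assumption for illustration: the class name `VersionedResourceStore`, the method names, and the choice to keep the history in a `TreeMap` rather than in a database or on disk. The point is only the lookup rule: a request for a resource "as of" some instant gets the newest version that was already in effect at that instant.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical in-memory store; a real system would persist the versions. */
final class VersionedResourceStore {

    // resource path -> (time the version took effect -> content)
    private final Map<String, TreeMap<Instant, byte[]>> versions = new HashMap<>();

    /** Records a new version of a resource, effective from the given instant. */
    synchronized void put(String path, Instant validFrom, byte[] content) {
        versions.computeIfAbsent(path, p -> new TreeMap<>())
                .put(validFrom, content.clone());
    }

    /** Returns the version that was in effect at the given instant, if any. */
    synchronized Optional<byte[]> get(String path, Instant asOf) {
        TreeMap<Instant, byte[]> history = versions.get(path);
        if (history == null) {
            return Optional.empty();
        }
        // floorEntry: the latest version whose start time is <= asOf
        Map.Entry<Instant, byte[]> entry = history.floorEntry(asOf);
        return entry == null
                ? Optional.empty()
                : Optional.of(entry.getValue().clone());
    }
}
```

Replaying an old record would then mean serving every script, stylesheet, image, and data query through `get(path, recordTimestamp)` instead of from the current files.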

Dilum Ranatunga
  • Thanks for your answer. We are already using filters to capture the HTML content, and I do not like that approach. Looking forward to a better idea. – Abhishek Jain Jan 19 '12 at 17:00
0

The answer would depend on the server technology being used to write the HTML. Are you using Java/JSPs or Servlets or some sort of an HTTPResponse object to push the HTML/data to the browser?

If only the CSS/JS/HTML are changing, why don't you just take snapshots of your client-side codebase and store them as website versions?

If other data is involved (like XML/JSON), take snapshots of that data and version them as well. The snapshot of the client codebase mentioned above, combined with the snapshot of the data taken at the same time, should then give you the exact rendering of your website as of that point in time.

Sid
0

A very resource-consuming requirement but...

You haven't said which application server and framework you are using. If you're generating the responses in your own code, you can simply store each one as you generate it.

Another possibility is to write a filter that wraps the servlet's OutputStream and logs everything written to it. You just have to make sure that your filter sits at the top of the filter chain.
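The filter idea hinges on a stream that forwards every byte to the real response while keeping a copy. Below is a minimal sketch of just that piece, written against plain `java.io` so it runs without a servlet container. The class name `CapturingOutputStream` is made up for illustration, and wiring it into a response wrapper inside an actual filter is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical tee stream: writes through to the target and keeps a copy. */
final class CapturingOutputStream extends OutputStream {

    private final OutputStream target;
    private final ByteArrayOutputStream copy = new ByteArrayOutputStream();

    CapturingOutputStream(OutputStream target) {
        this.target = target;
    }

    @Override
    public void write(int b) throws IOException {
        target.write(b);   // the client still receives the byte
        copy.write(b);     // and the same byte is kept for the archive
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        target.write(buf, off, len);
        copy.write(buf, off, len);
    }

    @Override
    public void flush() throws IOException {
        target.flush();
    }

    @Override
    public void close() throws IOException {
        target.close();
    }

    /** Returns everything written so far, ready to be stored as a record. */
    byte[] captured() {
        return copy.toByteArray();
    }
}
```

Note that the copy is buffered in memory, which fits the "very resource-consuming" warning above; for large responses the copy could be streamed to a file instead.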

Another solution is very powerful, generic, and the easiest to manage, but possibly the most resource-consuming: write a transparent proxy server that sits between the user and the application server. It forwards each call to the app server and returns the exact response, saving every request and response along the way.

Danubian Sailor
0

If you're storing the html page, why not the references to the js, css, and images too?

I don't know what your implementation is now, but you should create a filesystem with all of the html pages and resources, and create references to the locations in a db. You should be backing up the resources in the filesystem every time you change them!

I use this implementation for an image archive. When a client passes us the URL of an image, we want to be able to go back and check exactly what the image was at the time they sent it (since it's a URL, it can change at any time). I have a script that downloads the image as soon as we receive the URL, stores it in the filesystem, and then stores the path to the file in the db along with various other details. This is similar to what you need, with just a couple more rows in your table for the JS, CSS, and image paths.
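A rough Java sketch of the "store it in the filesystem, keep the path in the db" step might look like the following. It is not the script described above. The class name `ResourceSnapshots` is invented, and naming each file after the SHA-256 hash of its content is an added assumption, chosen so that an unchanged resource is only stored once. The returned path is what would go into the database row next to the page record.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical snapshot helper; names and layout are illustrative only. */
final class ResourceSnapshots {

    /**
     * Saves the content under a file named after its SHA-256 hash and
     * returns the path, which the caller would record in the database.
     */
    static Path store(Path archiveDir, byte[] content) throws IOException {
        Files.createDirectories(archiveDir);
        Path file = archiveDir.resolve(sha256Hex(content));
        if (Files.notExists(file)) {        // identical content is written only once
            Files.write(file, content);
        }
        return file;
    }

    /** Returns the lowercase hexadecimal SHA-256 digest of the content. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is required of every Java platform implementation
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `store` for each JS, CSS, and image file whenever it changes, and saving the returned paths with the page's HTML, gives each record its own fixed set of resources.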

Matt K