I am trying to find a way to open a webarchive file, or convert it to any other format, in C#. The goal is an automated import system with as few restrictions on file type as possible. I cannot find any way of converting the file other than opening it in Safari.

gashach
  • The webarchive format is proprietary to Apple, and designed only for Safari per this [wiki article](http://en.wikipedia.org/wiki/Webarchive). What are you trying to do with it? – JNYRanger Jul 07 '14 at 15:23
  • Here's a GitHub link to an app someone developed for extracting webarchives using Objective-C, which may be helpful to you: [GitHub - WebArchiveExtractor](https://github.com/robrohan/WebArchiveExtractor) – JNYRanger Jul 07 '14 at 15:29
  • I am trying to import it into a document management system. – gashach Jul 07 '14 at 15:45
  • Why don't you just load it into your document management system as a webarchive then? What are you trying to convert it into? – JNYRanger Jul 07 '14 at 16:01
  • The doc management system will not accept a webarchive file. I was ultimately hoping to convert to PDF. – gashach Jul 07 '14 at 17:00

1 Answer

Unfortunately, what you are looking for cannot really be done. A webarchive is a proprietary file type made by Apple to display offline web pages in Safari. It is a combination of XML, HTML, and binary data. There are, however, examples in Objective-C that convert a webarchive to a zip archive containing the HTML and the embedded images/media originally displayed on the page that was saved into the webarchive file.

Here is an Objective-C example from [GitHub - WebArchiveExtractor](https://github.com/robrohan/WebArchiveExtractor)
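
The same kind of extraction can be sketched outside Objective-C. The following is a minimal Python sketch, not a port of WebArchiveExtractor. It assumes the file is an Apple property list whose top level has `WebMainResource` and `WebSubresources`, with each resource carrying `WebResourceData` and `WebResourceURL`, which is how Safari-saved archives are commonly laid out. The function name `webarchive_to_zip` and the file-naming scheme are hypothetical. Python is used only because its standard library ships a plist reader (`plistlib`); a C# version would need an equivalent plist parser.

```python
import io
import plistlib
import posixpath
import zipfile
from urllib.parse import urlparse


def webarchive_to_zip(archive_bytes: bytes) -> bytes:
    """Repack the resources of a .webarchive into an in-memory zip file."""
    # plistlib.loads auto-detects binary and XML property lists.
    root = plistlib.loads(archive_bytes)
    # Assumption: main page first, then images/CSS/scripts.
    # Nested frames (WebSubframeArchives) are ignored in this sketch.
    resources = [root["WebMainResource"]] + list(root.get("WebSubresources", []))

    out = io.BytesIO()
    used = set()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for index, res in enumerate(resources):
            if index == 0:
                name = "index.html"
            else:
                # Keep only the last path segment of the original URL.
                path = urlparse(res.get("WebResourceURL", "")).path
                name = posixpath.basename(path)
                if name in ("", ".", ".."):
                    name = f"resource_{index}"
            if name in used:
                # Two resources with the same file name: prefix the index.
                name = f"{index}_{name}"
            used.add(name)
            zf.writestr(name, res["WebResourceData"])
    return out.getvalue()
```

Calling `webarchive_to_zip(open("page.webarchive", "rb").read())` returns the zip bytes. Links inside the extracted `index.html` still point at the original URLs, so a real converter would also have to rewrite them.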

As for converting to PDF, I'm not sure that can be done. You would be better off printing the web page to PDF in the first place and then uploading that to your document management system.

Apparently, though, the webarchive file type contains XML with binary-encoded images and media, similar to an MHTML file. You may be able to figure out the format by viewing a few files in a text editor and then writing a conversion utility, but there is very limited information on the web about the internal schema of the webarchive format, so this may be a daunting task. However, since WebKit is open source, you can read their code for creating an archive and try to reverse it to build your converter. Here's the source code (in C++) for the archiving features in Safari. It looks like they are using MHTML, but I haven't explored deeply enough to tell whether it's exactly the same format: http://trac.webkit.org/browser/trunk/Source/WebCore/loader/archive
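
Before writing a converter, it may help to look at what a given file actually contains. In practice, archives saved by Safari usually parse as Apple property lists, so a generic plist reader can stand in for the text editor. The sketch below is in Python and is only an inspection aid. The key names (`WebMainResource`, `WebSubresources`, `WebResourceMIMEType`, `WebResourceData`, `WebResourceURL`) are assumed from commonly seen Safari output, and `summarize_webarchive` is a hypothetical helper.

```python
import plistlib


def summarize_webarchive(archive_bytes: bytes) -> list[str]:
    """Return one tab-separated line per resource: MIME type, size, URL."""
    # Raises plistlib.InvalidFileException if the bytes are not a plist,
    # which is itself a useful hint about the file's real format.
    root = plistlib.loads(archive_bytes)
    resources = [root["WebMainResource"]] + list(root.get("WebSubresources", []))
    lines = []
    for res in resources:
        mime = res.get("WebResourceMIMEType", "?")
        size = len(res.get("WebResourceData", b""))
        url = res.get("WebResourceURL", "?")
        lines.append(f"{mime}\t{size}\t{url}")
    return lines
```

Printing the result for a handful of sample files shows which MIME types the import system would need to handle.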

Good Luck!

JNYRanger
  • The down vote is for the common misinformation that the format is proprietary. was [fixed eleven years ago](http://trac.webkit.org/changeset/6580); WebResource, WebArchive and related APIs were published. – Graham Perrin Jul 24 '15 at 05:03
  • @GrahamPerrin As someone who needs to open one of these files right now, the "standard" being published isn't exactly helpful. In those eleven years, no one has written or ported a tool for reading .webarchive files on Linux/Unix/BSD. – Sparr Dec 01 '15 at 00:54
  • @Sparr [work in progress](https://forums.pcbsd.org/thread-20082-post-112422.html#pid112422) … – Graham Perrin Dec 02 '15 at 05:35