Error opening megawarc archive from Python

Question

I've found myself having to use a Python script to access a web archive.

What I have is a 'megawarc' web archive file from http://archive.org/details/archiveteam-fanfiction-warc-11. I need to un-megawarc this, using the python script found at https://github.com/alard/megawarc.

I'm trying to run the 'restore' command, and I have the three files needed (FILE.warc.gz, FILE.tar, and FILE.json.gz) from the first link.

I have both Python 2.7 and 3.3 installed.

--------------update--------------

I've run both this method:

python megawarc restore FILE

and this method:

Make sure you have the files megawarc and ordereddict.py in the same directory, with the files you want to convert.
Rename the file megawarc to megawarc.py
Open a python console in this directory

Type the following code (line by line) :

import sys
sys.argv = ['megawarc','restore','FILE']
import megawarc
megawarc.main()

using Python 2.7, and this is what I get:

c:\Python27>python megawarc restore FILE
Traceback (most recent call last):
  File "megawarc", line 563, in <module>
    main()
  File "megawarc", line 552, in main
    mwr.process()
  File "megawarc", line 460, in process
    self.process_entry(entry, tar_out)
  File "megawarc", line 478, in process_entry
    entry["target"]["offset"], entry["target"]["size"])
  File "megawarc", line 128, in copy_to_stream
    raise Exception("End of file: %d bytes expected, but %d bytes read." % (buf_size, l))
Exception: End of file: 4096 bytes expected, but 236 bytes read.

Is there something else I'm missing?

I have the following files all in c:\python27

FILE.megawarc.json.gz

FILE.megawarc.tar

FILE.megawarc.warc.gz

megawarc

ordereddict.py

Is this some kind of corrupt-file error, or is there something I'm missing?
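One way to test the corrupt-file idea is to check whether the two gzip files decompress all the way to the end. The following is only a minimal sketch for Python 3: `gzip_reads_cleanly` is a hypothetical helper name, not part of megawarc, and it says nothing about the .tar file.

```python
import gzip
import zlib


def gzip_reads_cleanly(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses to the end.

    Hypothetical helper for illustration; it is not part of megawarc.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Read and discard the data chunk by chunk; only errors matter.
            while stream.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the file ends before the gzip stream does (truncated).
        # OSError / zlib.error: damaged data, or the file cannot be opened.
        return False
    return True
```

Calling `gzip_reads_cleanly("FILE.megawarc.warc.gz")` and getting `False` would suggest an incomplete or damaged download; `True` would only rule out damage at the gzip level.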

Pssh, nobody has the wrong mindset for programming. Simply the fact that you've been trying to solve a problem with programming means you could be a great programmer. It's all about problem solving. — Slater Victoroff, Jun 12 '13 at 12:04
So you executed the Python script you were given? Did you open a command shell to run it so you could see the error messages, if any? — duffymo, Jun 12 '13 at 12:05
No need for self-deprecation: Think about Stack Overflow like a cookbook: the title and descriptions are the names of the recipes, the answers are the instructions for making the food. A cookbook is a lot less valuable if the recipes are cluttered with side-comments. Comments, in SO are like the side bars where you can share interesting tidbits, or apologize/explain yourself! :) Welcome aboard! — Chris Pfohl, Jun 12 '13 at 13:06

score 6 · Answer 1 · edited Jun 12 '13 at 12:58

On the second link you provided, there are two important files :

megawarc
ordereddict.py

The executable script is megawarc. To run it, you have to launch it in a shell with

python megawarc restore FILE

Alternatively, if you're using a UNIX-based system, you can do

chmod +x megawarc

to give the megawarc script the executable permission, and then run it with

./megawarc restore FILE

Here, FILE is the literal name to type if your three files are FILE.warc.gz, FILE.tar, and FILE.json.gz. Otherwise, replace it with the common prefix of your three input files.
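The prefix rule can be checked before running the script. This is a small sketch, assuming only the three suffixes named above; `missing_megawarc_parts` is a hypothetical helper, not something the megawarc script provides.

```python
import os

# The three suffixes named in the answer above.
MEGAWARC_SUFFIXES = (".warc.gz", ".tar", ".json.gz")


def missing_megawarc_parts(prefix):
    """Return the expected file names for `prefix` that are not on disk.

    Hypothetical helper for illustration; it is not part of megawarc.
    """
    expected = [prefix + suffix for suffix in MEGAWARC_SUFFIXES]
    return [name for name in expected if not os.path.isfile(name)]
```

An empty list means all three inputs are present for that prefix. By this rule, files named like FILE.megawarc.warc.gz would appear to need the prefix FILE.megawarc rather than FILE.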

EDIT :

Okay, I found an alternative that works if you don't have a standard shell to start the script from the command line. What you have to do is:

Make sure you have the files megawarc and ordereddict.py in the same directory, with the files you want to convert.
Rename the file megawarc to megawarc.py
Open a python console in this directory

Type the following code (line by line) :

import sys
sys.argv = ['megawarc','restore','FILE']
import megawarc
megawarc.main()

This should work; I've just tried it. Hope it helps.
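A variation on the same idea, sketched here rather than tested against megawarc itself, uses the standard-library `runpy` module so the file does not have to be renamed. `run_script_with_args` is a hypothetical wrapper name. It assumes the console was opened in the directory holding megawarc and ordereddict.py, so that the script's own imports resolve.

```python
import runpy
import sys


def run_script_with_args(script_path, args):
    """Run the Python file at `script_path` as if started with `args`.

    Hypothetical wrapper for illustration; it is not part of megawarc.
    """
    saved_argv = sys.argv
    # The script reads its command and file prefix from sys.argv.
    sys.argv = [script_path] + list(args)
    try:
        # run_name="__main__" lets any `if __name__ == "__main__":`
        # block in the script execute, as it would from a shell.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        # Put the console's original arguments back afterwards.
        sys.argv = saved_argv
```

With this defined, `run_script_with_args("megawarc", ["restore", "FILE"])` would stand in for the four lines typed above.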

answered Jun 12 '13 at 12:08 · ibi0tux

With the confusing caveat that FILENAME appears to be a filename prefix, not an actual filename. – kampu Jun 12 '13 at 12:12
>>> python megawarc restore test SyntaxError: invalid syntax >>> – Sarah Waters Jun 12 '13 at 12:29
megawarc is highlighted as being the invalid syntax – Sarah Waters Jun 12 '13 at 12:29
Oh, I see what your problem is. You're trying to start python while you're already inside a Python console. Do you have a shell? (What's your OS?) – ibi0tux Jun 12 '13 at 12:32
using windows 7, have the python 3.3 shell IDLE (python GUI) – Sarah Waters Jun 12 '13 at 22:06
@SarahWaters - do [this](http://stackoverflow.com/questions/3701646/how-to-add-to-the-pythonpath-in-windows-7) and then open PowerShell, `cd` to your directory with your files, and do `python megawarc restore FILE` etc. etc. If your directory has spaces in it, enclose the directory in quotes, e.g. `cd "D:/My Python Scripts/"` – 2rs2ts Jun 13 '13 at 01:24
@SarahWaters more specifically, do [this one](http://stackoverflow.com/a/14753412/691859). – 2rs2ts Jun 13 '13 at 01:27
