How does Opera Turbo compress the data (cache)?

Question

I have an Opera browser with "Opera Turbo" enabled. Turbo is a proxy that recompresses HTML into a smaller format. I have a file from the Opera cache that Turbo compressed from 2000 KB to 500 KB. How can I decompress this file into readable form? (The original file has almost no HTML tags: just 8-bit text, "<p>" tags, and an HTML header/footer.)

Here is an example of such a file:

.opera$ hexdump -C cache/turbo/g_0000/opr00003.tmp
00000000  78 da 6c 8f bf 4e c4 30  0c c6 67 fa 14 26 48 6c  |xзl▐©Nд0.фgЗ.&Hl|
00000010  a1 1c 12 d3 25 1d f8 37  82 54 f1 02 69 63 48 74  |║..с%.Ь7┌TЯ.icHt|
00000020  69 52 12 97 d2 b7 ed 88  40 80 b8 05 06 06 7a 57  |iR.≈р╥М┬@─╦...zW|
00000030  09 21 84 27 fb f3 cf 9f  6d 61 a8 71 45 26 0c 2a  |.!└'ШСо÷ma╗qE&.*|
00000040  5d 64 3b a2 41 52 60 88  5a 8e 77 9d bd 97 ec 34  |]d;╒AR`┬Z▌w²╫≈Л4|
00000050  78 42 4f fc 7a 68 91 41  3d 57 92 11 3e 50 be 99  |xBOЭzh▒A=W▓.>P╬≥|
00000060  5d 42 6d 54 4c 48 b2 b7  5e 87 3e f1 c5 d1 f1 82  |]BmTLH╡╥^┤>ЯеяЯ┌|
00000070  fd 78 79 d5 a0 64 1a 53  1d 6d 4b 36 f8 5f 26 ef  |Щxyу═d.S.mK6Ь_&О|
00000080  eb 71 fd f5 f8 97 5d e1  d0 87 a8 d3 ff 20 59 72  |КqЩУЬ≈]Ап┤╗сЪ Yr|
00000090  58 94 5d 4a 56 41 f0 40  06 e1 12 09 f6 1b ad 92  |X■]JVAП@.А..Ж.╜▓|
000000a0  59 c2 8c 8a 7c e6 32 91  cf 9f 09 67 fd 0a 22 3a  |Yб▄┼|Ф2▒о÷.gЩ.":|
...

and here is part of the original file (I'm not sure whether it really is the original, but it very likely is):

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
    <meta name="description" content="статьи">
    <meta name="keywords" content="статьи">
    <title>Russia on the Net &mdash; статьи</title>
</head>
<link rel="stylesheet" href="/rus/style.css">
<body bgcolor="#FFFFFF">
<center>
...

The compressed file is 3397 bytes and the original is about 8913 bytes. The original compresses to 3281 bytes with bzip2, 3177 bytes with gzip, 2990 bytes with lzma, 3082 bytes with 7z, and 3291 bytes with zip.

Update: According to a Chrome Opera Mini extension (http://ompd-proxy.narod.ru/distrib/opera_mini_proxy.crx; unpack it with 7-Zip), Opera Mini uses webodf/src/core_RawInflate.js to unpack data. Can this file help me?

Hm, why do you want to do so? Simply turn off Opera Turbo and load the page again to get it uncompressed :) – hallvors, Aug 01 '11 at 13:08
This is an offline task, where there is no chance to reload the page. – osgx, Aug 01 '11 at 13:20
@Jitamaro, no, it does not recognize the format and asks me to select an application to open it. – osgx, Aug 01 '11 at 17:03
Could you please upload the example file `opr00003.tmp` somewhere for further analysis? – schnaader, Aug 04 '11 at 23:28
No, I can't upload the 0003.tmp now, but I reran Turbo on a file whose contents I know exactly and sniffed the traffic with Wireshark. Wireshark was able to unpack it: "Content-encoded entity body (deflate): 221 bytes -> 250 bytes". Here: http://pastebin.com/asW4chbn – osgx, Aug 04 '11 at 23:57

schnaader · Accepted Answer · 2011-08-04T23:35:08.160

5

The first two bytes, 78 DA, are a valid 2-byte zLib header (see RFC 1950, section 2.2, on CMF and FLG) that precedes deflate-compressed data. So the file could be compressed using zLib/deflate.
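
As a rough illustration of that header check, here is a minimal Python sketch that decodes the two bytes according to RFC 1950. The function name `parse_zlib_header` is made up for this example; only the field layout comes from the RFC.

```python
def parse_zlib_header(data: bytes) -> dict:
    """Decode the two-byte zlib header (CMF, FLG) from RFC 1950, section 2.2."""
    if len(data) < 2:
        raise ValueError("need at least two bytes")
    cmf, flg = data[0], data[1]
    return {
        "method": cmf & 0x0F,                       # CM: 8 means deflate
        "window_bits": (cmf >> 4) + 8,              # CINFO is log2(window size) - 8
        "fcheck_ok": ((cmf << 8) | flg) % 31 == 0,  # CMF*256 + FLG must be a multiple of 31
        "preset_dict": bool(flg & 0x20),            # FDICT flag
        "level": flg >> 6,                          # FLEVEL: 0 (fastest) to 3 (maximum)
    }


# The first two bytes of the cache file from the question.
header = parse_zlib_header(bytes.fromhex("78da"))
print(header)
```

For 78 DA this reports deflate (method 8), a 32 KB window, the maximum-compression level hint, no preset dictionary, and a valid check value. A valid header only means the bytes look like zLib; it does not prove that standard deflate data follows.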

For a first quick test, you can use my command-line tool Precomp like this:

precomp -v -c- -slow opr00003.tmp

It reports any zLib-compressed streams it finds and how large they are when decompressed ("... can be decompressed to ... bytes"). If this succeeds (that is, it reports a decompressed size close to the original file size you know), use your favourite programming language along with the zLib library to decompress your data.

Also note that if you're lucky, the stream (or a part of it) can be recompressed bit-to-bit identical by Precomp and the output file opr00003.pcf contains (a part of) the decompressed data preceded by a small header.

EDIT: As osgx commented and further analysis showed, the data cannot be decompressed using zLib/deflate, so this is still an unsolved case.

EDIT2: The update, and especially the linked JS, show that it is deflate, but it seems to be some custom variant. Comparing it with the original code, as well as with the original zLib source code, could help.

Additionally, the JS code could of course be used to try to decompress the data. It doesn't seem to handle the 2-byte header, though, so perhaps those two bytes have to be skipped.
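
A minimal sketch of how such attempts could be scripted with Python's standard `zlib` module. The helper name `try_inflate` is hypothetical, and nothing here is specific to Opera Turbo: it only tries the standard zLib framing, raw deflate, and raw deflate after skipping the two header bytes. A custom deflate variant would be expected to fail all three. The sample input is synthetic, not the cache file from the question.

```python
import zlib


def try_inflate(data: bytes) -> dict:
    """Attempt several standard deflate framings and record what each one yields."""
    attempts = {
        "zlib stream": (data, zlib.MAX_WBITS),
        "raw deflate": (data, -zlib.MAX_WBITS),
        "raw deflate, 2-byte header skipped": (data[2:], -zlib.MAX_WBITS),
    }
    results = {}
    for name, (payload, wbits) in attempts.items():
        decoder = zlib.decompressobj(wbits)
        try:
            # A decompressobj returns partial output for truncated input
            # instead of raising, which helps with incomplete cache files.
            results[name] = decoder.decompress(payload)
        except zlib.error as exc:
            results[name] = exc
    return results


# Synthetic demo data; replace with the bytes of the real cache file.
sample = zlib.compress(b"<p>text</p>" * 50, 9)
for name, outcome in try_inflate(sample).items():
    if isinstance(outcome, Exception):
        print(name, "-> failed:", outcome)
    else:
        print(name, "->", len(outcome), "bytes")
```

Negative `wbits` values tell zlib to expect raw deflate without the header and Adler-32 trailer, which is the closest standard equivalent to what a "RawInflate" routine does.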

edited Aug 04 '11 at 23:35

answered Aug 04 '11 at 19:58

schnaader


I'm eager to hear if this solves the problem, I hope @osgx lets us know.... – Dan Aug 04 '11 at 20:33
But the `file` utility was not able to detect the format, even with all the magic inside it. – osgx Aug 04 '11 at 20:40
@osgx: Well, as I said, the header in this case is only 2 bytes long and, in theory, any 2 bytes that form a value divisible by 31 are a valid header. If `file` detected such files as `zLib compressed data`, there would be many false detections. – schnaader Aug 04 '11 at 20:47
No luck, even with -brute: Recompressed streams: 0/22 Brute mode streams: 0/22 – osgx Aug 04 '11 at 20:51
Yes, I also ran the data you posted in your question through a zLib stream analyzer I wrote, and while the header is correct (highest compression, correct FCHECK), the Huffman data that follows doesn't make sense at all. This means the data wasn't compressed using zLib/deflate, or it is somehow encrypted after the header (but I doubt this). – schnaader Aug 04 '11 at 21:02
Could it be some HTTP transport encoding, like "Transfer-Encoding: chunked"? – osgx Aug 04 '11 at 22:11
Something like this could be possible, but at least chunked encoding would be easy to identify by the ASCII numbers used which don't seem to be in your example. – schnaader Aug 04 '11 at 23:27

score 3 · Answer 2 · answered Aug 04 '11 at 21:38

There are different file types in the Opera Turbo cache. The first is the one cited in the question; some files are stored unpacked (CSS and JS); and there is a Z-packed, tar-like multi-file archive for images (VP8, detected by the plain-text RIFF, WEBP, and VP8 magics).

Example of a Z-packed file header:

 5a 03 01 1c 90 02 0a 22 03 18 2a (RIFF data first img) (RIFF data second img)
 (RIFF data third img)

The RIFF container is clearly visible and it has a length field, so I suggest the following description:

 5a - magic of format
    03 - number of files
       01 - first file (riff size=0x1c90)
          1c 90 - big-endian len of first file
                02 - second file (riff size=0x0a22)
                   0a 22 - len of second file
                         03 - third file (riff size=0x182a)
                            18 2a - len of third file
                                  52 49 46 46 == "RIFF" magic of first file
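
A minimal Python sketch of a parser for this guessed layout. Everything about the format is an assumption taken from the description above: the 0x5a magic, a one-byte file count, then a one-byte index and a two-byte big-endian length per file, with the file data stored back to back after the table. The names `parse_z_header` and `split_z_archive` are made up for this example.

```python
import struct


def parse_z_header(data: bytes):
    """Parse the guessed table: 'Z', file count, then (index, big-endian u16 length) per file."""
    if len(data) < 2 or data[0] != 0x5A:
        raise ValueError("missing 0x5a magic")
    count = data[1]
    entries = []
    offset = 2
    for _ in range(count):
        index, length = struct.unpack_from(">BH", data, offset)
        entries.append((index, length))
        offset += 3
    return entries, offset  # offset is where the first file's data would begin


def split_z_archive(data: bytes):
    """Slice out each file, assuming the payloads follow the table back to back."""
    entries, offset = parse_z_header(data)
    files = []
    for _, length in entries:
        files.append(data[offset:offset + length])
        offset += length
    return files


# Header bytes from the three-image example above.
entries, payload_start = parse_z_header(bytes.fromhex("5a03011c90020a2203182a"))
print(entries, payload_start)
```

Whether the stored length counts the whole embedded file or only the RIFF payload is not confirmed here, so the slices from `split_z_archive` should be checked against the RIFF or JFIF magics before trusting them.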

Another example of a Z-file, this one with JPGs (the "JFIF" magic is visible; the ffd8ff JPEG marker does not show up in the ASCII column; 8 files inside):

0000000: 5a08 0118 de02 1cab 0308 0804 162c 0531  Z............,.1
0000010: 4d06 080f 070a 4608 0964"ffd8 ffe0 0010  M.....F..d......
0000020: 4a46 4946 0001 0101 0060 0060 0000 ffdb  JFIF.....`.`....

Another type of file, detected by `file`, is the "<000"-file, with an example header of (hex) "1f 8b 08 00 00 00 00 00 02 ff ec 52 cb 6a c3 30 10 fc 15 63". `file` says it is "gzip compressed data, max compression", and any gzip tool can unpack it.
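
For completeness, a small Python sketch of the same check using the standard `gzip` module. The helper name `gunzip_if_gzip` is hypothetical; it only looks for the 1f 8b magic bytes shown above and leaves anything else untouched.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_if_gzip(data: bytes) -> bytes:
    """Return the decompressed payload if data starts with the gzip magic, else data unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


# Synthetic round trip; replace with the bytes of a real cache file.
packed = gzip.compress(b"<p>hello</p>")
print(gunzip_if_gzip(packed))
```

In the example header, the ninth byte (02) is the gzip XFL field, which is what `file` reports as "max compression".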
