My goal is to reduce the time needed to look at specific sections in the middle of very large log files compressed in .xz format.
If an .xz file is, for example, 6 GB compressed and 60 GB uncompressed, then even a simple command like xzcat <file> | tail -1
to look at the last line of the uncompressed file means waiting many minutes for the entire file to be decompressed.
From reading https://stackoverflow.com/a/34053829/12132601, my understanding is that .xz files are organised into blocks, and that it is possible to decompress a specific block if you can find the right starting position and length to take from the file. However, I could not follow this part of the answer:
You can get the list of block offsets with xz --verbose --list FILE.xz. If you want the last block, you need its compressed size (column 5) plus 36 bytes for overhead (found by comparing the size to
hd big.log.sp0.xz |grep 7zXZ
). Fetch that block using tail -c and pipe that through xz. Since the above question wants the last line of the file, I then pipe that through tail -n1:
SIZE=$(xz --verbose --list big.log.sp.xz |awk 'END { print $5 + 36 }')
tail -c $SIZE big.log.sp.xz |unxz -c |tail -n1
Specifically, I do not understand the 36 bytes of overhead or how the author arrived at that number:
plus 36 bytes for overhead (found by comparing the size to hd big.log.sp0.xz |grep 7zXZ)
I've been reading https://tukaani.org/xz/xz-file-format.txt but could not follow much of it, and I did not find out where the 36 comes from.
36 definitely did NOT work with my file. I actually tried every value from 1 to 100 and none of them worked.
The first 3 lines of my file look like this with hd:
00000000 fd 37 7a 58 5a 00 00 04 e6 d6 b4 46 04 c0 e2 c3 |.7zXZ......F....|
00000010 39 80 80 80 08 21 01 14 00 00 00 00 3e 0b 39 68 |9....!......>.9h|
00000020 e9 e2 3f f0 00 5d 00 18 8d 82 f9 18 7b b2 75 c6 |..?..]......{.u.|
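For reference, this is how I read those first bytes against the format spec. It is only a sketch of my understanding: the variable names are my own, and I am assuming the 12-byte Stream Header layout (6 magic bytes, 2 Stream Flags bytes, 4 CRC32 bytes) and the Block Header rule "real size = (encoded size byte + 1) * 4" from xz-file-format.txt.

```shell
# Bytes copied from the hd output above.
stream_header="fd 37 7a 58 5a 00 00 04 e6 d6 b4 46"   # offsets 0 to 11
block_header_size_byte=04                              # offset 12

# Split the header into one positional parameter per byte and count them.
# Assumption: the first block starts right after these bytes.
set -- $stream_header
stream_header_len=$#

# Assumption from the spec: the second Stream Flags byte (the 8th byte)
# identifies the check type, and 0x04 means CRC64.
check_id=$8

# Assumption from the spec: real header size = (encoded byte + 1) * 4.
block_header_len=$(( (0x$block_header_size_byte + 1) * 4 ))

echo "stream header length: $stream_header_len bytes"
echo "check id:             0x$check_id"
echo "block header length:  $block_header_len bytes"
```

If that reading is right, the first block starts at offset 12 and its header is 20 bytes long.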
And the first few lines of xz -lvv <myxzfile> look like this:
<myxzfile> (1/1)
Streams: 1
Blocks: 4,080
Compressed size: 5,789.9 MiB (6,071,150,860 B)
Uncompressed size: 63.7 GiB (68,443,750,160 B)
Ratio: 0.089
Check: CRC64
Stream padding: 0 B
Streams:
Stream Blocks CompOffset UncompOffset CompSize UncompSize Ratio Check Padding
1 4,080 0 0 6,071,150,860 68,443,750,160 0.089 CRC64 0
Blocks:
Stream Block CompOffset UncompOffset TotalSize UncompSize Ratio Check CheckVal Header Flags CompSize MemUsage Filters
1 1 12 0 942,592 16,777,216 0.056 CRC64 e77988a5264b499e 20 cu 942,562 5 MiB --lzma2=dict=4MiB
1 2 942,604 16,777,216 887,748 16,777,216 0.053 CRC64 b1124241f57be325 20 cu 887,718 5 MiB --lzma2=dict=4MiB
1 3 1,830,352 33,554,432 836,008 16,777,216 0.050 CRC64 0b9ed8b7bd1be895 20 cu 835,978 5 MiB --lzma2=dict=4MiB
1 4 2,666,360 50,331,648 893,172 16,777,216 0.053 CRC64 4399327c125c6a13 20 cu 893,144 5 MiB --lzma2=dict=4MiB
1 5 3,559,532 67,108,864 757,964 16,777,216 0.045 CRC64 908e32d2276f5b4b 20 cu 757,933 5 MiB --lzma2=dict=4MiB
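The columns of this listing do at least seem to be consistent with each other. Here is a quick arithmetic check that uses only numbers copied from the listing; the padding rule (compressed data padded to a multiple of 4 bytes) and the 8-byte size of a CRC64 check are my reading of the spec, and the variable names are mine.

```shell
# Values copied from the xz -lvv listing (blocks 1 to 3).
comp_offset_1=12
total_size_1=942592
total_size_2=887748
header_3=20
comp_size_3=835978
check_size=8   # assumption: a CRC64 check occupies 8 bytes

# Each block appears to start where the previous one ends.
comp_offset_2=$((comp_offset_1 + total_size_1))
comp_offset_3=$((comp_offset_2 + total_size_2))

# Assumption: TotalSize = Header + CompSize + padding to a multiple of 4 + check.
unpadded=$((header_3 + comp_size_3))
padding=$(( (4 - unpadded % 4) % 4 ))
total_size_3=$((unpadded + padding + check_size))

echo "block 2 CompOffset: $comp_offset_2"
echo "block 3 CompOffset: $comp_offset_3"
echo "block 3 TotalSize:  $total_size_3"
```

The three results agree with the CompOffset and TotalSize values that xz prints for blocks 2 and 3.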
If I want to decompress just the 3rd block, I would naively expect head -c 2666360 2022-06-16T00:00:00.xz | tail -c 836008 | unxz -c
to work (take everything up to the start of block 4, then keep the last 836,008 bytes, which is the TotalSize of block 3), but it does not. What starting position and length should I be taking from the file, and why?
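One idea that might sidestep the problem, sketched here under assumptions I have not verified on the full file: a lone block is not a complete .xz stream (there is no Stream Header, Index or Stream Footer around it), which would explain why unxz rejects it, but the compressed data inside a block should be a plain LZMA2 stream that xz can decode with --format=raw. The xz_block function below is a hypothetical helper of my own, not part of xz. It assumes the block uses a single LZMA2 filter, as the Filters column shows, and it does not verify the CRC64 check.

```shell
# Hypothetical helper (not part of xz): decode one block of an .xz file
# as a raw LZMA2 stream.
# Arguments: file, CompOffset, Header, CompSize (all taken from xz -lvv),
# and the dictionary size from the Filters column.
xz_block() {
    file=$1 comp_offset=$2 header=$3 comp_size=$4 dict=$5
    # Skip the block header. tail -c +N starts at byte N counting from 1,
    # so add 1 to the 0-based offset.
    tail -c +"$((comp_offset + header + 1))" "$file" |
        # Keep only the compressed data, dropping padding and the check.
        head -c "$comp_size" |
        # Raw mode needs the filter chain spelled out explicitly.
        xz --decompress --stdout --format=raw --lzma2=dict="$dict"
}
```

For block 3 of my listing the call would be xz_block 2022-06-16T00:00:00.xz 1830352 20 835978 4MiB. Since tail -c and head -c only read the requested range, this should avoid decompressing the rest of the file.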