I had an issue opening a .docx file and I wondered, "what about getting the data straight from the binary?"
My first attempt was to look at the raw bytes of a .docx. The document contains only "Hello World" in Calibri 11.
Below are parts of the content in binary.
From what I could tell, it was compressed or encrypted, and the fourth part of the binary seemed to be the largest.
After researching for a while, I found someone saying ".docx are basically .zip with .xml files inside". So I tried renaming the .docx to .zip and extracting the files, and it worked just fine. But then I thought, "OK, cool, but how do I get that compressed data straight out of the .docx file myself?" So I've been researching how to manually decompress .zip files. I found out that ZIP can handle multiple compression methods. The most common is Deflate, which is built on LZ77 (combined with Huffman coding), so I need to understand that algorithm in order to decompress the data inside the .docx by hand instead of extracting it with a tool. I haven't made much progress yet, but I'm still working through it.
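To make the Deflate part concrete, here is a minimal sketch using Python's standard zlib module. It assumes the data is a raw Deflate stream (no zlib or gzip wrapper), which is the form ZIP uses for compression method 8. The sample bytes are made up for illustration and do not come from a real .docx.

```python
import zlib

# Made-up sample: repeated text gives the LZ77 stage plenty of back-references.
original = b"Hello World " * 20

# wbits=-15 asks zlib for a raw Deflate stream (no zlib header or checksum).
compressor = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=-15)
compressed = compressor.compress(original) + compressor.flush()

# The same negative wbits value is needed to inflate a raw stream.
restored = zlib.decompress(compressed, wbits=-15)

print(len(original), "->", len(compressed), "bytes")
print(restored == original)
```

This only shows that the bytes round-trip. Decoding the bit stream by hand would mean following the Deflate specification (RFC 1951) block by block.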
My plan for solving this problem is:
- Find the section of the file I'm interested in
- Decompress that section
- Get the text
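The three steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the archive is built in memory as a hypothetical stand-in for a real .docx, the text lives in word/document.xml inside w:t elements, the archive is not ZIP64 or encrypted, and entry names are plain ASCII. The helper names (make_sample_docx_bytes, find_entry, inflate, extract_text) are invented for this example.

```python
import io
import struct
import xml.etree.ElementTree as ET
import zipfile
import zlib

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def make_sample_docx_bytes() -> bytes:
    """Hypothetical stand-in for a real .docx: a ZIP with one Deflate-compressed XML part."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        "<w:t>Hello World</w:t></w:r></w:p></w:body></w:document>"
    ).encode("utf-8")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()

def find_entry(data: bytes, wanted: str):
    """Step 1: walk the central directory to locate one entry's compressed bytes."""
    eocd = data.rfind(b"PK\x05\x06")  # end-of-central-directory signature
    if eocd < 0:
        raise ValueError("no end-of-central-directory record found")
    # Entry count, central directory size, central directory offset.
    total, _cd_size, pos = struct.unpack_from("<HII", data, eocd + 10)
    for _ in range(total):
        (sig, _vmade, _vneed, _flags, method, _time, _date, crc, csize, _usize,
         nlen, elen, clen, _disk, _iattr, _eattr, local_off) = struct.unpack_from(
            "<4sHHHHHHIIIHHHHHII", data, pos)
        if sig != b"PK\x01\x02":
            raise ValueError("corrupt central directory")
        name = data[pos + 46 : pos + 46 + nlen].decode("utf-8")
        if name == wanted:
            # The local header carries its own name/extra lengths; skip past them.
            lnlen, lelen = struct.unpack_from("<HH", data, local_off + 26)
            start = local_off + 30 + lnlen + lelen
            return method, crc, data[start : start + csize]
        pos += 46 + nlen + elen + clen
    raise KeyError(wanted)

def inflate(method: int, crc: int, raw: bytes) -> bytes:
    """Step 2: decompress one entry (0 = stored, 8 = Deflate)."""
    if method == 0:
        out = raw
    elif method == 8:
        out = zlib.decompress(raw, wbits=-15)  # negative wbits = raw Deflate
    else:
        raise NotImplementedError(f"compression method {method}")
    if zlib.crc32(out) != crc:
        raise ValueError("CRC mismatch")
    return out

def extract_text(xml_bytes: bytes) -> str:
    """Step 3: join the text of every w:t element."""
    root = ET.fromstring(xml_bytes)
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))

docx_bytes = make_sample_docx_bytes()
method, crc, raw = find_entry(docx_bytes, "word/document.xml")
text = extract_text(inflate(method, crc, raw))
print(text)
```

The sketch reads sizes and offsets from the central directory rather than the local file header, since a local header may leave the sizes at zero when the writer used a trailing data descriptor. The actual inflating is still delegated to zlib, so it covers the file structure, not the Deflate algorithm itself.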
From what I understand about how ZIP compression works, the decompression might have to be done on the whole file.
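One way to check that guess is to list what the archive records for each entry. As far as I can tell from the ZIP format, every entry has its own header offset, compression method, and compressed size, which would suggest each one can be decompressed on its own. Below is a minimal sketch with Python's standard zipfile module, using a made-up two-entry archive rather than a real .docx.

```python
import io
import zipfile

# Made-up archive: one stored entry and one Deflate-compressed entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", b"<Types/>",
                compress_type=zipfile.ZIP_STORED)
    zf.writestr("word/document.xml", b"<w:t>Hello World</w:t>" * 50,
                compress_type=zipfile.ZIP_DEFLATED)

# Reopen the bytes and print the per-entry metadata.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    entries = zf.infolist()
    for info in entries:
        print(info.filename, info.header_offset, info.compress_type,
              info.compress_size, info.file_size)
```

Here compress_type 0 means stored and 8 means Deflate. For a real file, passing its path to zipfile.ZipFile should work the same way.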
TL;DR:
I want to manually decompress the data in a .docx. My objective is not to get the data itself but to understand the compression process and the structure of a .docx, so that I can get from this:
(Compressed part of what I think is the text in the .docx)
To this:
(Uncompressed part of the Text from the extracted document.xml)
Is there an existing method for doing what I'm trying to do? Is it too crazy, given how compression works?