I'm migrating from a CMS that stored files in the database to a new system that stores them in AWS S3. So far the only option I can find is to reverse engineer the format from the old system's Java code and reimplement it from scratch in Python, using either that code or RFC 1867 as a reference.
I have database dumps containing long strings of encoded files. I'm not 100% sure which binary-to-text encoding was used for the uploads, but the first characters are consistent within each file type:
    UEsDBBQA

is the first 8 characters of a large number of the DOCX files,

    UEsDBBQABgAIAAAA

is the first 16 characters of more than 75% of the DOCX files, and

    JVBERi0xLj

is the first 10 characters of many of the PDF files.
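One guess I haven't been able to rule out: could these simply be base64? If so, decoding just the prefixes should already expose recognisable magic numbers (DOCX is a ZIP container, so its raw bytes have to start with PK\x03\x04). A minimal check, assuming plain base64:

    import base64

    # The prefixes observed at the start of the dumped strings.
    prefixes = ["UEsDBBQABgAIAAAA", "JVBERi0xLj"]

    for prefix in prefixes:
        # base64 works in 4-character groups, so drop any trailing partial group
        usable = prefix[: len(prefix) // 4 * 4]
        print(prefix, "->", base64.b64decode(usable))

If the first result starts with PK\x03\x04 and the second with %PDF-, that would settle it, but I don't know whether the dumps use plain base64 or some framework-specific wrapper around it.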
Every web application framework that accepts file uploads has to decode content like this, so it's a known problem. But I can't find a way to decode these strings with either Python (my language of choice) or some kind of command-line decoding tool.
The file utility doesn't recognise them, and hachoir doesn't recognise them either.
Are there any simple tools I can just install? I don't care if they're written in C, Perl, Python, Ruby, JavaScript or Malbolge; I just want a tool that takes the encoded string as input (file, stdin, I don't care) and outputs the decoded original files.
Or am I overthinking the decoding algorithm? If it's simpler than it looks, can someone show me how to decode these in pure Python?
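If the answer really is that simple, I'd expect pure Python to look something like the sketch below (untested against the real dumps; the encoding is exactly the thing I'm not sure about, and the output file name is made up):

    import base64
    import sys

    # Sketch only: assumes the dump column is plain base64 (unconfirmed).
    # Reads one encoded string on stdin, writes the raw bytes to a file.
    encoded = sys.stdin.read().strip()

    # validate=True raises binascii.Error on characters that aren't valid
    # base64, instead of silently discarding them, so it's a cheap way
    # to test the guess.
    raw = base64.b64decode(encoded, validate=True)

    with open("decoded.bin", "wb") as out:
        out.write(raw)
    print("wrote", len(raw), "bytes")

And on the command line the equivalent would presumably just be base64 -d < dump.txt > decoded.bin (GNU coreutils). I'd be happy with either, once I know the encoding is actually right.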