
I'm migrating from a CMS that stored files in the database to a system that stores them in AWS S3. I can't seem to find any option other than reverse engineering the format from the Java code of the old system and implementing it all myself from scratch in Python, using either the Java code or RFC 1867 as a reference.

I have database dumps containing long strings of encoded files. I'm not 100% sure which binary file upload encoding was used, but the first characters are consistent within each file type.

  • UEsDBBQA is the first 8 characters of a large number of the DOCX files, and UEsDBBQABgAIAAAA is the first 16 characters of more than 75% of the DOCX files.
  • JVBERi0xLj is the first 10 characters of many of the PDF files.

Every web application framework that allows file uploads has to decode these, so it's a known problem. But I cannot find a way to decode these strings with either Python (my language of choice) or some kind of command line decoding tool.

file doesn't recognise them.

hachoir doesn't recognise them.

Are there any simple tools I can just install? I don't care if they are in C, Perl, Python, Ruby, JavaScript or Malbolge. I just want a tool that can take the encoded string as input (file, stdin, I don't care) and output the decoded original files.

Or am I overthinking the algorithm needed to decode these files? Is it simpler than it looks, and can someone show me how to decode them using pure Python?

Techdragon
  • The most commonly used encoding algorithm to represent binary data as text is Base64. I just did a quick test on a PDF file and I got exactly the same header character sequence when Base64-encoding it. So, you're basically looking for a Base64 decoder. – BalusC Jul 16 '15 at 06:58
  • @BalusC That's the ticket. If you post your advice about the Base64 encoding as an answer, I can accept it as the answer to this question. – Techdragon Jul 16 '15 at 07:07

1 Answer


The most commonly used encoding algorithm to represent binary data as text is Base64. I just did a quick test on a PDF file in Java and I got exactly the same header character sequence when Base64-encoding it.

import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.bind.DatatypeConverter;

byte[] bytes = Files.readAllBytes(Paths.get("/test/test.pdf"));
String base64 = DatatypeConverter.printBase64Binary(bytes);
System.out.println(base64.substring(0, 10)); // JVBERi0xLj

So, you're most likely looking for a Base64 decoder.
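The same check can be run in the other direction. The following is a minimal Python sketch, assuming the prefixes quoted in the question really are Base64: it decodes them with the standard library `base64` module and prints the leading bytes. The variable names are only illustrative. The PDF prefix is cut to 8 characters because Base64 decodes in groups of 4.

```python
import base64

# Prefixes quoted in the question. Base64 works on groups of 4 characters,
# so the 10-character PDF prefix is truncated to its first 8 characters.
docx_prefix = "UEsDBBQA"
pdf_prefix = "JVBERi0xLj"[:8]

docx_head = base64.b64decode(docx_prefix)
pdf_head = base64.b64decode(pdf_prefix)

print(docx_head)  # b'PK\x03\x04\x14\x00'
print(pdf_head)   # b'%PDF-1'
```

`PK\x03\x04` is the ZIP local file header signature (a DOCX file is a ZIP archive) and `%PDF-` is how a PDF file starts, which is consistent with the Base64 guess.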

I don't do Python, so here's a Google search suggestion and the first Stack Overflow link which appeared in the search results to date: Python base64 data decode.
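For reference, a minimal sketch of what the decoding step might look like with Python's standard library `base64` module. The function name `decode_to_file` and the whitespace stripping are assumptions for illustration (dumps sometimes wrap long values across lines); nothing here is specific to the CMS in question.

```python
import base64
import binascii
from pathlib import Path


def decode_to_file(encoded: str, out_path: str) -> int:
    """Decode a Base64 string and write the raw bytes to out_path.

    Returns the number of bytes written. Hypothetical helper, not part
    of any library.
    """
    # Assumption: the dump may contain line breaks or spaces inside the
    # encoded value, so all whitespace is dropped before decoding.
    cleaned = "".join(encoded.split())
    try:
        # validate=True rejects characters outside the Base64 alphabet
        # instead of silently skipping them.
        data = base64.b64decode(cleaned, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid Base64: {exc}") from exc
    return Path(out_path).write_bytes(data)
```

Calling `decode_to_file(value_from_dump, "restored.pdf")` should then produce a file that `file` recognises. On systems with GNU coreutils, `base64 --decode` does the same job from the command line.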

BalusC