
I need to duplicate various kinds of file types and change each copy slightly, so that the original's MD5 hash won't match the modified copy's, while keeping the copies readable and not corrupted.

TXT files - that's obvious. I just add a random string to the end of the file.

PDF files - well, I started looking for a Java library to edit PDF files, but then I accidentally opened a PDF file in Notepad++ and thought: why don't I try adding a random string to the end of the unreadable content I see there? To my surprise it worked and the file wasn't corrupted.

ZIP files - I tried the same thing I did with the PDF, and it also worked.
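
For reference, here is a minimal sketch of that append trick in plain Java (standard library only; the file name and random suffix are just placeholders) - this is what worked for me on txt, pdf and zip:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.UUID;

public class AppendRandomSuffix {
    public static void main(String[] args) throws Exception {
        // Placeholder path of the duplicated file (txt, pdf or zip)
        Path copy = Paths.get("copy-of-original.pdf");
        // Append a random string so the copy's MD5 no longer matches the original's
        byte[] suffix = UUID.randomUUID().toString().getBytes("UTF-8");
        Files.write(copy, suffix, StandardOpenOption.APPEND);
    }
}
```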

DOCX - the same method stopped working here. Appending even a single space (" ") to the end of the binary content of a docx file opened in a text editor corrupts the file.

So what I need is:

  1. Java libraries for modifying Office documents: doc, docx, xls, xlsx, ppt, pptx.

  2. There are still file types whose MD5 hash output I need to change, but I don't think they are modifiable in Java - media files, executables, etc. So how can I do the same to these files? Is there a way to just "touch" the file, change a header or something, and make it non-identical to an untouched one?

edit: Ok, here's the motivation - I want to generate a massive amount of data, as I asked here: How to produce massive amount of data?

At the time of that question, the answers I got there were enough, but now they aren't:

  1. I need the data to be non-identical. Pairs of files must fail an MD5 hash comparison.

  2. I can't just generate random strings, because I need to simulate real files and documents.

  3. I can't use existing data dumps, because I need these data sets in various sizes and with various file types. I need something that takes the target size as input and generates the data for me.

So I figured that I should use a starting data set of all the file types that I eventually need, and just duplicate this data set.

AAaa
  • 1.) java libraries for modifying office documents: doc, docx, xls, xlsx, ppt, pptx. Here you go: http://poi.apache.org/ – bpgergo Jan 17 '12 at 17:49
  • Out of curiosity, what's your reason for doing this? You're not adding or changing the _content_ of these files, just changing the md5 hash, thus defeating the use of the md5 hash to detect probable duplicate files. Are you sure there isn't another way to do what you want? – CPerkins Jan 17 '12 at 17:53
  • I edited the question with the motivation – AAaa Jan 17 '12 at 18:14
  • If you need a large dataset of various real files, why can't you just Google [`some data filetype:pdf`](https://www.google.com/search?btnG=1&pws=0&q=some+data+filetype%3Apdf) them and batch download? – Tomasz Nurkiewicz Jan 17 '12 at 18:21
  • @Tomasz Nurkiewicz - 1. I need various file types, not just pdf. 2. I need something that will automatically generate the datasets for a given target size. I can't rely on the internet, so a generator that downloads the amount I need won't suit. – AAaa Jan 17 '12 at 18:25
  • @AAaa: I still claim that you can easily find any file type of almost any size on the Internet, but I get your point. – Tomasz Nurkiewicz Jan 17 '12 at 18:27

1 Answer

  1. java libraries for modifying office documents: doc, docx, xls, xlsx, ppt, pptx.

Apache POI is used to modify MS Office files. Note that newer formats (xlsx, docx, etc.) are simply ZIP files containing XML. Unzipping them and modifying plain text XML might work as well.
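
As a sketch of the POI route (assuming the poi-ooxml module is on the classpath; file names are placeholders), loading a copied docx, appending an empty paragraph and saving it is a harmless change that is enough to make the MD5 differ:

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class TouchDocx {
    public static void main(String[] args) throws Exception {
        // Load the copied document fully into memory
        FileInputStream in = new FileInputStream("copy.docx");
        XWPFDocument doc = new XWPFDocument(in);
        in.close();

        // Append an empty paragraph - invisible to the reader, but it changes the bytes
        doc.createParagraph();

        // Write the result; its MD5 no longer matches the original's
        FileOutputStream out = new FileOutputStream("copy-modified.docx");
        doc.write(out);
        out.close();
    }
}
```

POI has analogous classes for the other formats, e.g. HSSFWorkbook/XSSFWorkbook for xls/xlsx and XMLSlideShow for pptx.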

The same advice goes for ZIP files: try unzipping one and modifying the easiest file inside.
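
A variation of that idea, sketched below with only the JDK's java.util.zip (file names are placeholders): instead of modifying an existing file, repack the archive entry by entry and add one tiny extra entry - every original file stays readable, but the archive's MD5 changes.

```java
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class RepackZip {
    public static void main(String[] args) throws Exception {
        try (ZipFile source = new ZipFile("original.zip");
             ZipOutputStream target = new ZipOutputStream(new FileOutputStream("copy.zip"))) {
            byte[] buffer = new byte[8192];
            // Copy every existing entry unchanged so the archive stays valid
            Enumeration<? extends ZipEntry> entries = source.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                target.putNextEntry(new ZipEntry(entry.getName()));
                if (!entry.isDirectory()) {
                    InputStream in = source.getInputStream(entry);
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        target.write(buffer, 0, read);
                    }
                    in.close();
                }
                target.closeEntry();
            }
            // One tiny extra entry is enough to change the archive's MD5
            target.putNextEntry(new ZipEntry("md5-salt.txt"));
            target.write("x".getBytes("UTF-8"));
            target.closeEntry();
        }
    }
}
```

Since docx/xlsx/pptx are ZIP containers too, the same repacking may work for them as well, although adding an unreferenced entry could upset stricter Office validators.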

But what are you actually trying to achieve? Note that randomly appending some string at the end of a file works only by chance. On another computer, or with another version of the software, the file might be considered corrupted...

I would advise you either to store some metadata external to the file rather than comparing MD5 hashes, or to look deeper into the file formats. There are almost always headers and various pieces of metadata hidden in the file (ID3 tags in MP3, EXIF in images, etc.). It is much safer to modify those instead.
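
For illustration, here is a hypothetical sketch of the metadata route for MP3. It assumes the copied file actually ends with an ID3v1 tag (many files only carry ID3v2 at the front, in which case a tagging library is a safer choice) and flips one byte of the 30-byte comment field in place:

```java
import java.io.RandomAccessFile;

public class TweakId3v1 {
    public static void main(String[] args) throws Exception {
        // ID3v1: the last 128 bytes of the MP3, starting with the marker "TAG"
        try (RandomAccessFile mp3 = new RandomAccessFile("copy.mp3", "rw")) {
            long tagStart = mp3.length() - 128;
            byte[] marker = new byte[3];
            mp3.seek(tagStart);
            mp3.readFully(marker);
            if (marker[0] == 'T' && marker[1] == 'A' && marker[2] == 'G') {
                // Overwrite the first byte of the comment field (offset 97 within the tag)
                mp3.seek(tagStart + 97);
                mp3.write('x');
            } else {
                System.out.println("No ID3v1 tag found - skipping this file");
            }
        }
    }
}
```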

Also look for reserved/unused bytes - they are quite common. But again - why are you doing this in the first place?

Tomasz Nurkiewicz
  • I edited the question with the motivation. I'm not the one who used MD5 to compare the files. I have a given app that gets files as input and produces an output; it uses MD5 to remove duplicate files. – AAaa Jan 17 '12 at 18:17