I have a startup project that deals with files. I tried to design an algorithm for it but couldn't. I want to create an algorithm, together with a database, for the following purpose.

I have many files in my database (whether it stores the file itself or a reference to the file doesn't matter). When a new file arrives, I want to compare it with all of my existing files. If it is more than, for example, 80% similar to any of them, I don't want to save it in my database. If it is less than 80% similar to all of them, I want to save it.

Hamid Pourjam
  • Define similarity: byte sequences? Words? Which file types should be supported? Without any code, this is off-topic here. – Cee McSharpface Mar 12 '17 at 21:40
  • @dlatikay Dear friend, my problem is that comparing one file against many takes too long. My files are not restricted to one file type; I want this to work for all file types, but especially PDF and DOCX. – Manoochehr Mojgani Mar 12 '17 at 22:09
  • You might want to look at [near duplicates detection](http://stackoverflow.com/a/23053827/572670), seems to fit well for you. – amit Mar 12 '17 at 22:21
  • @greybeard Dear friend, sorry about that; my English is not good at all :-( – Manoochehr Mojgani Mar 13 '17 at 09:19
  • @amit Thanks, my friend. It was helpful, but it doesn't fully solve my problem, because a document doesn't contain just text: it is a file with images, encoded text, etc. Still, thank you ;-) – Manoochehr Mojgani Mar 13 '17 at 09:24
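
For readers following the near-duplicate suggestion in the comments, here is a minimal sketch of one common technique in that area: split each file into overlapping k-byte "shingles" and compare the resulting sets with Jaccard similarity. This is an assumption about what fits the question, not necessarily what the linked answer describes. The function names `shingles` and `jaccard` and the default `k=4` are hypothetical choices for illustration. It works on raw bytes, so it applies to any file type, though for compressed containers such as DOCX the raw bytes may say little about the visible content.

```python
def shingles(data: bytes, k: int = 4) -> set:
    """Return the set of all overlapping k-byte windows in data."""
    if len(data) < k:
        # Input shorter than one window: treat the whole input as one shingle.
        return {data} if data else set()
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # assumption: two empty files count as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    stored = b"the quick brown fox jumps over the lazy dog"
    incoming = b"the quick brown fox jumped over the lazy dog"
    print(jaccard(shingles(stored), shingles(incoming)))
```

Because the comparison is over sets of windows rather than fixed offsets, a few inserted bytes only disturb the shingles that overlap them.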

1 Answer

Keep three counters, Similarity, Size1 and Size2, all initialized to 0.

First, compare the two files char by char (or bit by bit, or whatever unit you need). For each position, if the two chars are the same, add 1 to Similarity. After each comparison, add 1 to both Size1 and Size2.

Run this comparison until one of the files ends, then count the chars left in the longer file and add that count to its size counter (Size1 or Size2).

Finally, divide Similarity by the size of the file already in your database (whichever of Size1 or Size2 that is) and check whether the result is 0.8 (80%) or more.
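
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: files are already loaded into memory as `bytes`, the function names `positional_similarity` and `should_save` are hypothetical, and the sizes come from `len()` instead of being counted in the loop.

```python
from typing import Iterable


def positional_similarity(stored: bytes, incoming: bytes) -> float:
    """Fraction of the stored file's bytes that equal the incoming byte at the same offset."""
    if not stored:
        # Assumption: an empty stored file matches only another empty file.
        return 1.0 if not incoming else 0.0
    # zip stops when the shorter input ends, i.e. "until one of the files ends".
    similarity = sum(a == b for a, b in zip(stored, incoming))
    return similarity / len(stored)


def should_save(incoming: bytes, stored_files: Iterable[bytes], threshold: float = 0.8) -> bool:
    """Save only if the incoming file is below the threshold against every stored file."""
    return all(positional_similarity(s, incoming) < threshold for s in stored_files)


if __name__ == "__main__":
    print(positional_similarity(b"abcde", b"abcdx"))   # 4 of 5 positions match
    print(should_save(b"abcdx", [b"abcde", b"zzzzz"]))
```

Note that this measure is sensitive to offsets: comparing `b"abcde"` with `b"xabcde"` gives a score of 0, because one inserted byte shifts every later position.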

hope that helps :)

Daniel
  • Thanks for your answer, my friend, but it only works for two files. If I want to compare one file with many files (actually, many files with many files), it takes too much time. Still, thank you for your answer ;-) – Manoochehr Mojgani Mar 12 '17 at 22:15
  • Well, you can hold the file you want to check and compare it with every other file. You can add optimizations: for example, if you compare 10 chars and they are all different, you skip that comparison. – Daniel Mar 12 '17 at 22:21
  • You are right, but someone could just insert a few characters (bytes) into, for example, a PDF file and upload it. If they fall within the range of bytes I am checking, that is a problem. – Manoochehr Mojgani Mar 13 '17 at 09:17
  • Well, maybe you can look into a comparison called "edit distance" and use it to compare your files. That way the byte that breaks the equality is still taken into account, but it does not change the whole rest of the comparison :) although it's not very efficient. – Daniel Mar 14 '17 at 02:23
  • Thanks a lot, it was very effective ;-) – Manoochehr Mojgani Mar 14 '17 at 16:56
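
As a rough illustration of the "edit distance" idea from the comments, here is a minimal sketch of the classic Levenshtein distance over bytes, using two rows of the dynamic-programming table. The function names `edit_distance` and `edit_similarity`, and the choice to normalize by the longer file's length, are assumptions made for this example and not something stated in the thread.

```python
def edit_distance(a: bytes, b: bytes) -> int:
    """Minimum number of single-byte insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row length equal to the shorter input
    previous = list(range(len(b) + 1))
    for i, byte_a in enumerate(a, start=1):
        current = [i]
        for j, byte_b in enumerate(b, start=1):
            cost = 0 if byte_a == byte_b else 1
            current.append(min(
                previous[j] + 1,         # delete byte_a
                current[j - 1] + 1,      # insert byte_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: bytes, b: bytes) -> float:
    """Similarity in [0, 1]: 1 minus the distance divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty files count as identical
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    print(edit_distance(b"abcde", b"xabcde"))    # one inserted byte
    print(edit_similarity(b"abcde", b"xabcde"))
```

A single inserted byte costs one edit here instead of shifting every later position. The running time grows with the product of the two lengths, which matches the remark that this is not very efficient for large files or many pairwise comparisons.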