Finding percentage of similarity among some text files

Question

I have made a program in C++ which generates a text file based on a sequence of values of an integer variable v varying between 1 and 100. The text file format is as follows:

file1.txt
1 2 3 4 5

file2.txt
4 5 6 7 8

file3.txt
8 4 5 7 1

.......

Say I have generated 100 text files (file1.txt to file100.txt). I want to check the text files one by one and find the percentage of similarity between them. I don't want to check all 100 files; rather, I want to stop checking once several consecutive files give more or less the same result.

How can I perform the similarity check? Say I have calculated the percentage of similarity between file1 and file2. Should I next calculate the similarity for file2 and file3, or for file1 and file3, and so on? To be more precise, what is the logic for performing this check?

How do you consider similarity? Is it share some letters ? Share some words ? Or is order relevant as well ? E.g. Is 12345 more similar to 54321 or to 12457 ? — Christophe, Nov 08 '16 at 07:00
12345 more similar to 54321. by similarity, i mean to say both files will contain maximum same digits @Christophe — , Nov 08 '16 at 08:34
Why is there the tag standard-deviation ? Do you intend to do statistical calculation of standard deviations ? — Christophe, Nov 08 '16 at 20:19
You'll have to define "similar" more precisely, and you also need to define what you want your output to be. Do you want to know which file is most similar to file1, which is most similar to file2, etc., so that you'd have a list that shows each file and which file is most similar to it? What do you do if files 1, 17, and 26 are identical? What if file 1 contains `12345`, and none of the other files contain any of those numbers? Without clearly specifying your output, any advice we give you will likely be wrong. — Jim Mischel, Nov 08 '16 at 20:25

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

According to your comments, the degree of similarity is based on the number of values the two files have in common, regardless of their order.

Similarity between two files

The easiest way to do it is to load two consecutive files (say, two open ifstreams sfs1 and sfs2) into two vectors:

    vector<int> v1, v2;   // both start empty

    copy(istream_iterator<int>(sfs1), istream_iterator<int>(), back_inserter(v1));
    copy(istream_iterator<int>(sfs2), istream_iterator<int>(), back_inserter(v2));

Sort the vectors:

    sort(v1.begin(), v1.end()); 
    sort(v2.begin(), v2.end());

Then compute the intersection of the two sorted vectors using the standard algorithm:

    vector<int> sim;      // receives the values present in both vectors
    set_intersection(v1.cbegin(), v1.cend(), v2.cbegin(), v2.cend(), back_inserter(sim));

You then just have to look at the sizes:

    cout << "Similar elements: " << sim.size()<<endl; 
    cout << "Similarity coefficient: "<< (double)sim.size()/max(v1.size(), v2.size())*100 <<"%"<<endl;

You still need to add some error handling for the case where both vectors are empty: the expression above would then divide 0 by 0, which with IEEE 754 doubles yields NaN rather than a percentage.
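One way to fold that check into the calculation is sketched below. The helper name similarity_percent is hypothetical, and returning 0% for two empty inputs is an assumption; you may prefer to treat two empty files as identical or to report an error instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Percentage of values shared by two lists, ignoring order.
// The vectors are taken by value so the caller's data is not reordered.
// similarity_percent is a hypothetical helper name, not part of any library.
double similarity_percent(std::vector<int> a, std::vector<int> b)
{
    // Assumption: two empty inputs count as 0% similar (avoids 0 / 0).
    if (a.empty() && b.empty())
        return 0.0;

    // set_intersection requires both ranges to be sorted.
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    std::vector<int> common;
    std::set_intersection(a.cbegin(), a.cend(), b.cbegin(), b.cend(),
                          std::back_inserter(common));

    // Same ratio as above: shared values relative to the larger list.
    const std::size_t larger = std::max(a.size(), b.size());
    return 100.0 * static_cast<double>(common.size()) / static_cast<double>(larger);
}
```

With the sample data from the question, file1 (1 2 3 4 5) and file2 (4 5 6 7 8) share the two values 4 and 5, so the function returns 40.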

The same steps also work with stringstreams instead of file streams, which is convenient for a quick demo.
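As a stand-in sketch (not the demo from the original answer), the following reads the numbers from strings through std::istringstream. The function names load_sorted and common_values are made up for this illustration. Because load_sorted accepts any std::istream, the same code should work unchanged with an open std::ifstream.

```cpp
#include <algorithm>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Reads every whitespace-separated integer from a stream and returns them sorted.
// load_sorted is a hypothetical helper; it accepts ifstream and istringstream alike.
std::vector<int> load_sorted(std::istream& in)
{
    std::vector<int> values;
    std::copy(std::istream_iterator<int>(in), std::istream_iterator<int>(),
              std::back_inserter(values));
    std::sort(values.begin(), values.end());
    return values;
}

// Returns the values that appear in both texts, in ascending order.
// The two strings stand in for the contents of two files.
std::vector<int> common_values(const std::string& text1, const std::string& text2)
{
    std::istringstream s1(text1);
    std::istringstream s2(text2);
    const std::vector<int> v1 = load_sorted(s1);
    const std::vector<int> v2 = load_sorted(s2);

    std::vector<int> sim;
    std::set_intersection(v1.cbegin(), v1.cend(), v2.cbegin(), v2.cend(),
                          std::back_inserter(sim));
    return sim;
}
```

For example, common_values("1 2 3 4 5", "4 5 6 7 8") yields the two values 4 and 5, and the size of the result is the "similar elements" count used in the percentage.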

How to do it for several files

According to your question, you don't need to look for similarities in each possible pair of files, but only between subsequent files.

So after you have compared the first two files, you just need to copy v2 into v1, clear v2 and sim (back_inserter appends to whatever is already there), read the next file into v2 and sort it. Then calculate the new similarity.

You also need a counter for consecutive near-matches. Increment it every time the similarity is above a certain threshold (e.g. 90%), and reset it to 0 every time the similarity falls below. As soon as the counter reaches the number of consecutive near-matches you require, just stop :-)
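The loop described above could look roughly like this. It is a sketch under a few assumptions: the file contents are passed in as in-memory vectors (files[i] stands in for the numbers read from the i-th file), the names percent_shared and files_examined are hypothetical, and a similarity exactly equal to the threshold counts as a near-match.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Percentage of shared values between two SORTED vectors (0 if both are empty).
static double percent_shared(const std::vector<int>& a, const std::vector<int>& b)
{
    const std::size_t larger = std::max(a.size(), b.size());
    if (larger == 0)
        return 0.0;
    std::vector<int> common;
    std::set_intersection(a.cbegin(), a.cend(), b.cbegin(), b.cend(),
                          std::back_inserter(common));
    return 100.0 * static_cast<double>(common.size()) / static_cast<double>(larger);
}

// Compares each file with the one before it and stops once `needed`
// consecutive comparisons reach `threshold` percent.
// Returns how many files were examined before stopping.
std::size_t files_examined(std::vector<std::vector<int>> files,
                           double threshold, std::size_t needed)
{
    if (files.empty())
        return 0;

    std::vector<int> prev = std::move(files[0]);
    std::sort(prev.begin(), prev.end());

    std::size_t streak = 0;  // consecutive near-matches seen so far
    for (std::size_t i = 1; i < files.size(); ++i) {
        std::vector<int> cur = std::move(files[i]);
        std::sort(cur.begin(), cur.end());

        if (percent_shared(prev, cur) >= threshold)
            ++streak;        // another near-match in a row
        else
            streak = 0;      // the run is broken, start over

        if (streak >= needed)
            return i + 1;    // files 1 .. i+1 were read, stop here

        prev = std::move(cur);  // the current file becomes the reference
    }
    return files.size();     // never stabilised: every file was examined
}
```

In a real program each vector would be filled from the next file inside the loop instead of being held in memory up front, so that files after the stopping point are never opened.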
