I want to compare two files or text boxes to find the degree of similarity between them

Question

Sorry for the disturbance. I've removed the code and edited the post...

The real problem is that I'm trying to measure the degree of similarity (or plagiarism) between two texts or files. How can I do that? If you guide me ...

I need the code for the above algorithm to be included in my project.

Using Visual Studio 2013 ... C#

EDITED: OK, so far I've done this ...

        int i = 0;
        int j = 0;
        long lena1 = txtFile1.Text.Length;
        long lenb1 = lena1;
        long len2 = txtFile2.Text.Length;
        string str1 = txtFile1.Text;
        string str2 = txtFile2.Text;
        string str3;
        bool match = false;
        int count = 0;
        int nowords1 = 0;
        int nowords2 = 0;
        string str4;
        int k = 0;
        int m = 0;
        int nowords_match = 0;


        char[] array1 = str1.ToArray();
        char[] array2 = str2.ToArray();
        int[] loc1 = new int[1048576];
        int[] loc2 = new int[1048576];

        while (i < array1.Length)
        {
            if (array1[i] == ' ')
            {
                nowords1++;
                loc1[j] = i;
                j++;
            }

            i++;

        }

        i = j = 0;

        while (i < array2.Length)
        {

            if (array2[i] == ' ')
            {
                nowords2++;
                loc2[j] = i;
                j++;
            }

            i++;

        }

        i = j = 0;
        m = 0;

        for (k = 0; k < loc1.Length-2; k++)
        {
            str3 = str1.Substring(loc1[m], loc1[m + 1] - loc1[m]);
            match = true;

            if (match == true && count > 3)
            {
               txtPlagiarism.Text += " " + loc1[i-3] + loc1[i-2] + " " + loc1[i];
            }

            else
            {
                count = 0;
                match = false;
            }

            j = 0;
            i = 0;

            while (i < nowords2)
            {

                if (j != nowords2)
                {
                    str4 = str2.Substring(loc2[j], loc2[j + 1] - (loc2[j]));
                }

                else
                {
                    break;
                }

                if (str4.Equals(str3)) 
                {
                    nowords_match++;
                    count ++;
                }

                j++;
                i++;

            }

            m++;

        }

I'm just counting the number of matched words so that I can pick that many words from the first file's text and compare them with the suspected copy. But I'm getting a run-time error:

**System.ArgumentOutOfRangeException was unhandled
  HResult=-2146233086
  Message=Length cannot be less than zero.
Parameter name: length
  Source=mscorlib
  ParamName=length
  StackTrace:
       at System.String.InternalSubStringWithChecks(Int32 startIndex, Int32 length, Boolean fAlwaysCopy)
   at System.String.Substring(Int32 startIndex, Int32 length)
   at Calculate_File_Checksum.Form1.btnDetectPlagiairism_Click(Object sender, EventArgs e) in c:\Users\BLOOM\Documents\Visual Studio 2013\App2Test\Calculate_File_Checksum\Calculate_File_Checksum\Form1.cs:line 363
   at System.Windows.Forms.Control.OnClick(EventArgs e)
   at System.Windows.Forms.Button.OnClick(EventArgs e)
   at System.Windows.Forms.Button.OnMouseUp(MouseEventArgs mevent)
   at System.Windows.Forms.Control.WmMouseUp(Message& m, MouseButtons button, Int32 clicks)
   at System.Windows.Forms.Control.WndProc(Message& m)
   at System.Windows.Forms.ButtonBase.WndProc(Message& m)
   at System.Windows.Forms.Button.WndProc(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.OnMessage(Message& m)
   at System.Windows.Forms.Control.ControlNativeWindow.WndProc(Message& m)
   at System.Windows.Forms.NativeWindow.DebuggableCallback(IntPtr hWnd, Int32 msg, IntPtr wparam, IntPtr lparam)
   at System.Windows.Forms.UnsafeNativeMethods.DispatchMessageW(MSG& msg)
   at System.Windows.Forms.Application.ComponentManager.System.Windows.Forms.UnsafeNativeMethods.IMsoComponentManager.FPushMessageLoop(IntPtr dwComponentID, Int32 reason, Int32 pvLoopData)
   at System.Windows.Forms.Application.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)
   at System.Windows.Forms.Application.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)
   at System.Windows.Forms.Application.Run(Form mainForm)
   at Calculate_File_Checksum.Program.Main() in c:\Users\BLOOM\Documents\Visual Studio 2013\App2Test\Calculate_File_Checksum\Calculate_File_Checksum\Program.cs:line 19
   at System.AppDomain._nExecuteAssembly(RuntimeAssembly assembly, String[] args)
   at System.AppDomain.ExecuteAssembly(String assemblyFile, Evidence assemblySecurity, String[] args)
   at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()
  InnerException:**

I don't understand why this is happening, because I've given it the correct values ... Please help, anyone.
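A likely cause, going by the code above: `loc1` and `loc2` are zero-filled arrays of 1,048,576 entries, but only the first `nowords1` / `nowords2` slots ever receive a space position. As soon as `loc1[m + 1]` or `loc2[j + 1]` reads a slot that was never written, its value is 0, so the computed length (0 minus the last space position) is negative and `Substring` throws. One way to sidestep the index bookkeeping is to let `String.Split` produce the words. The following is only a sketch: `WordOverlap`, `Words` and `CountMatchingWords` are made-up names, the case-insensitive comparison is an assumption, and it sticks to syntax that Visual Studio 2013 accepts.

```csharp
using System;
using System.Collections.Generic;

public static class WordOverlap
{
    // Whitespace characters treated as word separators (an assumption).
    private static readonly char[] Separators = { ' ', '\t', '\r', '\n' };

    // Splits text into words; RemoveEmptyEntries drops the empty strings
    // that consecutive separators would otherwise produce.
    public static string[] Words(string text)
    {
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Counts how many words of the first text also occur somewhere in the second.
    public static int CountMatchingWords(string first, string second)
    {
        HashSet<string> secondWords =
            new HashSet<string>(Words(second), StringComparer.OrdinalIgnoreCase);

        int matches = 0;
        foreach (string word in Words(first))
        {
            if (secondWords.Contains(word))
            {
                matches++;
            }
        }
        return matches;
    }
}
```

With the text boxes from the question, this could be called as `int matched = WordOverlap.CountMatchingWords(txtFile1.Text, txtFile2.Text);`. Dividing `matched` by the word count of the first text gives a rough overlap ratio, although it ignores word order.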

Apparently what you need is to design an algorithm for comparing two strings of text. Posting your current application code is pointless because it simply opens two text files. Probably you need to remove the code and the "c#" tag and add the "algorithm" tag. — Goblin Alchemist, Dec 25 '15 at 11:43
How can I check for similarity between two text boxes? Which algorithm should I use, and how? — Blooming Bloom, Dec 25 '15 at 14:02
I have created a class library that *may* be of use, but it won't detect moved pieces very well. It is designed for "diffing" text files; you can find it as a NuGet package called "DiffLib". I have ideas that may work for a more general "find all duplicated portions" diff output, but I have not implemented or published any code for that yet. The source code is available on GitHub: https://github.com/lassevk/DiffLib — Lasse V. Karlsen, Dec 25 '15 at 18:57

score 1 · Answer 1 · edited May 23 '17 at 10:27

There are numerous ways to measure the similarity of strings. Here's an algorithm Martin put together for the Levenshtein distance.
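For reference, the Levenshtein distance is the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. Below is a minimal C# sketch of the textbook dynamic-programming version. It is not necessarily the implementation this answer links to. The names `TextSimilarity`, `LevenshteinDistance` and `Similarity` are illustrative, and dividing by the longer length is just one common way to normalise the distance.

```csharp
using System;

public static class TextSimilarity
{
    // Textbook dynamic-programming Levenshtein distance.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a == null) throw new ArgumentNullException("a");
        if (b == null) throw new ArgumentNullException("b");

        // Only two rows of the DP table are needed at any time.
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning the empty prefix of a into b[0..j) takes j insertions.
        for (int j = 0; j <= b.Length; j++)
        {
            previous[j] = j;
        }

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // deleting i characters yields the empty string
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1,  // insertion
                             previous[j] + 1),    // deletion
                    previous[j - 1] + cost);      // substitution or match
            }

            // Reuse the two arrays instead of allocating a new row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }

        return previous[b.Length];
    }

    // Similarity in [0, 1]: 1.0 for identical strings.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0; // two empty strings are identical
        return 1.0 - (double)LevenshteinDistance(a, b) / longest;
    }
}
```

For the text boxes in the question this could be called as `double score = TextSimilarity.Similarity(txtFile1.Text, txtFile2.Text);`. The running time grows with the product of the two lengths, so for large files it may be more practical to compare word by word or line by line.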

answered Dec 25 '15 at 18:53 by Fidel · edited May 23 '17 at 10:27 by Community

score 0 · Answer 2 · answered Dec 25 '15 at 15:37

In one of my projects I had to detect changes in an array of objects and determine which objects had been inserted and which removed. This algorithm can probably be used for your task. Here is some Java-like pseudocode; you can adapt it for C#.

The simplest thing is to compare strings character-by-character. If you find that it doesn't work well you can try to compare word-by-word, line-by-line or paragraph-by-paragraph.

The idea is:

1. Create two "pointers", each pointing to the start of its string.
2. If the characters (or words, or whatever unit you compare) under the pointers differ, advance both pointers by one position (one character, one word, etc.).
3. As a pointer moves, it stores every item it passes in a map (dictionary).
4. After each step, each pointer checks whether its current character (or word) exists in the other pointer's map.
5. If it does, a coincidence has been found after some number of differences. Mark this character (word) as coincident, increment the count of coincident characters (words), move both pointers to this character (word), and clear both maps so that the search for the next coincidence starts from scratch.
6. Otherwise, move both pointers one step further and repeat the checks.

Beware that a character-by-character search may match isolated characters, so the result may look strange. In any case, comparing arbitrary strings is not an easy task.

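import java.util.TreeMap;

public class Coincidence {
    public static void main(String[] args) {
        String str1 = "...";
        String str2 = "...";

        // i and j are the current positions in str1 and str2.
        int i = -1, j = -1, coincidence = 0;
        // Characters spanned since the last coincidence, mapped to their latest position.
        TreeMap<Character, Integer> map1 = new TreeMap<>();
        TreeMap<Character, Integer> map2 = new TreeMap<>();

        while (true) {
            if (i < str1.length()) i++;
            if (j < str2.length()) j++;

            boolean end1 = (i == str1.length());
            boolean end2 = (j == str2.length());

            if (!end1) map1.put(str1.charAt(i), i);
            if (!end2) map2.put(str2.charAt(j), j);

            // -1 means "not found"; 0 is a valid position, so it cannot be the sentinel.
            int i2 = end2 ? -1 : map1.getOrDefault(str2.charAt(j), -1);
            int j2 = end1 ? -1 : map2.getOrDefault(str1.charAt(i), -1);

            if (i2 != -1 || j2 != -1 || (end1 && end2)) {
                if (i2 != -1) i = i2;        // a match in text 1 takes precedence
                else if (j2 != -1) j = j2;

                if (i2 != -1 || j2 != -1) coincidence++;

                // Start the search for the next coincidence from scratch.
                map1.clear();
                map2.clear();
            }

            if (i == str1.length() && j == str2.length()) break;
        }

        System.out.println("Length of text 1: " + str1.length());
        System.out.println("Length of text 2: " + str2.length());
        System.out.println("Amount of coincidence: " + coincidence);
    }
}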
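Since the question asks for C#, the steps of this answer could be ported roughly as follows. This is a sketch and not the answerer's own code: `CoincidenceCounter` and `Count` are hypothetical names, it accepts any token list (characters, words or lines), and it relies on `TryGetValue` rather than a special index value to signal "not found", so a match at index 0 is not mistaken for a miss. It avoids syntax newer than what Visual Studio 2013 supports.

```csharp
using System;
using System.Collections.Generic;

public static class CoincidenceCounter
{
    // Counts coincidences between two token sequences with the
    // two-pointer, two-map scheme outlined in this answer.
    public static int Count<T>(IList<T> a, IList<T> b)
    {
        int i = -1, j = -1, coincidences = 0;

        // Tokens spanned since the last coincidence, mapped to their latest index.
        Dictionary<T, int> seenA = new Dictionary<T, int>();
        Dictionary<T, int> seenB = new Dictionary<T, int>();

        while (true)
        {
            if (i < a.Count) i++;
            if (j < b.Count) j++;

            bool endA = (i == a.Count);
            bool endB = (j == b.Count);
            if (endA && endB) break; // both sequences are exhausted

            if (!endA) seenA[a[i]] = i;
            if (!endB) seenB[b[j]] = j;

            int matchInA;
            int matchInB;
            if (!endB && seenA.TryGetValue(b[j], out matchInA))
            {
                // b's current token was already passed in a: step a back to it.
                i = matchInA;
                coincidences++;
                seenA.Clear();
                seenB.Clear();
            }
            else if (!endA && seenB.TryGetValue(a[i], out matchInB))
            {
                // a's current token was already passed in b: step b back to it.
                j = matchInB;
                coincidences++;
                seenA.Clear();
                seenB.Clear();
            }
        }

        return coincidences;
    }
}
```

For a word-by-word comparison, something like `CoincidenceCounter.Count(txtFile1.Text.Split(' '), txtFile2.Text.Split(' '))` would work. For a character-by-character comparison, pass `text.ToCharArray()` for each text instead.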

Thanks for the logic, @GoblinAlchemist. It will be a great help, but I need an explanation in terms of C#, if you can guide me. I'm a newbie in C#, so I can't fully understand the concept of the code above. — Blooming Bloom, Dec 25 '15 at 19:00
@BloomingBloom, my algorithm was initially in Java, and I don't have C# at hand to compile and check it. If you have some sort of assignment, then it will be a good idea to convert this code to C# yourself. Feel free to ask any specific questions. — Goblin Alchemist, Dec 28 '15 at 09:46
@BloomingBloom, the concept of my algorithm is outlined above the code itself. It is a fast approximation of the Levenshtein distance: it may report fewer coincidences, but its time complexity is O(n log n) and it requires O(n) memory. Also take a look at Fidel's answer, which links to a working C# implementation of the Levenshtein distance. — Goblin Alchemist, Dec 28 '15 at 09:57
