I am new to Java. Can anyone help me with code to tell how much two text files match each other? Suppose I have two files, 'a.txt' and 'b.txt'; I need to know the percentage of match. Thanks
-
By writing some code that performs that task? Your question is too broad. What specific part do you need help with? – PakkuDon Jun 02 '14 at 13:47
-
Sounds almost like a school assignment. You could use two scanners, one for each file. Then you could compare individual characters to find matches/differences. – Kyte Jun 02 '14 at 13:50
-
Maybe you should start with just two `String` objects in order to narrow your definition of `match`. You should really specify what you would like it to mean: a character match, a word match or even a line match? – Patru Jun 02 '14 at 13:52
-
My guess is character match. I also think this is a homework assignment. I wouldn't solve that on this site, maybe on Yahoo Answers. – EpicPandaForce Jun 02 '14 at 14:03
-
Considering you are seeing your files as basically Strings, you would use the edit distance, normalized to the size of the first file (see the sketch below). ( http://en.wikipedia.org/wiki/Edit_distance ) – njzk2 Jun 02 '14 at 14:04
-
Possible duplicate of [Similarity String Comparison in Java](http://stackoverflow.com/questions/955110/similarity-string-comparison-in-java) – Joe Jun 02 '14 at 14:06
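For reference, the edit-distance approach suggested in the comments could look like the following minimal sketch: a standard dynamic-programming Levenshtein distance. The method name is illustrative, and the similarity is normalized here to the longer string rather than to the first file:

static int editDistance(String s, String t)
{
    int[][] d = new int[s.length() + 1][t.length() + 1];
    for (int i = 0; i <= s.length(); i++) d[i][0] = i; // delete all of s
    for (int j = 0; j <= t.length(); j++) d[0][j] = j; // insert all of t
    for (int i = 1; i <= s.length(); i++)
    {
        for (int j = 1; j <= t.length(); j++)
        {
            int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
            d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                        d[i][j - 1] + 1),  // insertion
                               d[i - 1][j - 1] + cost);    // substitution
        }
    }
    return d[s.length()][t.length()];
}
// similarity as a percentage, e.g.:
// 100.0 * (1 - editDistance(a, b) / (double) Math.max(1, Math.max(a.length(), b.length())))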
4 Answers
Read the two files into two Strings.
Iterate through both, counting characters that match at the same position. Divide the number of matches by the length of the longer string, and multiply by 100 to get a percentage.
// wrapped in a method so the return statement compiles
static double matchPercentage() throws FileNotFoundException {
    Scanner sca = new Scanner(new File("a.txt"));
    Scanner scb = new Scanner(new File("b.txt"));
    StringBuilder sba = new StringBuilder();
    StringBuilder sbb = new StringBuilder();
    while (sca.hasNext()) {
        sba.append(sca.next());
    }
    while (scb.hasNext()) {
        sbb.append(scb.next());
    }
    String a = sba.toString();
    String b = sbb.toString();
    int maxlen = Math.max(a.length(), b.length());
    int matches = 0;
    // compare characters at the same positions until the shorter string ends
    for (int i = 0; i < maxlen; i++) {
        if (a.length() <= i || b.length() <= i) {
            break;
        }
        if (a.charAt(i) == b.charAt(i)) {
            matches++;
        }
    }
    // the unmatched tail of the longer file counts against the percentage
    return ((double) matches / (double) maxlen) * 100.0;
}

-
Trying not to do their homework for them, but I will toss some Java in the answer – Adam Yost Jun 02 '14 at 13:57
-
This: `a += sca.next();` is a *very* bad idea, use `StringBuilder` instead. http://stackoverflow.com/questions/4645020/when-to-use-stringbuilder-in-java – kajacx Jun 02 '14 at 14:12
-
Fixed to use StringBuilder, but that will be my last edit. I think I have made it clear enough that with minimal effort the problem can be solved/optimized. – Adam Yost Jun 02 '14 at 14:16
-
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.StringTokenizer;
class File_meta_Data // class to store the metadata of a file so that scoring can be done
{
String FileName;
long lineNumber;
long Position_In_Line;
long Position_In_Document;
File_meta_Data()
{
FileName = null;
lineNumber = 0;
Position_In_Line = 0;
Position_In_Document = 0;
}
}
public class bluestackv1 {
static int getNumberofInputFiles() // seeks number of resource files from user
{
System.out.println("enter the number of files");
Scanner scan = new Scanner(System.in);
return(scan.nextInt());
}
static List getFiles(int Number_of_input_files) // seeks full path of resource files from user
{
Scanner scan = new Scanner(System.in);
List filename = new ArrayList();
int i;
for(i=0;i<Number_of_input_files;i++)
{
System.out.println("enter the filename");
filename.add(scan.next());
}
return(filename);
}
static String getfile() // seeks the full pathname of the file which has to be matched with resource files
{
System.out.println("enter the name of file to be matched");
Scanner scan = new Scanner(System.in);
return(scan.next());
}
static Map MakeIndex(List filename) // output the index in the map.
{
BufferedReader reader = null; //buffered reader to read file
int count;
Map index = new HashMap();
for(count=0;count<filename.size();count++) // for all files mentioned in the resource list create index of its contents
{
try {
reader = new BufferedReader(new FileReader((String) filename.get(count)));
long lineNumber;
lineNumber=0;
int Count_of_words_in_document;
Count_of_words_in_document = 0;
String line = reader.readLine(); // data is read line by line
while(line!=null)
{
StringTokenizer tokens = new StringTokenizer(line, " ");// here the delimiter is <space> but it can be changed to <\n>, <\t>, <\r> etc. depending on the problem statement
lineNumber++;
long Count_of_words_in_line;
Count_of_words_in_line = 0;
while(tokens.hasMoreTokens())
{
List<File_meta_Data> temp = new ArrayList<File_meta_Data>();
String word = tokens.nextToken();
File_meta_Data metadata = new File_meta_Data();
Count_of_words_in_document++; // contains the word number in the document
Count_of_words_in_line++; // contains the word number in line. used for scoring
metadata.FileName = filename.get(count).toString();
metadata.lineNumber = lineNumber;
metadata.Position_In_Document = Count_of_words_in_document;
metadata.Position_In_Line = Count_of_words_in_line;
int occurence;
occurence=0;
if(index.containsKey(word)) //if the word has occurred already then update the entry, concatenating the older and new entries
{
Map temp7 = new HashMap();
temp7 = (Map) index.get(word);
if(temp7.containsKey(metadata.FileName)) // entry of child Map is changed
{
List<File_meta_Data> temp8 = new ArrayList<File_meta_Data>();
temp8 = (List<File_meta_Data>)temp7.get(metadata.FileName); //files which contain the word, along with its locations
temp7.remove(metadata.FileName);
temp8.add(metadata);
temp7.put(metadata.FileName, temp8); // updated entry is added
}
else // if the word occurs in this file for the first time and there is no entry in the child map
{
temp.add(metadata);
temp7.put(metadata.FileName, temp);
temp=null;
}
Map temp9 = new HashMap();
temp9 = (Map) index.get(word);
index.remove(word);
temp9.putAll(temp7);
index.put(word, temp9);
}
else // the same is done for the parent map
{
Map temp6 = new HashMap();
temp.add(metadata);
temp6.put(metadata.FileName, temp);
index.put(word,temp6);
}
}
line = reader.readLine();
}
index.put("@words_in_file:"+(String)filename.get(count),Count_of_words_in_document);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
return(index);
}
static String search(Map index,List filename) throws IOException //scores each resource file by comparing with each word in input file
{
double[] overlap = new double[filename.size()]; //stores overlap/coord scores
double[] sigma = new double[filename.size()]; // stores ∑t in q (tf(t in d) · idf(t)^2) for each resource file
int i;
double max, maxid; // stores file info with max score
max=0;
maxid= -1;
for(i=0;i<filename.size();i++)
{
overlap[i] = 0;
sigma[i] = 0;
}
String bestfile = new String();
double maxscore;
maxscore = -1;
double total;
double cord;
total=0;
File File_to_be_matched = new File(getfile());
BufferedReader reader = new BufferedReader(new FileReader(File_to_be_matched));
String line = reader.readLine();
while(line!=null) //similar to index function
{
StringTokenizer tokens = new StringTokenizer(line, " ");
while(tokens.hasMoreTokens())
{
String word = tokens.nextToken();
double tf,idf;
tf = 0;
idf = 0;
total=total+1;
if(index.containsKey(word))
{
Map temp = new HashMap();
for(i=0;i<filename.size();i++) // for each file a score is calculated for corresponding word which afterwards added
{
int j,count,docFreq;
count=0;
docFreq=0;
temp = (Map) index.get(word);
if(temp.containsKey(filename.get(i)))
{
List l2= (List) temp.get(filename.get(i));
tf = Math.sqrt(l2.size()); //calculate the term frequency: tf(t in d) = sqrt(freq)
docFreq = temp.size(); // number of files in which the word occurs
overlap[i]++;
}
else
{
tf=0;
}
idf = 1 + Math.log((double)filename.size()/(1+docFreq));// the more files a word occurs in, the lower its weight
sigma[i] = sigma[i] + Math.pow(idf,2) * tf;
}
}
}
line = reader.readLine();
}
double subsetRatio;
for(i=0;i<filename.size();i++) // all scores are added
{
int x = (int)index.get("@words_in_file:"+(String)filename.get(i));
subsetRatio = overlap[i]/x;
overlap[i] = overlap[i]/total;
overlap[i] = overlap[i] * sigma[i];
overlap[i] = overlap[i] * subsetRatio; // resource files that are near-supersets of the input get higher priority
if(max<overlap[i]) // maximum score is calculated
{
max=overlap[i];
maxid = i;
}
}
if(maxid!=-1)
return (String) (filename.get((int) maxid));
else
return("error: Matching does not took place");
}
public static void main(String[] args) throws IOException
{
List filename = new ArrayList();
int Number_of_input_files = getNumberofInputFiles();
filename = getFiles(Number_of_input_files);
Map index = new HashMap();
index = MakeIndex(filename);
//match(index);
while(true) // infinite loop: matches one input file per iteration
{
String Most_similar_file = search(index,filename);
System.out.println("the most similar file is : "+Most_similar_file);
}
}
}
The problem is to find the most similar file among several resource files. There are two sub-problems here. First, as the question states, how to find the most similar file; this is done by associating each file with a score based on different aspects of the files' contents. Second, how to match each word of the input file against comparatively large resource files.

To solve the second problem, reverse indexing has been used with HashMaps in Java. Since our problem was simple and non-modifying, I used nested Maps instead of comparator-based MapReduce while searching. Computational complexity = O(RESOURCE_FILES * TOTAL_WORDS_IN_INPUTFILE).

The first problem has been solved by the following formula:

score(q,d) = coord(q,d) · ∑t in q (tf(t in d) · idf(t)^2) · subsetRatio

1) coord(q,d) = overlap / maxOverlap
Implication: of the terms in the query, a document that contains more of them will have a higher score.
Rationale: a score factor based on how many of the query terms are found in the specified document.

2) tf(t in d) = sqrt(freq)
The term frequency factor for the term (t) in the document (d).
Implication: the more frequently a term occurs in a document, the greater its score.
Rationale: documents which contain more of a term are generally more relevant.

3) idf(t) = log(numDocs/(docFreq+1)) + 1
Implication: the more documents a term occurs in, the lower its score.
Rationale: common terms are less important than uncommon ones.

4) subsetRatio = number of occurring words / total words
Implication: given two files that both contain all of the input file, the one with less excess data will have a higher similarity.
Rationale: files with similar content must have higher priority.
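For illustration, here is a minimal, self-contained sketch of that scoring formula for a single resource file. The method and its parameters are hypothetical (they are not part of the program above), and it uses floating-point math throughout instead of the int casts in the code:

// Hypothetical helper illustrating the scoring formula above; not part of the program itself.
// overlap      = number of query terms found in the document
// maxOverlap   = total number of query terms
// termFreqs[i] = frequency of the i-th matched term in the document
// docFreqs[i]  = number of documents containing the i-th matched term
// numDocs      = total number of resource files
// subsetRatio  = occurring words / total words in the document
static double score(int overlap, int maxOverlap, int[] termFreqs,
                    int[] docFreqs, int numDocs, double subsetRatio)
{
    double coord = (double) overlap / maxOverlap;                        // coord(q,d)
    double sigma = 0;
    for (int i = 0; i < termFreqs.length; i++)
    {
        double tf = Math.sqrt(termFreqs[i]);                             // tf(t in d)
        double idf = Math.log((double) numDocs / (docFreqs[i] + 1)) + 1; // idf(t)
        sigma += tf * idf * idf;                                         // tf · idf^2
    }
    return coord * sigma * subsetRatio;
}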
****************test cases************************
1) input file has no words in common with the resource files
2) input file is similar in content to one of the resource files
3) input file is similar in content but different in metadata (meaning the position of words differs)
4) input file is a subset of the resource files
5) input file contains only very common words, like all 'a' or 'and'
6) input file is not at the given location
7) input file cannot be read
Look into opening the files and reading them as characters. You just need to get a char from each, then check whether they match. If they match, increment both the total counter and the match counter; if they don't, increment only the total counter.
Read more on handling files and streams here: http://docs.oracle.com/javase/tutorial/essential/io/charstreams.html
An example would be this:
BufferedReader br1 = null;
BufferedReader br2 = null;
try
{
    br1 = new BufferedReader(new InputStreamReader(new FileInputStream(new File("a.txt")), "UTF-8"));
    br2 = new BufferedReader(new InputStreamReader(new FileInputStream(new File("b.txt")), "UTF-8"));
    //add logic here
}
catch (Exception e)
{
    e.printStackTrace();
}
finally
{
    if (br1 != null)
    {
        try
        {
            br1.close();
        }
        catch (Exception e)
        {
        }
    }
    if (br2 != null)
    {
        try
        {
            br2.close();
        }
        catch (Exception e)
        {
        }
    }
}
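A minimal sketch of the counting loop the answer describes, which could replace the `//add logic here` placeholder above. The variable names are illustrative, and only positions present in both files are compared:

int total = 0;
int matches = 0;
int c1, c2;
// read both files in lockstep until either one is exhausted
while ((c1 = br1.read()) != -1 && (c2 = br2.read()) != -1)
{
    if (c1 == c2)
    {
        matches++;
    }
    total++;
}
// guard against division by zero for two empty files
double percentage = total == 0 ? 0.0 : (100.0 * matches) / total;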

-
I'm not going to solve the problem completely, this is obviously a homework assignment. If there is anything wrong with my fundamental logic, please notify me. If the problem is that I used pre-1.7 IO, they never specified the Java version used. – EpicPandaForce Jun 02 '14 at 14:14
-
Which logic? You only create two buffers in 35 lines of code and you are not even reading them. And if you don't want to answer the question, then don't answer the question. – Thomas Uhrig Jun 02 '14 at 16:14
-
I assumed he had trouble with processing files, not with Java coding fundamentals. I might have been wrong. The last sentence I agree with in retrospect. I'll keep the answer here because I close the streams properly; some people don't know how to do that because it's always omitted. – EpicPandaForce Jun 02 '14 at 16:20