I was given a list of over 90,000 names. I need to find every pair of names with a similarity of at least 50% and write the results to a file in the format:
ID 1, ID 2, Similarity percent
I already have an algorithm that computes the similarity, but iterating through the whole list takes a lot of time. Can someone suggest a faster way to compare the names?
Below is the code:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public static void main(String[] args) throws IOException {
    // Read every line of the CSV; the list index serves as the ID.
    List<String> list = new ArrayList<>();
    try (Scanner scanner = new Scanner(new File("name.csv"))) {
        while (scanner.hasNextLine()) {
            list.add(scanner.nextLine());
        }
    }
    long start = System.currentTimeMillis();
    // Write each match as it is found instead of buffering all of them in memory.
    try (BufferedWriter out = new BufferedWriter(new FileWriter("output.txt"))) {
        // Compare every pair exactly once: n * (n - 1) / 2 calls to simi().
        for (int i = 0; i < list.size(); i++) {
            for (int j = i + 1; j < list.size(); j++) {
                int percent = StringSimilarity.simi(list.get(i), list.get(j));
                if (percent >= 50) {
                    out.write("ID " + i + ",ID " + j + "," + percent + " percent\n");
                }
            }
        }
    }
    long end = System.currentTimeMillis();
    System.out.println((end - start) / 1000 + " second(s)");
}
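// Requires: import java.util.regex.Pattern;
// Compiled once and reused, rather than on every call.
private static final Pattern NON_LETTERS = Pattern.compile("[^a-z A-Z]");

// Removes every character that is not an ASCII letter or a space.
public static String getString(String s) {
    return NON_LETTERS.matcher(s).replaceAll("");
}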
This is a sample of the data. The names are stored in a .csv file, so I read the file and store each line in the list.
FIRST NAME,SURNAME,OTHER NAME,MOTHER's MAIDEN NAME
Kingsley, eze, Ben, cici
Eze, Daniel, Ben, julie
Jon, Smith, kelly, Joe
Joseph, tan, , chellie
Joseph,tan,jese,chellie
...and so on. A person has at least 3 names. As stated earlier, the program checks how similar the names are. When comparing ID 1 and ID 2, "ben" and "eze" are common to both, so 2 of the 4 names match and the similarity is 50 percent. Comparing ID 4 and ID 5, the similarity is 75 percent, because they have 3 names in common even though ID 4 has no third name.
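To make the rule concrete, here is a minimal sketch of how such a similarity could be computed. It is not my actual StringSimilarity.simi; the class and method names are hypothetical. It assumes, based on the examples above, that matching ignores case and column position, that empty fields are skipped, and that the percentage is the number of shared names divided by the 4 columns.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the original StringSimilarity class.
class NameSimilaritySketch {
    // Splits one CSV row into a set of lower-cased, non-empty names.
    static Set<String> tokens(String row) {
        Set<String> out = new HashSet<>();
        for (String part : row.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                out.add(name);
            }
        }
        return out;
    }

    // Assumed definition: shared names divided by the 4 columns, as a percentage.
    static int similarity(String rowA, String rowB) {
        Set<String> common = tokens(rowA);
        common.retainAll(tokens(rowB)); // keep only names present in both rows
        return common.size() * 100 / 4;
    }
}
```

With the sample rows, similarity("Kingsley, eze, Ben, cici", "Eze, Daniel, Ben, julie") gives 50 and similarity("Joseph, tan, , chellie", "Joseph,tan,jese,chellie") gives 75. Because each row is reduced to a set, a name repeated in two columns of the same row counts only once, which may or may not be what is wanted.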
So the problem is this: during the similarity check with the two for loops, I take ID 1, compare it against the remaining 90,000 names, and save the IDs with which it has >= 50 percent similarity. Then I take ID 2 and do the same, and so on. For 90,000 names that is about 4 billion comparisons.
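One direction that might avoid comparing every pair, sketched here under the assumption that similarity is simply the count of shared names over 4 columns: build an index from each name to the IDs that contain it, then only count shared names between rows that have at least one name in common. The class and method names below are hypothetical, and the list index is used as the ID, as in the loops above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an inverted index over names.
class SharedNameIndexSketch {
    // Splits one CSV row into a set of lower-cased, non-empty names.
    static Set<String> tokens(String row) {
        Set<String> out = new HashSet<>();
        for (String part : row.split(",")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                out.add(name);
            }
        }
        return out;
    }

    // Returns "ID i,ID j,percent percent" lines for pairs with >= 50 percent similarity.
    static List<String> similarPairs(List<String> rows) {
        // Inverted index: name -> IDs of the rows containing it, in ascending order.
        Map<String, List<Integer>> index = new HashMap<>();
        List<Set<String>> rowTokens = new ArrayList<>();
        for (int id = 0; id < rows.size(); id++) {
            Set<String> names = tokens(rows.get(id));
            rowTokens.add(names);
            for (String name : names) {
                index.computeIfAbsent(name, k -> new ArrayList<>()).add(id);
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            // For each later row sharing at least one name with row i, count shared names.
            Map<Integer, Integer> shared = new HashMap<>();
            for (String name : rowTokens.get(i)) {
                for (int j : index.get(name)) {
                    if (j > i) {
                        shared.merge(j, 1, Integer::sum);
                    }
                }
            }
            for (Map.Entry<Integer, Integer> e : shared.entrySet()) {
                int percent = e.getValue() * 100 / 4; // assumed: 4 name columns
                if (percent >= 50) {
                    out.add("ID " + i + ",ID " + e.getKey() + "," + percent + " percent");
                }
            }
        }
        return out;
    }
}
```

Rows that share no name are never compared, so the work depends on how many rows share each name rather than on the total number of pairs. If a few names are very common, those groups are still large and the running time can remain close to quadratic; the order of matches for a given ID is also not guaranteed, since they come out of a HashMap.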