I have to search multiple large XML files (almost 20,000 files) for a list of almost 200,000 names. A name may contain multiple words, brackets, quotes, etc. [e.g. Royal enterprise (Incorporated)]. For each file, I need to find out whether it contains any EXACT matches from the 200,000 names.
- I can loop over the names for every file and search using String.contains(). This search is fast, but the result is not accurate because it also matches parts of words.
e.g. "concatenation".contains("cat") returns true,
but the expected result is false, as "cat" is not an exact match.
- I can use regex. The result is accurate, but performance is poor.
For example, when I search for these 200,000 names in a single file:
i) using String.contains() takes --> 5 sec
ii) using regex takes --> 340 sec
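import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

public Set<String> isContainExactWords(Map<Integer, String> name, String searchFile) {
    /*
     * The XML has already been parsed; searchFile holds the CDATA as a String.
     */
    Set<String> sNamesFound = new TreeSet<String>();
    // The names are keyed 1..name.size().
    for (int count = 1; count <= name.size(); count++) {
        String current = name.get(count);
        // Match the name only when it is delimited by whitespace or by the start/end of the text.
        String pattern = "(?<!\\S)" + Pattern.quote(current) + "(?!\\S)";
        Pattern p = Pattern.compile(pattern);
        if (p.matcher(searchFile).find()) {
            sNamesFound.add(current);
        }
    }
    return sNamesFound;
}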
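To make the intended "exact match" semantics concrete, here is a minimal sketch of the same whitespace-delimited check written with String.indexOf() instead of a regex. The class name BoundaryContains and the helpers isRegexSpace, containsExact and findExact are hypothetical names introduced only for this illustration. The helper isRegexSpace is meant to mirror the default (non-Unicode) meaning of \s in java.util.regex. Unlike the regex version, this sketch treats an empty name as "not found". I have not measured whether it is actually faster on my data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class BoundaryContains {

    // Mirrors the default regex \s class: space, \t, \n, \u000B, \f, \r.
    public static boolean isRegexSpace(char c) {
        return c == ' ' || (c >= '\t' && c <= '\r');
    }

    // True if name occurs in text with whitespace (or the start/end of text) on both sides.
    public static boolean containsExact(String text, String name) {
        if (name.isEmpty()) {
            return false; // assumption: empty names are never a match
        }
        int from = 0;
        while (true) {
            int start = text.indexOf(name, from);
            if (start < 0) {
                return false; // no further occurrences
            }
            int end = start + name.length();
            boolean leftOk = start == 0 || isRegexSpace(text.charAt(start - 1));
            boolean rightOk = end == text.length() || isRegexSpace(text.charAt(end));
            if (leftOk && rightOk) {
                return true;
            }
            from = start + 1; // this occurrence was part of a longer token; keep looking
        }
    }

    // Same contract as isContainExactWords above, but iterating over the map values.
    public static Set<String> findExact(Map<Integer, String> names, String text) {
        Set<String> found = new TreeSet<>();
        for (String name : names.values()) {
            if (containsExact(text, name)) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<Integer, String> names = new LinkedHashMap<>();
        names.put(1, "cat");
        names.put(2, "Royal enterprise (Incorporated)");
        String text = "concatenation by Royal enterprise (Incorporated) today";
        System.out.println(findExact(names, text)); // prints [Royal enterprise (Incorporated)]
    }
}
```

This still loops over every name for every file, so it only removes the per-name regex compilation and matching cost; it does not change the overall number of name/file comparisons.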
- The files are large XML files: there are around 20,000 of them, and most are about 200 KB each.
- The search list has 200,000 elements.
I need better performance when searching for exact matches in these files.