
I have to search a large number of XML files (almost 20000) for a list of almost 200000 names. A name may contain multiple words, parentheses, quotes, etc. [e.g. Royal enterprise (Incorporated)]. For each file, I need to find out whether it contains any EXACT matches from the 200000 names.

  1. I can loop over the names for every file and search using String.contains(). This search is fast, but the result is not accurate because it also matches parts of words.

    e.g. "concatenation".contains("cat") gives true

but the expected result is 'false', as "cat" is not an exact match.

  2. I can use regex. The result is accurate, but performance is poor.
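To make the difference between the two options concrete, here is a minimal sketch that runs both checks on the same input. The class name `ExactMatchDemo` and its helper methods are made up for illustration; the regex is the same whitespace-delimited pattern used in my code further down.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the real project.
class ExactMatchDemo {
    // Substring check: true whenever the characters occur anywhere in the text.
    static boolean substringMatch(String text, String name) {
        return text.contains(name);
    }

    // Exact check: the quoted name must not have a non-whitespace character
    // directly before or after it.
    static boolean exactMatch(String text, String name) {
        Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(name) + "(?!\\S)");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(substringMatch("concatenation", "cat")); // true
        System.out.println(exactMatch("concatenation", "cat"));     // false
        System.out.println(exactMatch(
                "owned by Royal enterprise (Incorporated) since 1990",
                "Royal enterprise (Incorporated)"));                // true
    }
}
```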

For example, when I search for these 200000 names in a single file:

 i) using String.contains() takes --> 5 sec
 ii) using regex takes --> 340 sec



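import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

public Set<String> isContainExactWords(Map<Integer, String> name, String searchFile) {
    /*
     * The XML has already been parsed; searchFile holds the CDATA content as a String.
     */
    Set<String> sNamesFound = new TreeSet<String>();

    // Iterate over the values directly instead of assuming the keys are exactly 1..size.
    for (String candidate : name.values()) {
        // Match the quoted name only when it is delimited by whitespace or the text boundaries.
        Pattern p = Pattern.compile("(?<!\\S)" + Pattern.quote(candidate) + "(?!\\S)");
        if (p.matcher(searchFile).find()) {
            sNamesFound.add(candidate);
        }
    }
    return sNamesFound;
}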
  1. The files are large XML files; there are around 20000 of them, and most are about 200 KB.
  2. The search list has 200000 elements.

I need better performance when searching for exact matches in each file.
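One possible middle ground between the two options, sketched here under assumptions and not measured against the real data: keep the fast literal search, but check the characters on either side of each hit. The class `BoundaryScan` and its methods are hypothetical names. It treats a "boundary" as whitespace according to `Character.isWhitespace`, which is slightly broader than the regex `\s` class, so results could differ from the regex version for unusual whitespace characters.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: literal search plus a manual boundary check.
class BoundaryScan {
    // True if name occurs in text with whitespace (or the start/end of text) on both sides.
    static boolean containsExact(String text, String name) {
        if (name.isEmpty()) {
            return false;
        }
        int from = 0;
        while (true) {
            int idx = text.indexOf(name, from);
            if (idx < 0) {
                return false; // no further occurrences
            }
            int end = idx + name.length();
            boolean leftOk = idx == 0 || Character.isWhitespace(text.charAt(idx - 1));
            boolean rightOk = end == text.length() || Character.isWhitespace(text.charAt(end));
            if (leftOk && rightOk) {
                return true;
            }
            from = idx + 1; // this hit was inside a longer token; keep looking
        }
    }

    // Collects every name that has at least one whitespace-delimited occurrence.
    static Set<String> findExact(Collection<String> names, String text) {
        Set<String> found = new TreeSet<>();
        for (String name : names) {
            if (containsExact(text, name)) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> names = List.of("cat", "Royal enterprise (Incorporated)");
        String text = "concatenation by Royal enterprise (Incorporated) today";
        System.out.println(findExact(names, text)); // [Royal enterprise (Incorporated)]
    }
}
```

This still loops over all 200000 names per file, so it only removes the regex overhead, not the number of passes over the text.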

jaco
  • My regex is not so good. Could you give an example of the files content? How are the strings saved there? Are they properties of an xml tag? – findusl Nov 09 '19 at 11:38
  • You aren't searching files here, you are searching strings in memory. If the names constitute an entire element name or attribute you might be better off using XPath. – user207421 Nov 09 '19 at 11:40
  • Make the compiled pattern a class constant. (`seachFile` lacks an r.) Get a `matcher` *once* and use [`find(int start)`](https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/regex/Matcher.html#find(int)). – greybeard Nov 09 '19 at 11:46
  • @greybeard corrected. How can I use matcher and find(int start)? Any examples? – jaco Nov 09 '19 at 12:00
  • I do *not* think this question to be a duplicate of [Create array of regex matches](https://stackoverflow.com/q/6020384): This question asks to create sets of strings matched from a large set of literals, the supposed duplicate Q&A is about an array/a collection of matches from a *single* RegEx, not precluding duplicates. Moreover, while it is possible to construct a RegEx from 200000 literals, java.util.regex may not be the tool to use. – greybeard Nov 09 '19 at 13:44
  • @greybeard are there any other ways to improve the performance and get the expected result? Any workaround other than regex? – jaco Nov 09 '19 at 13:59
  • (In a hurry, I thought it would be feasible to construct a RegEx to *tokenise* each file contents and look each token up in `Map<>names` - not so, starting with `Mapname`.) Preprocessing: Construct a RegEx very similar to the way you suggested, but giving each and every *name* as an alternative ('|'). Have a Pattern compiled from that. File content processing: get a `Matcher` from that pattern, and have the match [stream collected into a set](https://docs.oracle.com/en/java/javase/13/docs/api/java.base/java/util/stream/Collectors.html#toSet()) much like suggested in the duplicate. – greybeard Nov 09 '19 at 14:07
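
The last comment's suggestion (one pattern with every name as an alternative, compiled once, then the match stream collected into a set) might look roughly like the sketch below. The class `AlternationSearch` and its method names are hypothetical. It assumes Java 9+ for `Matcher.results()` and a non-empty name list. Whether `java.util.regex` copes well with 200000 alternatives is untested here.

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch of the single-pattern approach from the comment above.
class AlternationSearch {
    // Preprocessing, done once: quote every name and join them with '|'.
    // Longer names are listed first so they are tried before their own prefixes.
    static Pattern compileNames(Collection<String> names) {
        String alternatives = names.stream()
                .sorted(Comparator.<String>comparingInt(String::length).reversed())
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("(?<!\\S)(?:" + alternatives + ")(?!\\S)");
    }

    // Per file: one pass over the text, collecting the distinct matched names.
    static Set<String> namesFound(Pattern allNames, String text) {
        return allNames.matcher(text).results()
                .map(MatchResult::group)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        Pattern allNames = compileNames(
                List.of("cat", "Acme", "Royal enterprise (Incorporated)"));
        String text = "concatenation at Acme and Royal enterprise (Incorporated)";
        System.out.println(namesFound(allNames, text));
        // [Acme, Royal enterprise (Incorporated)]
    }
}
```

One caveat: matches found this way do not overlap, so a name that only occurs inside or overlapping another matched name would not be reported, unlike with the per-name loop in the question.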
