4

Below is my code to find the occurrences of all the substrings in a given single string

public static void main(String... args) {
    String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
    String[] severalStringArray = { "one", "two", "three", "four" };
    Map<String, Integer> countMap = countWords(fullString, severalStringArray);
}

public static Map<String, Integer> countWords(String fullString, String[] severalStringArray) {
    Map<String, Integer> countMap = new HashMap<>();

    for (String searchString : severalStringArray) {
        if (countMap.containsKey(searchString)) {
            int searchCount = countMatchesInString(fullString, searchString);
            countMap.put(searchString, countMap.get(searchString) + searchCount);
        } else
            countMap.put(searchString, countMatchesInString(fullString, searchString));
    }

    return countMap;
}

private static int countMatchesInString(String fullString, String subString) {
    int count = 0;
    int pos = fullString.indexOf(subString);
    while (pos > -1) {
        count++;
        pos = fullString.indexOf(subString, pos + 1);
    }
    return count;
}

Assume the full string might be a full file read as a string. Is the above is the efficient way of search or any other better way or fastest way to do it?

Thanks

Oleg Cherednik
  • 17,377
  • 4
  • 21
  • 35
  • You can look for `Trie` data structure to reduce time complexity – Ashishkumar Singh Aug 31 '21 at 07:26
  • For string matching you can use Knuth–Morris–Pratt algorithm . – Deepeshkumar Aug 31 '21 at 07:31
  • 1
    To be clear, the question you asked is about *counting* the number of times multiple strings appear in another string. This means that simple solutions involving regexes, etcetera do not work. – Stephen C Aug 31 '21 at 07:45
  • Also ... beware of the case where the search strings overlap; e.g. `{"one","onerous"}`. This pretty much precludes using a regex with alternates. – Stephen C Aug 31 '21 at 07:49
  • @Deepeshkumar We need a example with the code snippet. Algorithm is too complex for a developer to understand and implement. If you share an example it would be great to comprehend or understand. – integ specialist Aug 31 '21 at 17:11
  • @integspecialist Yes you are right. I am posting the link to the code for the algorithm, the way I understood and modified and implemented. Link : https://codereview.stackexchange.com/q/265575/175198 – Deepeshkumar Aug 31 '21 at 17:35

4 Answers4

3

You could just form a regex alternation of words to search, and then do a single search against that regex:

public static int matchesInString(String fullString, String regex) {
    int count = 0;

    Pattern r = Pattern.compile(regex);
    Matcher m = r.matcher(fullString);

    while (m.find())
        ++count;

    return count;
}

String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
String[] severalStringArray = { "one", "two", "three", "four" };
String regex = "\\b(?:" + String.join("|", severalStringArray) + ")\\b";

int count = matchesInString(fullString, regex);
System.out.println("There were " + count + " matches in the input");

This prints:

There were 8 matches in the input

Note that the regex pattern used in the above example was:

\b(?:one|two|three|four)\b
Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • i tried Regex and Pattern and found to be slow when working on huge file string. Did you get the speed when dealing with large file as well ? – integ specialist Aug 31 '21 at 17:12
  • The question is whether indexOf or Regex or contains - which one searches fast in java in huge string should be explored or verified. – integ specialist Aug 31 '21 at 17:13
1

Regular expressions

Your problem can be solved using regex (regular expressions). Regular expressions are a tool that help you matching patterns in strings. This pattern can be a word or can be a set of chars.

Regular expressions in Java

In Java there are two Objects helping you with regular expressions: Pattern and Matcher.

Below you can see an example for searching if the word stackoverflow exists in the string stackoverflowXstackoverflowXXXstackoverflowXX in Java.

String pattern = "stackoverflow";
String stringToExamine = "stackoverflowXstackoverflowXXXstackoverflowXX";

Pattern patternObj = Pattern.compile(pattern);
Matcher matcherObj = patternObj.matcher(stringToExamine);

Counting how many occurrencies of a word in a given string

As written here you have different solution based on your Java version:

Java 9+

long matches = matcherObj.results().count();

Older Java versions

int count = 0;
while (matcherObj.find())
    count++;

Regular expressions in your problem

You use a method for calculating how many times a word is occurring in a text (a string), and you can modify it like this:

Java 9+

public static int matchesInString(String fullString, String pattern)
{
    Pattern patternObj = Pattern.compile(pattern);
    Matcher matcherObj = patternObj.matcher(fullString);
    
    return matcherObj.results().count();
}

Older Java versions

public static int matchesInString(String fullString, String pattern)
{
    int count = 0;

    Pattern patternObj = Pattern.compile(pattern);
    Matcher matcherObj = patternObj.matcher(fullString);
    
    while (matcherObj.find())
        count++;
        
    return count;
}
0

Actually, the fastest way is to scan the string first and count all existed words and save it into Map. Then select required words only.

Just be simple! The regular expression is too complicated and not efficient for this simple task. Let's solve it with a hummer!

public static void main(String... args) {
    String str = "one is a good one. two is ok. three is three. four is four. five is not four";
    Set<String> words = Set.of("one", "two", "three", "four");
    Map<String, Integer> map = countWords(str, words);
}

public static Map<String, Integer> countWords(String str, Set<String> words) {
    Map<String, Integer> map = new HashMap<>();

    for (int i = 0, j = 0; j <= str.length(); j++) {
        char ch = j == str.length() ? '\0' : str.charAt(j);

        if (j == str.length() || !isWordSymbol(ch)) {
            String word = str.substring(i, j);

            if (!word.isEmpty() && words.contains(word))
                map.put(word, map.getOrDefault(word, 0) + 1);

            i = j + 1;
        }
    }

    return map;
}

private static boolean isWordSymbol(char ch) {
    return Character.isLetter(ch) || ch == '-' || ch == '_';
}
Oleg Cherednik
  • 17,377
  • 4
  • 21
  • 35
  • Is creating a substring will not create an immutable object? Does it would be really performance better compared to indexOf? What are the issue the above solves instead of going for indexOf – integ specialist Sep 01 '21 at 03:26
  • `regexp` is too complicated. `indexOf(char)` I think is not fit, because you could have many letters and many whitespace characters. – Oleg Cherednik Sep 01 '21 at 13:33
0

An implementation of the Trie tree that someone commented on. I don't know if it's fast or not.

static class Trie {

    static final long INC_NODE_NO = 1L << Integer.SIZE;

    private long nextNodeNo = 0;
    private Node root = new Node();
    private final Map<Long, Node> nodes = new HashMap<>();

    public void put(String word) {
        Node node = root;
        for (int i = 0, len = word.length(); i < len; ++i)
            node = node.put(word.charAt(i));
        node.data = word;
    }

    public List<String> findPrefix(String text, int start) {
        List<String> result = new ArrayList<>();
        Node node = root;
        for (int i = start, length = text.length(); i < length; ++i) {
            if ((node = node.get(text.charAt(i))) == null)
                break;
            String v = node.data;
            if (v != null)
                result.add(v);
        }
        return result;
    }

    public Map<String, Integer> find(String text) {
        Map<String, Integer> result = new HashMap<>();
        for (int i = 0, length = text.length(); i < length; ++i)
            for (String w : findPrefix(text, i))
                result.compute(w, (k, v) -> v == null ? 1 : v + 1);
        return result;
    }

    class Node {
        final long no;
        String data;

        Node() {
            this.no = nextNodeNo;
            nextNodeNo += INC_NODE_NO;
        }

        Node get(int key) {
            return nodes.get(no | key);
        }

        Node put(int key) {
            return nodes.computeIfAbsent(no | key, k -> new Node());
        }
    }
}

public static void main(String args[]) throws IOException {
    String fullString = "one is a good one. two is ok. three is three. four is four. five is not four";
    String[] severalStringArray = { "one", "two", "three", "four" };
    Trie trie = new Trie();
    for (String word : severalStringArray)
        trie.put(word);
    Map<String, Integer> count = trie.find(fullString);
    System.out.println(count);
}

output:

{four=3, one=2, three=2, two=1}