12

I'm building a classifier which has to read through a lot of textdocuments, but I found out that my countWordFrequenties method gets slower the more documents it has processed. This method underneath takes 60ms (on my PC), while reading, normalizing, tokenizing, updating my vocabulary and equalizing of different lists of integers only takes 3-5ms in total (on my PC). My countWordFrequencies method is as follows:

public List<Integer> countWordFrequencies(String[] tokens) 
{
    List<Integer> wordFreqs = new ArrayList<>(vocabulary.size());
    int counter = 0;

    for (int i = 0; i < vocabulary.size(); i++) 
    {
        for (int j = 0; j < tokens.length; j++)
            if (tokens[j].equals(vocabulary.get(i)))
                counter++;

        wordFreqs.add(i, counter);
        counter = 0;
    }

    return wordFreqs;
}

What is the best way for me to speed this process up? What is the problem of this method?

This is my entire Class, there is another Class Category, is it a good idea to post this also here or don't you guys need it?

public class BayesianClassifier 
{
    private Map<String,Integer>  vocabularyWordFrequencies;
    private List<String> vocabulary;
    private List<Category> categories;
    private List<Integer> wordFrequencies;
    private int trainTextAmount;
    private int testTextAmount;
    private GUI gui;

    public BayesianClassifier() 
    {
        this.vocabulary = new ArrayList<>();
        this.categories = new ArrayList<>();
        this.wordFrequencies = new ArrayList<>();
        this.trainTextAmount = 0;
        this.gui = new GUI(this);
        this.testTextAmount = 0;
    }

    public List<Category> getCategories() 
    {
        return categories;
    }

    public List<String> getVocabulary() 
    {
        return this.vocabulary;
    }

    public List<Integer> getWordFrequencies() 
    {
        return  wordFrequencies;
    }

    public int getTextAmount() 
    {
        return testTextAmount + trainTextAmount;
    }

    public void updateWordFrequency(int index, Integer frequency)
    {
        equalizeIntList(wordFrequencies);
        this.wordFrequencies.set(index, wordFrequencies.get(index) + frequency);
    }

    public String readText(String path) 
    {
        BufferedReader br;
        String result = "";

        try 
        {
            br = new BufferedReader(new FileReader(path));

            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) 
            {
                sb.append(line);
                sb.append("\n");
                line = br.readLine();
            }

            result = sb.toString();
            br.close();
        } 
        catch (IOException e) 
        {
            e.printStackTrace();
        }

        return result;
    }

    public String normalizeText(String text) 
    {
        String fstNormalized = Normalizer.normalize(text, Normalizer.Form.NFD);

        fstNormalized = fstNormalized.replaceAll("[^\\p{ASCII}]","");
        fstNormalized = fstNormalized.toLowerCase();
        fstNormalized = fstNormalized.replace("\n","");
        fstNormalized = fstNormalized.replaceAll("[0-9]","");
        fstNormalized = fstNormalized.replaceAll("[/()!?;:,.%-]","");
        fstNormalized = fstNormalized.trim().replaceAll(" +", " ");

        return fstNormalized;
    }

    public String[] handleText(String path) 
    {
        String text = readText(path);
        String normalizedText = normalizeText(text);

        return tokenizeText(normalizedText);
    }

    public void createCategory(String name, BayesianClassifier bc) 
    {
        Category newCategory = new Category(name, bc);

        categories.add(newCategory);
    }

    public List<String> updateVocabulary(String[] tokens) 
    {
        for (int i = 0; i < tokens.length; i++)
            if (!vocabulary.contains(tokens[i]))
                vocabulary.add(tokens[i]);

        return vocabulary;
    }

    public List<Integer> countWordFrequencies(String[] tokens)
    {
        List<Integer> wordFreqs = new ArrayList<>(vocabulary.size());
        int counter = 0;

        for (int i = 0; i < vocabulary.size(); i++) 
        {
            for (int j = 0; j < tokens.length; j++)
                if (tokens[j].equals(vocabulary.get(i)))
                    counter++;

            wordFreqs.add(i, counter);
            counter = 0;
        }

        return wordFreqs;
    }

    public String[] tokenizeText(String normalizedText) 
    {
        return normalizedText.split(" ");
    }

    public void handleTrainDirectory(String folderPath, Category category) 
    {
        File folder = new File(folderPath);
        File[] listOfFiles = folder.listFiles();

        if (listOfFiles != null) 
        {
            for (File file : listOfFiles) 
            {
                if (file.isFile()) 
                {
                    handleTrainText(file.getPath(), category);
                }
            }
        } 
        else 
        {
            System.out.println("There are no files in the given folder" + " " + folderPath.toString());
        }
    }

    public void handleTrainText(String path, Category category) 
    {
        long startTime = System.currentTimeMillis();

        trainTextAmount++;

        String[] text = handleText(path);

        updateVocabulary(text);
        equalizeAllLists();

        List<Integer> wordFrequencies = countWordFrequencies(text);
        long finishTime = System.currentTimeMillis();

        System.out.println("That took 1: " + (finishTime-startTime)+ " ms");

        long startTime2 = System.currentTimeMillis();

        category.update(wordFrequencies);
        updatePriors();

        long finishTime2 = System.currentTimeMillis();

        System.out.println("That took 2: " + (finishTime2-startTime2)+ " ms");
    }

    public void handleTestText(String path) 
    {
        testTextAmount++;

        String[] text = handleText(path);
        List<Integer> wordFrequencies = countWordFrequencies(text);
        Category category = guessCategory(wordFrequencies);
        boolean correct = gui.askFeedback(path, category);

        if (correct) 
        {
            category.update(wordFrequencies);
            updatePriors();
            System.out.println("Kijk eens aan! De tekst is succesvol verwerkt.");
        } 
        else 
        {
            Category correctCategory = gui.askCategory();
            correctCategory.update(wordFrequencies);
            updatePriors();
            System.out.println("Kijk eens aan! De tekst is succesvol verwerkt.");
        }
    }

    public void updatePriors()
    {
        for (Category category : categories)
        {
            category.updatePrior();
        }
    }

    public Category guessCategory(List<Integer> wordFrequencies) 
    {
        List<Double> chances = new ArrayList<>();

        for (int i = 0; i < categories.size(); i++)
        {
            double chance = categories.get(i).getPrior();

            System.out.println("The prior is:" + chance);

            for(int j = 0; j < wordFrequencies.size(); j++)
            {
                chance = chance * categories.get(i).getWordProbabilities().get(j);
            }

            chances.add(chance);
        }

        double max = getMaxValue(chances);
        int index = chances.indexOf(max);

        System.out.println(max);
        System.out.println(index);
        return categories.get(index);
    }

    public double getMaxValue(List<Double> values)
    {
        Double max = 0.0;

        for (Double dubbel : values)
        {
            if(dubbel > max)
            {
                max = dubbel;
            }
        }

        return max;
    }

    public void equalizeAllLists()
    {
        for(Category category : categories)
        {
            if (category.getWordFrequencies().size() < vocabulary.size())
            {
                category.setWordFrequencies(equalizeIntList(category.getWordFrequencies()));
            }
        }

        for(Category category : categories)
        {
            if (category.getWordProbabilities().size() < vocabulary.size())
            {
                category.setWordProbabilities(equalizeDoubleList(category.getWordProbabilities()));
            }
        }
    }

    public List<Integer> equalizeIntList(List<Integer> list)
    {
        while (list.size() < vocabulary.size())
        {
            list.add(0);
        }

        return list;
    }

    public List<Double> equalizeDoubleList(List<Double> list)
    {
        while (list.size() < vocabulary.size())
        {
            list.add(0.0);
        }

        return list;
    }

    public void selectFeatures()
    {
        for(int i = 0; i < wordFrequencies.size(); i++)
        {
            if(wordFrequencies.get(i) < 2)
            {
                vocabulary.remove(i);
                wordFrequencies.remove(i);

                for(Category category : categories)
                {
                    category.removeFrequency(i);
                }
            }
        }
    }
}
mixel
  • 25,177
  • 13
  • 126
  • 165
Breus
  • 368
  • 1
  • 13
  • Can you phrase your question more clearly. What takes 50 ms and what takes 3-5ms is not clear – vinay Dec 28 '15 at 20:55
  • Sorry, edit is there, this method takes 50ms to execute for one text, while a set of six other methods only takes 2-3ms (both relatively simple). I know that this one is a bit harder but 50ms looks a bit odd to me. – Breus Dec 28 '15 at 20:58
  • This method makes a list of integers of how many times words from my vocabulary appear in the 'tokens' which is a tokenized text. – Breus Dec 28 '15 at 20:59
  • Can you show more code. We dont know what vocabulary variable really is – vinay Dec 28 '15 at 21:00
  • Everything works but really slow... PS: guys, thanks for your interest, i'm editting this post completely now! – Breus Dec 28 '15 at 21:03
  • 1
    I misread the code, you're right, it's correct. Strangely programmed, but correct. – JB Nizet Dec 28 '15 at 21:08
  • @JBNizet every help is appreciated if you know a better way, i'm just a beginner in Java. – Breus Dec 28 '15 at 21:22
  • I think rather than using List for vocabulary, you should use some kind of Set. For example, HashSet. This would be an improvement. – vinay Dec 28 '15 at 21:23
  • But the ordering of vocabulary is super important in my code so a Set won't do @vinay. – Breus Dec 28 '15 at 21:24
  • The use LinkedHashSet – vinay Dec 28 '15 at 21:25
  • The `add(i, counter)` could be replaced by `add(counter)`. You could even use an array of Strings, since you know the size of the list in advance, and it doesn't have to grow dynamically. – JB Nizet Dec 28 '15 at 21:25
  • @JBNizet an array of Integers :) – v.ladynev Dec 28 '15 at 21:33
  • @v.ladynev of course. It's getting late... – JB Nizet Dec 28 '15 at 21:35
  • I'm going to change `vocabulary` and `wordFrequencies` Lists to one `Map` tomorrow, thanks for your help! – Breus Dec 28 '15 at 21:36
  • @TotalCare: you may also benefit from getting your code reviewed on codereview.stackexchange.com – ChrisWue Dec 29 '15 at 07:09

4 Answers4

11

Your method has O(n*m) run time ( n being the vocabulary size and m the token size). With hashing this could be reduced to O(m) which is clearly better.

for (String token: tokens) {
  if(!map.containsKey(token)){
      map.put(token,0);
  }
  map.put(token,map.get(token)+1);
}
Sleiman Jneidi
  • 22,907
  • 14
  • 56
  • 77
  • Where is a `vocabulary` here? :) – v.ladynev Dec 28 '15 at 21:06
  • This is still O(n^2) behind the scenes. – Voicu Dec 28 '15 at 21:06
  • Check implementation of `containsKey`. Still doing nested looping. – Voicu Dec 28 '15 at 21:09
  • 1
    http://stackoverflow.com/questions/8923251/what-is-the-time-complexity-of-hashmap-containskey-in-java – Jake C. Dec 28 '15 at 21:11
  • 1
    @Voicu A loop for the worst case. `containsKey` has `O(1)` complexity – v.ladynev Dec 28 '15 at 21:11
  • I really appreciate your help guys, i'm a student in University and i'm highly interested in learning more of how Java works and some performance but two different answers makes me a bit confused. If @Voicu is right, I still don't know what to do. If @ v.ladynev is right I'm going to change my entire project. – Breus Dec 28 '15 at 21:12
  • 1
    @Voicu, I suggest **you** to check how hash maps work instead. The only case in which O(n^2) may happen is when all the hash codes of all the tokens are the same, which is never a real-world scenario. – user3707125 Dec 28 '15 at 21:12
  • 3
    @TotalCare read the question Jake linked. It's `O(1)` in general, and only `O(n)` in worst-case (bad hashing) scenarios. – azurefrog Dec 28 '15 at 21:13
  • 1
    btw chaps, HashMap has O(lgn) in worst-case, because it uses a TreeMap in the case of high collisions – Sleiman Jneidi Dec 28 '15 at 21:14
  • @user3707125 to nitpick, the hash codes don't need to be the same. Only to lead to the same bucket. But I agree: containsKey() for a HashMap is O(1), and this scenario has almost 0 chance to happen, unless the vocabulary is carefully chosen for this worst case to happen. – JB Nizet Dec 28 '15 at 21:15
  • It should be noted that the worst case in this scenario is if every single item stored in the map hashed to the same bucket, I don't know the exact probability of this but it is so unlikely I wouldn't even think about it. – Jake C. Dec 28 '15 at 21:16
3

Using a Map should dramatically increase performance, as Sleiman Jneidi suggested in his answer. This can be done, however, much more elegantly with Java 8's streaming APIs:

Map<String, Long> frequencies = 
    Arrays.stream(tokens)
          .collect(Collectors.groupingBy(Function.identity(), 
                                         Collectors.counting()));
Mureinik
  • 297,002
  • 52
  • 306
  • 350
  • Interesting. I didn't know about `Function.identity()` - it is a matter of style, though I usually use `UnaryOperator.identity()`. It extends `Function`, so may be used in a context that requires either. For this case, however, it is entirely a matter of opinion. – bcsb1001 Dec 28 '15 at 21:39
  • Thanks for your suggestion, what does this exactly do better in comparison to just making a Map? – Breus Dec 28 '15 at 21:53
  • @TotalCare do you mean in comparison to building the Map yourself? Mainly the fact that you don't have to. Mainly, it reduces the amount of code you need to write, and allows you on the "business logic" of your code, and offloads the boiler-plated part to the JDK. – Mureinik Dec 29 '15 at 09:46
  • @Mureink why do you use long instead of double? – Breus Dec 29 '15 at 10:45
  • Upvoted. An impressive answer for sure, but "concise" and "elegant" are not synonymous :-) – kervin Dec 29 '15 at 11:09
  • @TotalCare `Collectors.counting()` returns a long, so I went with it. Besides, a counter cannot be a floating point number, so unless you intend to have more than 2^63-1 occurrences of the same word, there really is no reason to use a double. – Mureinik Dec 29 '15 at 21:09
2

Instead of using a list for the vocabulary, and another one for the frequencies, I'd use a Map that will store word->frequency. That way you can avoid the double loop which in my mind is what kills your performance.

public Map<String,Integer> countWordFrequencies(String[] tokens) {
    // vocabulary is Map<String,Integer> initialized with all words as keys and 0 as value
    for (String word: tokens)
      if (vocabulary.containsKey(word)) {
        vocabulary.put(word, vocabulary.get(word)+1);
      }
    return vocabulary;
}
Nir Levy
  • 12,750
  • 3
  • 21
  • 38
  • The question doesnt say what is the datatype of vocabulary. – vinay Dec 28 '15 at 21:03
  • @vinay - since he uses `get(int)`, I assume it a list of some sort – Nir Levy Dec 28 '15 at 21:07
  • @NirLevy I used this, now I want to make also maps of the wordFrequencies and wordProbabilities of category, how do I make a map with all the exact Keys and all values are 0? – Breus Dec 29 '15 at 10:57
2

If you don't want to use Java 8 stuff you can try to use MultiSet from guava

v.ladynev
  • 19,275
  • 8
  • 46
  • 67
  • I do want to use anything there is, what can I use from Java 8 in your opinion? – Breus Dec 28 '15 at 21:46
  • 1
    @TotalCare Mureinik's [solution](http://stackoverflow.com/a/34500750/3405171) is the best one. It uses Java 8. – v.ladynev Dec 28 '15 at 21:51