-2

I want to check if a sentence contains a word from a list of words mapped to a category. So i have a class KeyValue.java with words, category names and a method filterCategory to check if it contains the word. Now i have a 10,000 keywords mapped different categories for the text. But the trouble is it is way to slow. Can you suggest some alternate methods to speed up the classification. Thanks for the help.

public class KeyValue {
private String key;
private String value;

public KeyValue(String key, String value) {
    this.key = key;
    this.value= value;
}
public KeyValue() {
}
public String getKey() {
    return key;
}
public void setKey(String key) {
    this.key = key;
}
public String getValue() {
    return value;
}
public void setValue(String value) {
    this.value = value;
}

Classification.java

class Classification
{

private static List<KeyValue> keyMap = new ArrayList<KeyValue>();

static{
    getWordMap();
}

public static List<KeyValue> getWordMap()
{           
    if(keyMap.size()==0)
    {
        keyMap.add(new KeyValue("sports","football"));
        keyMap.add(new KeyValue("sports","basketball"));
        keyMap.add(new KeyValue("sports","olympics"));
        keyMap.add(new KeyValue("sports","cricket"));
        keyMap.add(new KeyValue("sports","t20"));
    }
}

public static KeyValue filterCategory(String filteredText)
{               
    KeyValue kv = null;

    for(KeyValue tkv:keyMap)
    {
        String value = tkv.getValue();
        String lc = filteredText.toLowerCase();
        lc = FormatUtil.replaceEnglishSymbolsWithSpace(lc);//remove symbols with space and then normalizes it

        String lastWord="";
        if(lc.contains(" "))
        {
            lastWord = lc.substring(lc.lastIndexOf(" ")+1);

            if(lc.startsWith(value+" ") || lc.contains(" "+value+" ") || value.equals(lastWord))
            {
                kv = new KeyValue(tkv.getKey(), tkv.getValue());
                break;
            }               
        }
        else if(lc.contains(value))
        {
            kv = new KeyValue(tkv.getKey(), tkv.getValue());
            break;              
        }
    }

    if(kv==null)
    {
        return new KeyValue("general","0");
    }
    else 
    {
        kv.setValue("100");
        return kv;
    }
}
}
akay
  • 27
  • 9
  • Look into Guava's `Multimap`, as it's a good ADT to use for this. – Jacob G. Nov 08 '17 at 14:40
  • I am not sure it is going to improve the performance in any way. – akay Nov 08 '17 at 14:52
  • A quick test gave me quite an improvement when I moved the `lowercase` (and FormatUtil call) in front of the for loop. You do not have to do it for every key. – Malte Hartwig Nov 08 '17 at 15:22
  • Your right. I screwed up. – akay Nov 08 '17 at 15:35
  • You can even do the checks for `lastWord` and `contains(" ")` before the loop already, and then loop one way or the other, without checking again and again. Think of helper methods `containsKey(lc)` and `equalWord(lc, lastWord)`, in which you loop the keys. – Malte Hartwig Nov 08 '17 at 15:39

3 Answers3

0

Your implementation is sound but uses an Exhaustive or Brute-Force Search algorithm with your KeyValue object instead of a faster matching algorithm such as Hashing with a HashMap or Hashtable object.

Assumptions

  • You have 10,000 mapped words.
  • You are attempting to match those words against an English sentence or phrase such as "The quick brown fox jumps over the lazy dog"

The Problem

Your logic, as it is written, will perform a brute-force search attempting a possible 10,000 matches for each word in your sentence. Using the phrase given above will create (10,000) x (9) = 90,000 maximum attempts if each word in the sentence does not exist within your KeyValue object.

This logic creates a worst-case, or Big-O, performance hit of Θ(n) where n represents the number of words in your list. This is called a linear search. A lazy improvement to this method would be to use a sorted list, granting you a better Θ(log(n)) logarithmic search time.

The Fix

Instead of performing your brute-force search, use a hashing algorithm that will perform lookups on whole words at a time; or, if you want to perform pattern matching with character shifting, look at the Rabin—Karp Hash Algorithm. In the simplified case of just matching whole words, your algorithm will break down your sentence's words into tokens (like you have now), and then use a hash function lookup against your hashmap of values and associated categories.

Your new logic will carry a Big-O performance of Θ(1). This constant-time matching will greatly improve the speed of your application.

Pseudocode

// Adapting your KeyValue into a simple <Value, Key> map e.g. <"football", "sports">
//HashMap<String, String> map = new HashMap<String, String>();

// Adapting your KeyValue into a <Value, Set<Key>> map for multiple 
//    category keys e.g. <"football", <"sports","sunday","games">>
HashMap<String, Set<String>> map = new HashMap<String, Set<String>>();

// build the hashmap with your values and categories
Set<String> categories = new HashSet<String>();
categories.add("sports");
categories.add("sunday");
categories.add("games");
map.put("football", categories);
...

// sanitize your input
String lc = filteredText.toLowerCase();
lc = FormatUtil.replaceEnglishSymbolsWithSpace(lc);

// tokenize your sentence
String[] tokens = lc.split("\\s");
...

// search tokens against your hashmap
for (String token : tokens) {

    // search the token against the hashmap
    if (map.containsKey(token)){
        Set<String> cats = map.get(token);
        ...
    } else {
        ...
    }
}
outkst
  • 11
  • 3
  • This is not 0(1) . We are still going to loop through the tokens and depending upon the sentence. It might be newspaper article with lots of tokens. – akay Nov 09 '17 at 03:46
  • @akay The act of matching a word in the sentence against the HashMap of tokens becomes **Θ(1)** instead of **Θ(n)**. This is because the hash function is a constant-time lookup compared to the linear attempt of matching against each and every token in the list until found (or not found). Can you please explain further if there is something I am missing? – outkst Nov 15 '17 at 22:01
  • We have 10,000 words mapped to 100 categories. To find whether a sentence contains that keyword or not you have to check every word against 10,000 in the worst case. How can this be 0(1) ? If you can post any code that will work faster than the code i have posted below it will help alot. – akay Nov 16 '17 at 09:44
-1

I don't understand why you do not use Java's util.Map for this concern, but i advise you to iterate use :

lc = FormatUtil.replaceEnglishSymbolsWithSpace(lc);
String result= Arrays.stream(lc.split(" ")).filter(s -> s.equals(value)).findFirst().orElse("");
            if(result.length()>0) {
                kv = tkv;
            }
Ben Sch
  • 104
  • 1
  • 6
  • Op's condition for a match is different. In case there are multiple words, they only match if they **equal**. Otherwise it matches on **contains** already. – Malte Hartwig Nov 08 '17 at 15:23
-1

Based on the suggestions i am posting the fastest code i could come up with.

KeyValue based List has been modified into simple HashMap

private static HashMap<String,String> map = new HashMap<String,String>();

Thanks for the suggestions. It's now scalable to be put into production.

public static KeyValue filterCategory(String filteredText)
{               
    KeyValue kv = null;
    filteredText = filteredText.toLowerCase();
    filteredText = FormatUtil.replaceEnglishSymbolsWithSpace(filteredText);

    StringTokenizer tokenizer = new StringTokenizer(filteredText);
    while(tokenizer.hasMoreTokens()) {
        String temp = tokenizer.nextToken();
        if(map.containsKey(temp))
        {
            kv = new KeyValue(map.get(temp),"100");
            break;
        }
    }       
    if(kv==null)
    {
        kv= new KeyValue("general","0");
    }
    return kv;
}
akay
  • 27
  • 9