3

I have been working on an assignment in that I have to read words from a file and find the longest word and check how many sub words contains in that longest word? this should work for all the words in the file.

I tried using java the code I wrote works for the small amount of data in file but my task is to process huge amount of data.

Example: File words: "call","me","later","hey","how","callmelater","now","iam","busy","noway","nowiambusy"

o/p: callmelater : subwords->call,me,later

In this I'm reading file words storing in linked list and then finding the longest word & removing it from the list then checking how many sub-words extracted word contains.

Main Class Assignment:

import java.util.Scanner;
public class Assignment {
public static void main (String[] args){
    long start = System.currentTimeMillis();;




    Assignment a = new Assignment();
    a.throwInstructions();

    Scanner userInput = new Scanner(System.in);
    String filename = userInput.nextLine();

//  String filename = "ab.txt";
//  String filename = "abc.txt";
    Logic testRun = new Logic(filename);
//  //testRun.result();

    long end = System.currentTimeMillis();;

    System.out.println("Time taken:"+(end - start) + " ms");
}

public void throwInstructions(){
    System.out.println("Keep input file in same directory, where the code is");
    System.out.println("Please specify the fie name : ");
}

Subclass Logic for processing:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class Logic {
private String filename;
private File file;
private List<String> words = new LinkedList<String>();
private Map<String, String> matchedWords = new HashMap();

@Override
public String toString() {
    return "Logic [words=" + words + "]";
}

// constructor
public Logic(String filename) {
    this.filename = filename;
    file = new File(this.filename);
    fetchFile();
    run();
    result();
}

// find the such words and store in map
public void run() {

    while (!words.isEmpty()) {
        String LongestWord = extractLongestWord(words);
        findMatch(LongestWord);
    }
}

// find longest word
private String extractLongestWord(List<String> words) {
    String longWord;
    longWord = words.get(0);
    int maxLength = words.get(0).length();
    for (int i = 0; i < words.size(); i++) {
        if (maxLength < words.get(i).length()) {
            maxLength = words.get(i).length();
            longWord = words.get(i);
        }
    }
    words.remove(words.indexOf(longWord));
    return longWord;
}

// find the match for word in array of sub words
private void findMatch(String LongestWord) {
    boolean chunkFound = false;
    int chunkCount = 0;
    StringBuilder subWords = new StringBuilder();
    for (int i = 0; i < words.size(); i++) {
        if (LongestWord.indexOf(words.get(i)) != -1) {
            subWords.append(words.get(i) + ",");
            chunkFound = true;
            chunkCount++;
        }
    }

    if (chunkFound) {
        matchedWords.put(LongestWord,
                "\t" + (subWords.substring(0, subWords.length() - 1))
                        + "\t:Subword Count:" + chunkCount);
    }
}

// fetch data from file and store in list
public void fetchFile() {
    String word;
    try {
        FileReader fr = new FileReader(file);
        BufferedReader br = new BufferedReader(fr);
        while ((word = br.readLine()) != null) {
            words.add(word);
        }
        fr.close();
        br.close();
    } catch (FileNotFoundException e) {
        // e.printStackTrace();
        System.out
                .println("ERROR: File -> "
                        + file.toString()
                        + " not Exists,Please check filename or location and try again.");
    } catch (IOException e) {
        // e.printStackTrace();
        System.out.println("ERROR: Problem reading -> " + file.toString()
                + " File, Some problem with file format.");
    }

}

// display result
public void result() {
    Set set = matchedWords.entrySet();
    Iterator i = set.iterator();
    System.out.println("WORD:\tWORD-LENGTH:\tSUBWORDS:\tSUBWORDS-COUNT");
    while (i.hasNext()) {
        Map.Entry me = (Map.Entry) i.next();
        System.out.print(me.getKey() + ": ");
        System.out.print("\t" + ((String) me.getKey()).length() + ": ");
        System.out.println(me.getValue());
    }
}
}

This is where my programs lacks and goes into some never ending loop. Complexity of my program is high. To reduce the processing time I need an efficient approach like Binary/merge sort approach which will take least time like O(log n) or O(nlog n).

If someone can help me with this or at least suggestion in which direction I should proceed. Also please suggest me which programming language would be good to implement such text processing tasks in fast way ?

Thanks in advance

Ventura
  • 33
  • 3
  • google `radix tree`. a good data structure is very important. – Jason Hu Mar 20 '15 at 21:03
  • [C++ may actually be slower](http://stackoverflow.com/questions/145110/c-performance-vs-java-c). Of course, it's unlikely he can switch, as it's an assignment. – MirroredFate Mar 20 '15 at 21:25
  • This can be solved by dynamic programming. Sort the words by length. Take the first word. Get first character 1, and then check to see if the rest of letters n-1 can be constructed from list of words. Again take 1 character from n-1 letters and then check if n-2 letters can be formed from list of words. – Sandeep Mar 20 '15 at 21:41
  • @Sandeep sorting the words is overhead. There is no need to sort. It's simpler to track the longest word. – Konsol Labapen Mar 20 '15 at 21:49
  • @KonsolLabapen OP says he wants to find the longest word. – Sandeep Mar 20 '15 at 21:53
  • @Sandeep if all you need is to find the longest item in a huge pile all you have to do is keep one item in your hand, and then for each new item you see compare the two. So instead of an O[ n*log(n)] algorightm, you can do it in O[n]. So in the list {4,7,3,5,9,2}, you start by holding 4. Then you compare with 7. Since 7 is greater you keep 7 instead. Then you compare 7 with 3 then 5. In both cases you keep 7 because it is greater. Then you keep 9 since 7<9 and then again you end up with 9 since 9 > 2. No sorting. The list remains the same. You just tracked the largest you meet so far: 9. – Konsol Labapen Mar 20 '15 at 22:31

2 Answers2

0

Not sure I understand your context, but from reading the problem description it sounds to me like a Linked List is an inappropriate data structure. You don't need to check every single word to the longest word.

A "trie" is probably a perfect data structure for this application.

But if you haven't learned about that in your class, then perhaps you can at least cut down your search space with hashtables. While you are doing the initial list processing calculating the longest word, you can simultaneously process each word into a hash table based on first letter. That way, when you are ready to check your longest word for subwords, you can check only those words with first letters in the longest word. (I'm assuming there could be overlapping words, unlike your example.)

Do you know anything about the input you will be receiving? If you have more details about the input word distribution, then you can customize your solution to the data you expect.

If you can choose your language, and time efficiency is important, you might want to switch to C++, as for many applications it's several times faster than Java.

Summer M
  • 11
  • 2
0

This problem requires a Trie. But you have to augment your trie: a generic one will not do. Geek Viewpoint has a good Trie written in Java. Where your particular work will happen is in the method getWordList. Your getWordList will take as input the longest word (i.e. longestWord) and then try to see if each substring comprises words that exist in the dictionary. I think I have given you enough -- I can't do your work for you. But if you have further question, don't hesitate to ask.

Other than in getWordList, you might be able to pretty much keep the trie from Geek Viewpoint the way it is.

You are also in luck because Geek Viewpoint demonstrates the trie using a Boggle example and your problem is a very very trivial version of Boggle.

Konsol Labapen
  • 2,424
  • 2
  • 15
  • 12
  • The problem is really not that difficult. But if you really aren't seeing it, the key is the `findWord` method: http://www.geekviewpoint.com/java/trie/is_word. So you take a substring `s` of longestWord, and all prefixes of `s` that are words are admissible in your final result set. – Konsol Labapen Mar 20 '15 at 21:53