
I have a basic method which reads in ~1000 files with ~10,000 lines each from the hard drive. I also have a structure called userDescription which holds all the "description words" of each user. I have created a HashMap whose type is HashMap<String, HashMap<String, Integer>>, which corresponds to HashMap<eachUserDescriptionWord, HashMap<TweetWord, Tweet_Word_Frequency>>.
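To make the discussion concrete, the structures involved look roughly like this (a sketch only; the exact declarations in my real code differ, and userDescription is keyed by user name, as the lookup in the loop below suggests):

    // Rough shape of the data structures described above (illustrative only)
    Map<String, String[]> userDescription =
            new HashMap<String, String[]>();                 // user -> description words
    HashMap<String, HashMap<String, Integer>> wordsCorr =
            new HashMap<String, HashMap<String, Integer>>(); // description word -> (tweet word -> frequency)
    HashMap<String, Integer> wordsCount =
            new HashMap<String, Integer>();                  // tweet word counts for the current user
    String prevUser = "";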

Each file is organized as: <User=A>\t<Tweet="tweet...">\n <User=A>\t<Tweet="tweet2...">\n <User=B>\t<Tweet="tweet3...">\n ....

My method to do this is:

for (File file : tweetList) {
        if (file.getName().endsWith(".txt")) {
            System.out.println(file.getName());
            BufferedReader in;
            try {
                in = new BufferedReader(new FileReader(file));
                String str;
                while ((str = in.readLine()) != null) {
                    // String split[] = str.split("\t");
                    String split[] = ptnTab.split(str);
                    String user = ptnEquals.split(split[1])[1];
                    String tweet = ptnEquals.split(split[2])[1];
                    // String user = split[1].split("=")[1];
                    // String tweet = split[2].split("=")[1];

                    if (tweet.length() == 0)
                        continue;

                    if (!prevUser.equals(user)) {
                        description = userDescription.get(user);
                        if (description == null)
                            continue;
                        if (prevUser.length() > 0 && wordsCount.size() > 0) {
                            for (String profileWord : description) {
                                if (wordsCorr.containsKey(profileWord)) {
                                    HashMap<String, Integer> temp = wordsCorr
                                            .get(profileWord);
                                    wordsCorr.put(profileWord,
                                            addValues(wordsCount, temp));
                                } else {
                                    wordsCorr.put(profileWord, wordsCount);
                                }
                            }
                        }
                        // wordsCount = new HashMap<String, Integer>();
                        wordsCount.clear();
                    }
                    setTweetWordCount(wordsCount, tweet);
                    prevUser = user;
                }
            } catch (IOException e) {
                System.err.println("Something went wrong: "
                        + e.getMessage());
            }
        }
    }

Here, the method setTweetWordCount counts the word frequencies across all the tweets of a single user. The method is:

private void setTweetWordCount(HashMap<String, Integer> wordsCount,
            String tweet) {

        ArrayList<String> currTweet = new ArrayList<String>(
                Arrays.asList(removeUnwantedStrings(tweet)));

        if (currTweet.size() == 0)
            return;

        for (String word : currTweet) {
            try {
                if (word.equals("") || word.equals(null))
                    continue;
            } catch (NullPointerException e) {
                continue;
            }

            Integer countWord = wordsCount.get(word);
            wordsCount.put(word, (countWord == null) ? 1 : countWord + 1);
        }
    }

The method addValues checks whether wordsCount contains words that are already in the giant HashMap wordsCorr. If it does, it increases the count of that word in the original HashMap wordsCorr.

Now, my problem is that no matter what I do, the program is very, very slow. I ran this version on my server, which has fairly good hardware, but it's been 28 hours and only ~450 files have been scanned. I tried to see if I was doing anything repeatedly that might be unnecessary, and I corrected a few of those things. But the program is still very slow.

Also, I have increased the heap size to 1500m, which is the maximum I can go to.

Is there anything I might be doing wrong?

Thank you for your help!

EDIT: Profiling results. First of all, I really want to thank you guys for the comments. I have changed some things in my program: I now use precompiled regex patterns instead of calling String.split() directly, along with other optimizations. However, after profiling, my addValues method is taking the most time. So here's my code for addValues. Is there something I should be optimizing here? Oh, and I've also changed my startProcess method a bit.

  private HashMap<String, Integer> addValues(
            HashMap<String, Integer> wordsCount, HashMap<String, Integer> temp) {

        HashMap<String, Integer> merged = new HashMap<String, Integer>();

        for (String x : wordsCount.keySet()) {
            Integer y = temp.get(x);
            if (y == null) {
                merged.put(x, wordsCount.get(x));
            } else {
                merged.put(x, wordsCount.get(x) + y);
            }
        }

        for (String x : temp.keySet()) {
            if (merged.get(x) == null) {
                merged.put(x, temp.get(x));
            }
        }
        return merged;
    }

EDIT 2: Even after trying so hard, the program didn't run as expected. I applied all the optimizations to the "slow method" addValues, but it didn't help. So I went down a different path: building a word dictionary and assigning an index to each word first, and only then doing the processing. Let's see where it goes. Thank you for your help!
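For reference, the dictionary idea is roughly the following (a sketch only; the names here are illustrative, not my final code):

    // Assign each distinct word a small integer id, so counts can live in
    // int[] arrays indexed by id instead of nested HashMap<String, Integer> maps
    HashMap<String, Integer> wordIndex = new HashMap<String, Integer>();

    private int indexOf(String word) {
        Integer idx = wordIndex.get(word);
        if (idx == null) {
            idx = wordIndex.size();   // next free index
            wordIndex.put(word, idx);
        }
        return idx;
    }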

– javaCity

6 Answers

2

Two things come to mind:

  • You are using String.split(), which uses a regular expression to do the splitting. That's overkill here. Use one of the many splitXYZ() methods from Apache Commons StringUtils instead (see the sketch after this list).
  • You are probably creating really huge hash maps. With very large hash maps, hash collisions make the HashMap operations much slower. This can be improved by using more widely spread hash values. See an example here: Java HashMap performance optimization / alternative
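
For the first point, a minimal sketch using Commons Lang (variable names mirror the question; this assumes commons-lang3 is on the classpath):

    // StringUtils.split splits on a literal character; no regex is compiled or executed
    String[] split = StringUtils.split(str, '\t');        // instead of str.split("\t")
    String user  = StringUtils.split(split[1], '=')[1];   // instead of split[1].split("=")[1]
    String tweet = StringUtils.split(split[2], '=')[1];
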
– Bananeweizen
  • You're correct. However, if those were the problems, wouldn't the slow execution start around the middle or towards the end of the program? At the beginning, when memory is fresh and there are not many elements in the HashMap, it should be significantly faster. But in my program that is not the case; the speed is quite consistent. In other words, it is slow from the beginning. – javaCity May 22 '12 at 18:41
  • never prematurely optimize - be sure you are optimizing what is the bottleneck. always use a profiler. in this case, there are some obvious performance improvements you could make based on deductive reasoning, but in general, it's best to get empirical evidence – ianpojman May 23 '12 at 03:14
  • I did the profiling. The results are in the main post. Thank you – javaCity May 24 '12 at 07:05
1

One suggestion (I don't know how much of an improvement you'll get from it) is based on the observation that currTweet is never modified. There is no need to create a copy. I.e.

ArrayList<String> currTweet = new ArrayList<String>(
            Arrays.asList(removeUnwantedStrings(tweet)));

can be replaced with

List<String> currTweet = Arrays.asList(removeUnwantedStrings(tweet));

or you can use the array directly (which will be marginally faster). I.e.

String[] currTweet = removeUnwantedStrings(tweet);

Also,

word.equals(null)

is always false by the definition of the contract of equals. The right way to null-check is:

if (null == word || word.equals(""))

Additionally, you won't need that null-pointer-exception try-catch if you do this. Exception handling is expensive when it happens, so if your word array tends to return lots of nulls, this could be slowing down your code.

More generally though, this is one of those cases where you should profile the code and figure out where the actual bottleneck is (if there is a bottleneck) instead of looking for things to optimize ad-hoc.

– trutheality
  • Thank you for your comment. I am using the Linux command line to run the program. Is there any way I can profile it there? – javaCity May 22 '12 at 18:26
  • 2
    You can use VisualVM and attach it to the program while it's running: http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/profiler.html – ianpojman May 22 '12 at 18:27
1

You would gain from a few more optimizations:

  • String.split recompiles the input regex (in string form) to a pattern every time. You should have a single static final Pattern ptnTab = Pattern.compile( "\\t" ), ptnEquals = Pattern.compile( "=" ); and call, e.g., ptnTab.split( str ). The resulting performance should be close to StringTokenizer.
  • word.equals( "" ) || word.equals( null ). Lots of wasted cycles here. If you are actually seeing null words, then you are catching NPEs, which is very expensive. See the response from @trutheality above.
  • You should allocate the HashMap with a very large initial capacity to avoid all the resizing that is bound to happen (see the sketch below).
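
Taken together, the first and third points look roughly like this (the initial capacity is a guess; tune it to your data):

    // Compile the patterns once and reuse them for every line
    private static final Pattern ptnTab    = Pattern.compile("\\t");
    private static final Pattern ptnEquals = Pattern.compile("=");

    // Pre-size the big map so it doesn't rehash repeatedly while it grows
    private final HashMap<String, HashMap<String, Integer>> wordsCorr =
            new HashMap<String, HashMap<String, Integer>>(1 << 16);
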
– Judge Mental
  • his hashmap resize cost will be nothing compared to the pattern compilation and execution, I'm guessing. – ianpojman May 23 '12 at 03:13
0

split() uses regular expressions, which are not "fast". Try using a StringTokenizer or something instead.
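
For example, a rough sketch, assuming the two tab-separated fields shown in the question:

    // Tokenizes on the tab character without any regex machinery
    StringTokenizer tok = new StringTokenizer(str, "\t");
    String userField  = tok.nextToken();   // <User=A>
    String tweetField = tok.nextToken();   // <Tweet="...">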

– ianpojman
  • 1
    I recommend getting a profiler - you can use visualvm (comes with oracle's JDK) or a commercial product like YourKit - it will tell you exactly where your bottleneck is. I'm just guessing because I know from experience compiling and executing patterns is slow. You can precompile the pattern, which would give you a big boost (unless String.split is caching the compiled pattern, I'm not sure) – ianpojman May 23 '12 at 03:12
0

Have you thought about using a database instead of Java? Using the data-load tools that come with the DB, you can load the data into tables and do set-based processing from there. One challenge that I see is loading the data into a table, as the fields are not delimited with a common separator like "'" or ":".

0

You could rewrite addValues like this to make it faster - a few notes:

  • I have not tested the code but I think it is equivalent to yours.
  • I have not tested that it is quicker (but would be surprised if it wasn't)
  • I have assumed that wordsCount is larger than temp; if not, swap them in the code
  • I have also replaced all the HashMaps with Maps, which makes no difference for you but makes the code easier to change later on

private Map<String, Integer> addValues(Map<String, Integer> wordsCount, Map<String, Integer> temp) {

    Map<String, Integer> merged = new HashMap<String, Integer>(wordsCount); // start with everything from wordsCount

    for (Map.Entry<String, Integer> e : temp.entrySet()) {
        Integer countInWords = merged.get(e.getKey()); //the number in wordsCount
        Integer countInTemp = e.getValue();
        int newCount = countInTemp + (countInWords == null ? 0 : countInWords); //the sum
        merged.put(e.getKey(), newCount);
    }
    return merged;
}
– assylias