I have a basic method which reads in ~1000 files with ~10,000 lines each from the hard drive. Also, I have an array of String called userDescription which has all the "description words" of the user. I have created a HashMap whose data structure is HashMap<String, HashMap<String, Integer>>, which corresponds to HashMap<eachUserDescriptionWords, HashMap<TweetWord, Tweet_Word_Frequency>>.
The file is organized as:
<User=A>\t<Tweet="tweet...">\n
<User=A>\t<Tweet="tweet2...">\n
<User=B>\t<Tweet="tweet3...">\n
....
My method to do this is:
    for (File file : tweetList) {
        if (file.getName().endsWith(".txt")) {
            System.out.println(file.getName());
            BufferedReader in;
            try {
                in = new BufferedReader(new FileReader(file));
                String str;
                while ((str = in.readLine()) != null) {
                    // String split[] = str.split("\t");
                    String split[] = ptnTab.split(str);
                    String user = ptnEquals.split(split[1])[1];
                    String tweet = ptnEquals.split(split[2])[1];
                    // String user = split[1].split("=")[1];
                    // String tweet = split[2].split("=")[1];
                    if (tweet.length() == 0)
                        continue;
                    if (!prevUser.equals(user)) {
                        description = userDescription.get(user);
                        if (description == null)
                            continue;
                        if (prevUser.length() > 0 && wordsCount.size() > 0) {
                            for (String profileWord : description) {
                                if (wordsCorr.containsKey(profileWord)) {
                                    HashMap<String, Integer> temp = wordsCorr
                                            .get(profileWord);
                                    wordsCorr.put(profileWord,
                                            addValues(wordsCount, temp));
                                } else {
                                    wordsCorr.put(profileWord, wordsCount);
                                }
                            }
                        }
                        // wordsCount = new HashMap<String, Integer>();
                        wordsCount.clear();
                    }
                    setTweetWordCount(wordsCount, tweet);
                    prevUser = user;
                }
            } catch (IOException e) {
                System.err.println("Something went wrong: "
                        + e.getMessage());
            }
        }
    }
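The loop uses ptnTab and ptnEquals without showing their declarations. A minimal sketch of what they presumably look like is below. The class name TweetLineParser, the helper parseLine, and the field indices 0 and 1 are assumptions based on the two-column format shown earlier; the loop itself reads indices 1 and 2, so the real files may carry an extra leading column.

```java
import java.util.regex.Pattern;

class TweetLineParser {
    // Compiled once and reused for every line, instead of recompiling per split.
    static final Pattern ptnTab = Pattern.compile("\t");
    static final Pattern ptnEquals = Pattern.compile("=");

    /**
     * Hypothetical helper: returns {user, tweet}, or null if the line
     * does not have the assumed "User=...<TAB>Tweet=..." shape.
     */
    static String[] parseLine(String line) {
        String[] fields = ptnTab.split(line);
        if (fields.length < 2) {
            return null;
        }
        // Limit 2 keeps any '=' inside the tweet text intact.
        String[] user = ptnEquals.split(fields[0], 2);
        String[] tweet = ptnEquals.split(fields[1], 2);
        if (user.length < 2 || tweet.length < 2) {
            return null;
        }
        return new String[] { user[1], tweet[1] };
    }
}
```

Checking the array lengths before indexing avoids an ArrayIndexOutOfBoundsException on malformed lines, which the loop as posted does not guard against.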
The method setTweetWordCount, called in the loop above, counts the word frequency of all the tweets of a single user. The method is:
    private void setTweetWordCount(HashMap<String, Integer> wordsCount,
            String tweet) {
        ArrayList<String> currTweet = new ArrayList<String>(
                Arrays.asList(removeUnwantedStrings(tweet)));
        if (currTweet.size() == 0)
            return;
        for (String word : currTweet) {
            try {
                if (word.equals("") || word.equals(null))
                    continue;
            } catch (NullPointerException e) {
                continue;
            }
            Integer countWord = wordsCount.get(word);
            wordsCount.put(word, (countWord == null) ? 1 : countWord + 1);
        }
    }
The method addValues checks whether wordsCount contains words that are already in the giant HashMap wordsCorr. If it does, it increases the count of those words in the original HashMap wordsCorr.
Now, my problem is that no matter what I do, the program is very slow. I ran this version on my server, which has fairly good hardware, but it has been 28 hours and only ~450 files have been scanned. I looked for work that was being repeated unnecessarily and fixed a few such cases, but the program is still very slow.
Also, I have increased the heap size to 1500m which is the maximum that I can go.
Is there anything I might be doing wrong?
Thank you for your help!
EDIT: Profiling Results
First of all, I really want to thank you guys for the comments. I have changed a few things in my program: I now use precompiled regex patterns instead of calling String.split() directly, along with some other optimizations. However, after profiling, my addValues method is taking the most time. So, here's my code for addValues. Is there something that I should be optimizing here? Oh, and I've also changed my startProcess method a bit.
    private HashMap<String, Integer> addValues(
            HashMap<String, Integer> wordsCount, HashMap<String, Integer> temp) {
        HashMap<String, Integer> merged = new HashMap<String, Integer>();
        for (String x : wordsCount.keySet()) {
            Integer y = temp.get(x);
            if (y == null) {
                merged.put(x, wordsCount.get(x));
            } else {
                merged.put(x, wordsCount.get(x) + y);
            }
        }
        for (String x : temp.keySet()) {
            if (merged.get(x) == null) {
                merged.put(x, temp.get(x));
            }
        }
        return merged;
    }
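For comparison, addValues as posted copies every entry of both maps into a new HashMap on each call, so its cost grows with the size of the accumulated map. A minimal sketch of an in-place alternative follows, assuming Java 8 or later for Map.merge; the class name InPlaceMerge and the method name addInto are hypothetical, not part of the program above.

```java
import java.util.Map;

class InPlaceMerge {
    /**
     * Adds every count in source to the matching entry of target.
     * Only target is modified; source is left untouched.
     */
    static void addInto(Map<String, Integer> target, Map<String, Integer> source) {
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            // Inserts the value if the key is absent, otherwise sums old and new.
            target.merge(e.getKey(), e.getValue(), Integer::sum);
        }
    }
}
```

Here the work per call is proportional to the size of source (the per-user counts) only. If something like this were used in the loop, the else branch would probably need to store a copy such as new HashMap<String, Integer>(wordsCount) rather than wordsCount itself, because wordsCount is cleared right afterwards.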
EDIT2: Even after trying so hard, the program didn't run as expected. I applied all the suggested optimizations to the "slow method" addValues, but it didn't help. So I took a different path: first creating a word dictionary that assigns an index to each word, and then doing the processing. Let's see where it goes. Thank you for your help!
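The dictionary idea could look roughly like the sketch below. It is only an illustration under my own assumptions: the class name WordDictionary and its methods are hypothetical, and indices are simply handed out in order of first appearance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordDictionary {
    private final Map<String, Integer> indexOf = new HashMap<String, Integer>();
    private final List<String> words = new ArrayList<String>();

    /** Returns the index of word, assigning the next free index on first sight. */
    int idFor(String word) {
        Integer id = indexOf.get(word);
        if (id == null) {
            id = words.size();
            indexOf.put(word, id);
            words.add(word);
        }
        return id;
    }

    /** Reverse lookup: the word that was assigned the given index. */
    String wordFor(int id) {
        return words.get(id);
    }

    /** Number of distinct words seen so far. */
    int size() {
        return words.size();
    }
}
```

With words mapped to small integers, the per-word counts could then be kept in int arrays or integer-keyed maps instead of HashMap<String, Integer>.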