Update
I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.
Update #2
I only have access to the log files generated by the application in question, so I'm constrained to work with the content of the log files; solutions outside that scope are not possible. I have modified the sample data a little. I would like to point out the following key variables.
Thread ID
- Ranges from 0 to 19. A thread is used multiple times, so ScriptExecThread(2) could show up multiple times within the logs.
Script
- Every thread will run a script on a particular file. The same script may run on the same thread more than once, but never on the same thread AND the same file.
File
- Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line won't be executed again. A successful example using the above would look like this:
------START------
Thread(10) starting myscript.script on myfile.file
Thread(10) finished myscript.script on myfile.file
------END-------
An unsuccessful example using the above example would be:
------START------
Thread(10) starting myscript.script on myfile.file
------END------
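To make the thread/script/file rule concrete, here is a minimal sketch that pairs each "starting" line with its "finished" line using the three values as a key. It assumes the line shape shown in the START/END examples above; the class name, method name, and the third file name in `main` are hypothetical and only there for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RunKeySketch {
    // Assumed line shape, copied from the START/END examples:
    // "Thread(10) starting myscript.script on myfile.file"
    private static final Pattern LINE =
        Pattern.compile("Thread\\((\\d+)\\) (starting|finished) (\\S+) on (\\S+)");

    /** Returns the "starting" lines that have no matching "finished" line. */
    static List<String> unfinished(List<String> log) {
        // Key = thread id + script + file; value = the original "starting" line.
        Map<String, String> open = new LinkedHashMap<>();
        for (String line : log) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                continue; // not a start/finish line
            }
            String key = m.group(1) + "|" + m.group(3) + "|" + m.group(4);
            if (m.group(2).equals("starting")) {
                open.put(key, line);
            } else {
                open.remove(key); // the run completed
            }
        }
        return new ArrayList<>(open.values());
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "Thread(10) starting myscript.script on myfile.file",
            "Thread(10) finished myscript.script on myfile.file",
            // hypothetical extra line: a run that never finishes
            "Thread(10) starting myscript.script on otherfile.file");
        for (String line : unfinished(log)) {
            System.out.println(line);
        }
    }
}
```

Because the key is built from the parsed values rather than the whole line, a "starting" line and its "finished" line map to the same entry even though the text of the two lines differs.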
Before addressing my query I'll give a rundown of the code used and the desired behavior.
Summary
I'm currently parsing large log files (100k to 600k lines on average) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request, which seemed to work on paper but not so much in code (I must have missed something blatantly obvious). Note in advance that the code is not optimized in any way; right now I simply want to get it to work.
In this log file you can see that certain threads hang: they started but never finished. The number of possible thread IDs varies. Here is some pseudocode:
REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" // in Java
Set started, finished
for (int i = log.size() - 1; i >= 0; i--) {
    if (group(2).contains("starting"))
        started.add(log.get(i))
    else if (group(2).contains("finished"))
        finished.add(log.get(i))
}
started.removeAll(finished);
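For comparison, here is one hedged variation on the backward scan in the pseudocode. Instead of collecting whole lines into two sets, it looks only at the most recent event for each thread ID and reports the thread when that event is "starting". This is a sketch under the assumption that a thread whose last logged event is "starting" counts as hung; the class and method names are hypothetical, and the regex captures only the digits of the thread ID.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastEventSketch {
    // Same idea as the REGEX above, but group(1) holds only the numeric id.
    private static final Pattern EVENT =
        Pattern.compile("ScriptExecThread\\((\\d+)\\).*?(finished|starting)");

    /** Maps thread id to its last log line, for threads whose last event is "starting". */
    static Map<Integer, String> lastStartWithoutFinish(List<String> log) {
        Set<Integer> seen = new HashSet<>();
        Map<Integer, String> hung = new TreeMap<>();
        // Walk from the end so the first event met per thread is its latest one.
        for (int i = log.size() - 1; i >= 0; i--) {
            Matcher m = EVENT.matcher(log.get(i));
            if (!m.find()) {
                continue; // no thread event on this line
            }
            int id = Integer.parseInt(m.group(1));
            if (!seen.add(id)) {
                continue; // a later event for this thread was already handled
            }
            if (m.group(2).equals("starting")) {
                hung.put(id, log.get(i));
            }
        }
        return hung;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########",
            "09.08 15:06.54, ScriptExecThread(13),Info,########## starting",
            "09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########");
        System.out.println(lastStartWithoutFinish(log).keySet());
    }
}
```

Since each thread is reused, only its latest event says anything about its current state, which is why earlier start/finish pairs are skipped once a thread ID has been seen.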
Search Hung Threads
Set<String> started = new HashSet<String>(), finished = new HashSet<String>();
for (int i = JAnalyzer.csvlog.size() - 1; i >= 0; i--) {
    if (JAnalyzer.csvlog.get(i).contains("ScriptExecThread"))
        JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);
}
started.removeAll(finished);
commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for (String s : started) {
    JLogger.appendLineToConsole(s);
    commonTextArea.append(s + "\n");
}
Has Thread Hung
public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();
    if (m.group(2).contains("starting"))
        started.add(str);
    else if (m.group(2).contains("finished"))
        finished.add(str);
    System.out.println("Started size: " + started.size());
    System.out.println("Finished size: " + finished.size());
    return hasHung;
}
Example Data
ScriptExecThread(1) started on afile.xyz
ScriptExecThread(2) started on bfile.abc
ScriptExecThread(3) started on cfile.zyx
ScriptExecThread(4) started on dfile.zxy
ScriptExecThread(5) started on efile.yzx
ScriptExecThread(1) finished on afile.xyz
ScriptExecThread(2) finished on bfile.abc
ScriptExecThread(3) finished on cfile.zyx
ScriptExecThread(4) finished on dfile.zxy
ScriptExecThread(5) finished on efile.yzx
ScriptExecThread(1) started on bfile.abc
ScriptExecThread(2) started on dfile.zxy
ScriptExecThread(3) started on afile.xyz
ScriptExecThread(1) finished on bfile.abc
END OF LOG
If you examine this, you'll notice that threads 2 and 3 started but failed to finish (the reason doesn't matter; I simply need to get the line).
Sample Data
09.08 15:06.53, ScriptExecThread(7),Info,########### starting
09.08 15:06.54, ScriptExecThread(18),Info,###################### starting
09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########
09.08 15:06.54, ScriptExecThread(13),Info,########## starting
09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########
09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########
09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########
09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########
09.08 15:06.55, ScriptExecThread(19),Info,########## starting
09.08 15:06.55, ScriptExecThread(8),Info,####### starting
09.08 15:06.55, ScriptExecThread(0),Info,##########starting
09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####
09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########
09.08 15:06.55, ScriptExecThread(19),Info,########## starting
09.08 15:06.55, ScriptExecThread(0),Info,########### starting
09.08 15:06.55, ScriptExecThread(9),Info,########## starting
09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########
09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######
09.08 15:06.56, ScriptExecThread(17),Info,###################### starting
09.08 15:06.56, ScriptExecThread(1),Info,########## starting
Currently the code simply displays every line in the log file that contains "starting", which does somewhat make sense when I review the code.
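A small sketch of why that happens, using two lines from the sample data: the sets hold whole log lines, and a "starting" line is never equal to a "finished" line, so `removeAll` has nothing to remove. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class RemoveAllSketch {
    /** Returns how many entries of 'started' survive started.removeAll(finished). */
    static int remainingAfterRemoveAll(Set<String> started, Set<String> finished) {
        Set<String> copy = new HashSet<>(started); // leave the caller's set untouched
        copy.removeAll(finished);
        return copy.size();
    }

    public static void main(String[] args) {
        Set<String> started = new HashSet<>();
        Set<String> finished = new HashSet<>();
        // Same thread, but the two strings differ, so they are different set elements.
        started.add("09.08 15:06.54, ScriptExecThread(13),Info,########## starting");
        finished.add("09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########");
        System.out.println(remainingAfterRemoveAll(started, finished)); // prints 1
    }
}
```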
I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.