Text Search based algorithm not behaving as intended

Question

Update

I've updated the question with newer code suggested by fellow SO users and will be clarifying any ambiguous text that was previously there.

Update #2

I only have access to the log files generated by the application in question. Thus I'm constrained to work within the content of the log files and no solutions out of that scope is quite possible. I'll have modified the sample data a little bit. I would like to point out the following key variables.

Thread ID - Ranges from 0..19 - A thread is used multiple times. Thus ScriptExecThread(2) could show up multiple times within the logs.

Script - Every thread will run a script on a particular file. Once again, the same script may run on the same thread but won't run on the same thread AND file.

File - Every Thread ID runs a Script on a File. If Thread(10) is running myscript.script on myfile.file, then that EXACT line won't be executed again. A successful example using the above example would be something like so.

------START------

Thread(10) starting myscript.script on myfile.file

Thread(10) finished myscript.script on myfile.file

------END-------

An unsuccessful example using the above example would be:

------START------

Thread(10) starting myscript.script on myfile.file

------END------

Before addressing my query I'll give a rundown of the code used and the desired behavior.

Summary

I'm currently parsing large log files (take an average of 100k - 600k lines) and am attempting to retrieve certain information in a certain order. I've worked out the boolean algebra behind my request which seemed to work on paper but no so much on code (I must've missed something blatantly obvious). I would like to inform in advance that the code is not in any shape or form optimized, right now I simply want to get it to work.

In this log file you can see that certain threads hang up if they started but never finished. The number of possible thread IDs ranges. Here is some pseudo code:

    REGEX = "ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)" //in java
    Set started, finished
    for (int i=log.size()-1; i >=0; i--) {
    if(group(2).contains("starting")
        started.add(log.get(i))
    else if(group(2).contains("finished")
        finished.add(log.get(i)    
    }
    started.removeAll(finished);

Search Hung Threads

Set<String> started = new HashSet<String>(), finished = new HashSet<String>();
            
for(int i = JAnalyzer.csvlog.size()-1; i >= 0; i--) {
    if(JAnalyzer.csvlog.get(i).contains("ScriptExecThread")) 
        JUtility.hasThreadHung(JAnalyzer.csvlog.get(i), started, finished);     
}
started.removeAll(finished);
            
commonTextArea.append("Number of threads hung: " + noThreadsHung + "\n");
for(String s : started) { 
    JLogger.appendLineToConsole(s);
    commonTextArea.append(s+"\n");
}

Has Thread Hung

public static boolean hasThreadHung(final String str, Set<String> started, Set<String> finished) {      
    Pattern r = Pattern.compile("ScriptExecThread(\\([0-9]+\\)).*?(finished|starting)");
    Matcher m = r.matcher(str);
    boolean hasHung = m.find();
    
        if(m.group(2).contains("starting"))
            started.add(str);
        else if (m.group(2).contains("finished"))
            finished.add(str);
        
        System.out.println("Started size: " + started.size());
        System.out.println("Finished size: " + finished.size());
        
    return hasHung;
}

Example Data

ScriptExecThread(1) started on afile.xyz

ScriptExecThread(2) started on bfile.abc

ScriptExecThread(3) started on cfile.zyx

ScriptExecThread(4) started on dfile.zxy

ScriptExecThread(5) started on efile.yzx

ScriptExecThread(1) finished on afile.xyz

ScriptExecThread(2) finished on bfile.abc

ScriptExecThread(3) finished on cfile.zyx

ScriptExecThread(4) finished on dfile.zxy

ScriptExecThread(5) finished on efile.yzy

ScriptExecThread(1) started on bfile.abc

ScriptExecThread(2) started on dfile.zxy

ScriptExecThread(3) started on afile.xyz

ScriptExecThread(1) finished on bfile.abc

END OF LOG

If you example this, you'll noticed Threads number 2 & 3 started but failed to finished (reason is not necessary, I simply need to get the line).

Sample Data

09.08 15:06.53, ScriptExecThread(7),Info,########### starting

09.08 15:06.54, ScriptExecThread(18),Info,###################### starting

09.08 15:06.54, ScriptExecThread(13),Info,######## finished in #########

09.08 15:06.54, ScriptExecThread(13),Info,########## starting

09.08 15:06.55, ScriptExecThread(9),Info,##### finished in ########

09.08 15:06.55, ScriptExecThread(0),Info,####finished in ###########

09.08 15:06.55, ScriptExecThread(19),Info,#### finished in ########

09.08 15:06.55, ScriptExecThread(8),Info,###### finished in 2777 #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(8),Info,####### starting

09.08 15:06.55, ScriptExecThread(0),Info,##########starting

09.08 15:06.55, ScriptExecThread(19),Info,Post ###### finished in #####

09.08 15:06.55, ScriptExecThread(0),Info,###### finished in #########

09.08 15:06.55, ScriptExecThread(19),Info,########## starting

09.08 15:06.55, ScriptExecThread(0),Info,########### starting

09.08 15:06.55, ScriptExecThread(9),Info,########## starting

09.08 15:06.56, ScriptExecThread(1),Info,####### finished in ########

09.08 15:06.56, ScriptExecThread(17),Info,###### finished in #######

09.08 15:06.56, ScriptExecThread(17),Info,###################### starting

09.08 15:06.56, ScriptExecThread(1),Info,########## starting

Currently the code simply displays the entire log file with lines started with "starting". Which does somewhat make sense when I review the code.

I have removed any redundant information that I don't wish to display. If there is anything that I might have left out feel free to let me know and I'll add it.

there seems to be no thread in your sample with ID 3; how are you generating that output?. — tucuxi, Apr 27 '15 at 13:16
@tucuxi - The data I provided is simply a simply to give a glimpse of how part of the logs look like. The actual data is 600,000+ lines long — Juxhin, Apr 27 '15 at 13:17
What is purpose of this line: `else if (hasFinished && hasStarted)`? Is it possible for one line of log to contain both "starting" and "finished" message? — Alex Salauyou, Apr 27 '15 at 13:17
@SashaSalauyou - Not exactly, a single line may contain neither (rarely but possible) — Juxhin, Apr 27 '15 at 13:18
@Juxhin I still miss the point. By `hasThreadStarted()` you check if particular line contains "starting" message for particular threadId, by `hasThreadFinished()` you do the same for "finished" message. So may one line contain both messages? — Alex Salauyou, Apr 27 '15 at 13:21
@SashaSalauyou - No the line may only contain one or the either (not both). So I check if it has started, has finished and if neither it means it doesn't contain either or. — Juxhin, Apr 27 '15 at 13:25
I believe this can be made in far more simple manner, where you define a `Set` to hold IDs of started, but not finished yet, threads. On "starting" message, add to the set, on "finished" - remove. Then just print its content. — Alex Salauyou, Apr 27 '15 at 13:26
@Juxhin once again: this statement `else if (hasFinished && hasStarted)` checks if line `j` contains **both** messages for thread `i`, according to execution logic. — Alex Salauyou, Apr 27 '15 at 13:28
@SashaSalauyou - That does sound like an interesting solution actually. Mind providing an example — Juxhin, Apr 27 '15 at 13:28
@Juxhin this is very obvious solution requiring O(n), not O(n*k) operations. — Alex Salauyou, Apr 27 '15 at 13:30
@SashaSalauyou - So it seems, take a look at tucuxi's solution as it seems fairly similar to yours. Mind confirming? — Juxhin, Apr 27 '15 at 13:30
@Juxhin that's because this is the first which comes in mind ;) — Alex Salauyou, Apr 27 '15 at 13:31
I have gone ahead and updated the question to clarify any ambiguous text. — Juxhin, Apr 28 '15 at 09:53
@Juxhin According to your example data, a thread-Id can be reused, once the thread has been finished, is that true? — isnot2bad, May 02 '15 at 12:29
@isnot2bad - Yes, the main variables are the `Thread ID`, `Script` and `File` like so - `Thread(X) example.script starting on file.abc`. The same exact script won't be performed more than once on a file however the Thread ID will be using over and over on multiple scripts and files. — Juxhin, May 02 '15 at 12:43
@Juxhin So it is not sufficient to just collect a set of started threads and compare it to a set of stopped threads as in case a thread has been started twice but stopped only once it would appear in both sets. — isnot2bad, May 02 '15 at 12:46
@isnot2bad - Indeed, I was thinking (this is purely theoretical, I won't be able to work on this till the next few days) to map the line like so for both started and finished, then I compare both ID and Script name with the file name, if they all match then the two lines (for both start and finished) can be safely removed from the final result. — Juxhin, May 02 '15 at 12:51

tucuxi · Answer 1 · 2015-04-30T12:59:39.573

If I understand correctly, you have large files and are trying to find patterns of the form "X started (but no mention of X finished)" for all numerical values of X.

If I were to do this, I would use this pseudocode:

Pattern p = Pattern.compile(
   "ScriptExecThread\\(([0-9]+).*?(finished|started)");
Set<Integer> started, finished;
Search for p; for each match m,
     int n = Integer.parseInt(m.group(1));
     if (m.group(2).equals("started")) started.add(n);
     else finished.add(n);
started.removeAll(finished); // found 'em: contains started-but-not-finished

This requires a single regex pass over each file and an O(size-of-finished) set substraction; it should be 20x faster than your current approach. The regex would use optional (|) matching to look for both alternatives at once, reducing the number of passes.

Edit: made regex explicit. Compiling the regex once instead of once-per-line should shave off some extra run-time.

Edit 2: implemented pseudocode, works for me

Edit 3: replaced implementation to show file & line. Reduces memory requirements (does not load whole file into memory); but printing the line does require all "start" lines to be stored.

public class T {

    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(   
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> started = new HashMap<Integer, String>();
        Set<Integer> finished = new HashSet<Integer>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting")) started.put(n, d);
                else finished.add(n);
            }                
        }
        for (int f : finished) {
            started.remove(f);
        }
        return started.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

Input: sample data above, as the first argument, called "data.txt". The same data in another file called "data2.txt" as the second argument (javac T.java && java T data.txt data2.txt). Output:

data.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished
data2.txt: '    09.08 15:06.54, ScriptExecThread(18),Info,###################### starting' unfinished
data2.txt: '    09.08 15:06.53, ScriptExecThread(7),Info,########### starting' unfinished

Another user indeed suggested the same approach (using Set). Could I kindly ask for a more elaborate answer? (The pseudo makes sense) — Juxhin, Apr 27 '15 at 13:29
@tucuxi - Could you take another look at the updated question? — Juxhin, Apr 30 '15 at 10:49
When implementing the pseudocode, you appear to have ignored the loop, and added strings instead of matched-numbers. Also, you return a boolean, instead of a set of numbers identifying which threads have started but not finished. — tucuxi, Apr 30 '15 at 11:32
Thanks @tucuxi for your time. I'll attempt this shortly and keep you updated. — Juxhin, Apr 30 '15 at 11:43
@tucuxi - This does indeed work however I would rather get the entire line that shows the thread hung which is why I was making use of `String`. As a thread ID can be used many times throughout the entire log. I'm trying to modify your current snippet to see if I can snatch which thread was hung. I'm thinking of pairing the ID with the file name. If that ID on that File has started but shows no trace of finishing, then it has hung and I can just append the line that caused this. — Juxhin, Apr 30 '15 at 12:00
I hope that each log uses unique IDs. Had some fun making the code more efficient, and printing actual lines; enjoy. I will not be fixing the code further. — tucuxi, Apr 30 '15 at 13:02
@tucuxi - I didn't quite get your last point, I replicated your code but still didn't get the correct output (possibly because it's a larger file). I'll examine the logs later next week and rewrite the entire thing. I thank you for your time. — Juxhin, May 01 '15 at 16:14
You commented that some IDs were reused. That would mean that the program would not work right if they are reused within the same log-file. You can easily add checks to verify if IDs were reused (for example, if the same ID starts more than once, or finishes more than once) — tucuxi, May 01 '15 at 21:10
Well a thread Id runs multiple scripts on multiple files so I was going to pair the > or other way round but won't be working on it till next Wednesday. If you do want a proper log I wouldn't mind sending it privately so that you can see what I mean exactly. But you have done enough already. (Sent from phone) — Juxhin, May 02 '15 at 09:55

score 2 · Accepted Answer · answered May 02 '15 at 21:06

Keeping two separate sets of started and finished threads (as described by @tucuxi) can't work. If a thread with ID 5 starts, runs, and finishes, then 5 will appear in the finished set, forever. If another thread with ID 5 starts, and hangs, it won't be reported.

Suppose, though, for a moment, that thread IDs aren't reused. Every thread ever created receives a new ID. By keeping separate started and finished sets, you'll have hundreds of thousands of elements in each by the time you're done. Performance on those data structures is proportional to what they've got in them at the time of the operation. It's unlikely that performance will matter for your use case, but if you were performing more expensive operations, or running on data 100 times larger, it might.

Preamble out of the way, here is a working version of @tucuxi's code:

import java.util.*;
import java.io.*;
import java.util.regex.*;

public class T {
    public static Collection<String> findHung(Iterable<String> data) {
        Pattern p = Pattern.compile(
            "ScriptExecThread\\(([0-9]+).*?(finished|starting)");
        HashMap<Integer, String> running = new HashMap<Integer, String>();
        for (String d : data) {
            Matcher m = p.matcher(d);
            if (m.find()) {
                int n = Integer.parseInt(m.group(1));
                if (m.group(2).equals("starting"))
                    running.put(n, d);
                else
                    running.remove(n);
            }
        }
        return running.values();
    }

    static Iterable<String> readFile(String path, String encoding) throws IOException {
        final Scanner sc = new Scanner(new File(path), encoding).useDelimiter("\n");
        return new Iterable<String>() {
            public Iterator<String> iterator() { return sc; }
        };
    }

    public static void main(String[] args) throws Exception {
        for (String fileName : args) {
            for (String s : findHung(readFile(fileName, "UTF-8"))) {
                System.out.println(fileName + ": '" + s + "' unfinished");
            }
        }
    }
}

Note that I've dropped the finished set, and the HashMap is now called running. When new threads start, they go in, and when the thread finishes, it is pulled out. This means that the HashMap will always be the size of the number of currently running threads, which will always be less than (or equal) to the total number of threads ever run. So the operations on it will be faster. (As a pleasant side effect, you can now keep track of how many threads are running on a log line by log line basis, which might be useful in determining why the threads are hanging.)

Here's a Python program I used to generate huge test cases:

#!/usr/bin/python

from random import random, choice
from datetime import datetime
import tempfile

all_threads = set([])
running = []
hung = []
filenames = { }

target_thread_count = 16
hang_chance = 0.001

def log(id, msg):
    now = datetime.now().strftime("%m:%d %H:%M:%S")
    print "%s, ScriptExecThread(%i),Info,%s" % (now, id, msg)

def new_thread():
    if len(all_threads)>0:
        for t in range(0, 2+max(all_threads)):
            if t not in all_threads:
                all_threads.add(t)
                return t
    else:
        all_threads.add(0)
        return 0

for i in range(0, 100000):
    if len(running) > target_thread_count:
        new_thread_chance = 0.25
    else:
        new_thread_chance = 0.75
        pass

    if random() < new_thread_chance:
        t = new_thread()
        name = next(tempfile._get_candidate_names())+".txt"
        filenames[t] = name
        log(t, "%s starting" % (name,))
        if random() < hang_chance:
            hung.append(t)
        else:
            running.append(t)
    elif len(running)>0:
        victim = choice(running)
        all_threads.remove(victim)
        running.remove(victim)
        log(t, "%s finished" % (filenames[victim],))

Very promising, thanks. Will be testing this out and keeping you updated on this on Wednesday. — Juxhin, May 03 '15 at 10:49
I thought of the same optimization (keeping only a set of running threads), but in the OP's examples, cases of "finished, started" were marked as "not hung", although this code would mark them as "hung". A more sophisticated version would keep a window, so that "finished, started" would be interpreted as "not hung" only if within a certain interval; and would signal many other log-file inconsistencies. — tucuxi, May 03 '15 at 11:12
This isn't an optimization; its the way you implement it correctly: While the question-provided code might describe "finished, started" as "not hung", @Juxhin states that the current code doesn't work. In the section labeled "Example Data", a case of "started, finished, started" is described as "failed to finish". Finally, the software generating the log files would have to be simply bizarre for "started, finished, started" to count as finished. That it will happen to be faster is an extremely pleasant, and (as I point out) largely irrelevant side effect. — Jay Kominek, May 03 '15 at 13:13
@JayKominek - The software obviously does it's best to avoid any hung threads. The only time threads hang is when the application crashes unexpectedly, thus I am writing this analytic tool in order to figure our which threads failed to finish before the application crashed so that I can debug further. — Juxhin, May 03 '15 at 16:00
@JayKominek - Seems to be working great, going to test it properly and verify. Thanks :-) — Juxhin, May 06 '15 at 09:42

score 1 · Answer 3 · answered May 05 '15 at 20:44

The removeAll will never work.
The hasThreadHung is storing the entire string.
So the values in started will never be matched by values in finished.

You want to do something like:

class ARecord {
    // Proper encapsulation of the members omitted for brevity
    String thread;
    String line;
    public ARecord (String thread, String line) {
        this.thread = thread;
        this.line = line;
    }
    public int hashcode() {
        return thread.hashcode();
    }
    public boolean equals(ARecord o) {
        return thread.equals(o.thread);
    }
}

Then in hasHungThread, create an ARecord and add that to the Sets.
Ex:

started.add(new ARecord(m.group(2), str));

In searchHungThreads you will retrieve the ARecord from the started and output it as:

for(ARecord rec : started) { 
    JLogger.appendLineToConsole(rec.line);
    commonTextArea.append(rec.line+"\n");
}

Yea reviewing my posted code doesn't make all to much sense now that I think about it. I will be attempting both yours and Jay Kominek's code shortly and keep you updated — Juxhin, May 06 '15 at 06:58

score 0 · Answer 4 · answered May 02 '15 at 19:50

0

Why not solve the problem in another way. If all you want is hung threads, can take thread stack programmatically. Can use an external tool also but internal to own JVM I presume would be easiest. Then expose that as an API or save a date-time stamped file regularly with thread dump. Another program just needs to analyze the thread dumps. If same thread is at same spot (same stack trace or over same 3-5 functions) over thread dumps its a good chance its hung.

There are tools that help you analyze https://www.google.com/search?q=java+thread+dump+tool+open+source

answered May 02 '15 at 19:50

tgkprog

4,493
4
41
70

I thank you for the input as it indeed an intuitive way (fun too!) around the problem. However the logs are generated on multiple other computers which I would then obtain. I do not have access to the application itself so I can't modify it to automatically detect hung threads like you've mentioned. I am simply parsing .csv log files and attempting to detect patterns which determine whether a thread has hung which would signify why the application may have crashed (imagine a thread running a single script on a single file for 48+ hours!) – Juxhin May 02 '15 at 19:55
There are tools that can get stack traces too. On linix just kill -9 or jstack on any JRE http://stackoverflow.com/questions/4876274/kill-3-to-get-java-thread-dump – tgkprog May 02 '15 at 19:57
Take a look at the update I just added. I do not wish to over-do this. This is merely an add-on on the current analytic application that I am writing which aims to also detect hung threads during the applications runtime. – Juxhin May 02 '15 at 20:05