
I have a very large input file with the following format:

ID \t time \t duration \t Description \t status

The status column can only contain the lower case letters a, s, i, the upper case letters A, S, I, or a mix of the two (sample values in the status column: a, si, I, asi, ASI, aSI, Asi, ...).

The ultimate purpose is to cluster events that start and end at "close enough" times in order to recognize that those events contribute to a bigger event. "Close enough" is determined by a threshold theta, let's say 1 hour for now (it could be 2 hours, or more, etc.). If two events have start times within 1 hour of each other and end times within 1 hour of each other, we cluster them together and say that they are part of a big event.

One other thing is that I only care about events whose status contains all lower case letters.

So below is my initial input processing:

I filter the input file so that it only contains rows whose status is all lower case letters.
This task is already done in Python using MapReduce and Hadoop.
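
For reference, the filter is essentially a map-only job; a minimal Hadoop Streaming style sketch of such a mapper (field positions assumed from the format above, and simplified compared to my actual job) would be:

```python
#!/usr/bin/env python
# Hadoop Streaming mapper (map-only job): keep rows whose status field
# consists of lower case letters only.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        continue  # skip malformed rows
    status = fields[4]
    # "asi" passes; "ASI", "aSI", "Asi" are dropped
    if status.isalpha() and status.islower():
        print(line.rstrip("\n"))
```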

Then I take the output file and split it into "boxes", where each box represents a time range (e.g. 11am-12pm is box 1, 12pm-1pm is box 2, 1pm-2pm is box 3, etc.).

Then I use MapReduce again to sort each box by start time (ascending order).

The final output is a list of events ordered by start time.
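
As an illustration, the boxing step can be a mapper that emits the box index as part of the key; this is only a rough sketch (it assumes the time column is already POSIX seconds and 1-hour boxes, with the per-box sort by start time then handled by Hadoop's sort or in the reducer):

```python
#!/usr/bin/env python
# Hadoop Streaming mapper: assign each event to a 1-hour "box".
import sys

BOX_SECONDS = 3600  # width of one box (1 hour); change for other box sizes

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        continue
    start = int(fields[1])        # assumes the time column is POSIX seconds
    box = start // BOX_SECONDS    # box index: 11am-12pm, 12pm-1pm, ...
    # emit "box <TAB> start <TAB> original row"; grouping by box and sorting
    # by start time gives the ordered per-box output described above
    print("%d\t%d\t%s" % (box, start, line.rstrip("\n")))
```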

Now I want to develop an algorithm to group "similar events" together in the output above. Similar events are determined by start and end time.

My current algorithm is:

Pick the first item in the list.

Find any event in the list whose start time is within 1 hour of the first event's start time and whose duration is within +/- 1 hour of the first item's duration (duration determines the end time).

Then cluster them together (basically I want to cluster events that happen in the same time frame).

If no other event is found, move to the next available event (one that has not been clustered yet).

Keep doing this until there are no more events to cluster.
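
In code, the idea is roughly the following; this is only a sketch, assuming a box has already been loaded as a list of (id, start, duration) tuples sorted by start time, with times and theta in seconds:

```python
# Sketch of the greedy clustering described above.
# events: list of (event_id, start, duration) tuples, sorted by start time.
# theta: "close enough" threshold in seconds (3600 = 1 hour).

def cluster_events(events, theta=3600):
    clusters = []
    used = set()
    for i, (id_i, start_i, dur_i) in enumerate(events):
        if i in used:
            continue  # already assigned to an earlier cluster
        cluster = [id_i]
        used.add(i)
        for j in range(i + 1, len(events)):
            if j in used:
                continue
            id_j, start_j, dur_j = events[j]
            # the list is sorted by start time, so we can stop once the gap
            # in start times exceeds theta
            if start_j - start_i > theta:
                break
            if abs(dur_j - dur_i) <= theta:  # duration within +/- theta
                cluster.append(id_j)
                used.add(j)
        clusters.append(cluster)
    return clusters
```

Because the list is sorted by start time, the inner loop only scans events whose start times fall within theta of the current one, so in practice it should stay well below O(n^2) unless many events share the same start window.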

I'm not sure whether my algorithm will work or whether it's efficient. I'm trying to find an algorithm that is better than O(n^2), so hierarchical clustering won't work. K-means might not work either, since I don't know ahead of time how many clusters I will need.

There could be other clustering algorithms that fit my case better. I think I might need some equation in my algorithm to calculate the distance within a cluster in order to determine similarity. I'm pretty new to this field, so any help directing me to the right path is greatly appreciated.

Thanks a lot in advance.

LKT
  • perhaps the following link might be helpful (but in any case one has to estimate the number of clusters separately) http://stackoverflow.com/questions/7869609/cluster-one-dimensional-data-optimally – John Donn Feb 03 '15 at 11:35
  • a: 10:10 - 10:30; b: 10:14 - 10:34; c: 10:18 - 10:38. Would you cluster a, b and c together? – Alexey Birukov Feb 03 '15 at 13:37
  • @AlexeyBirukov According to my algorithm, I will cluster a and b, but not c. For simplicity, I state the theta = 5 mins, but in fact it could be 1 hour, or more if it makes more sense. I want to cluster similar events together to recognize a big event that all the events happened in. – LKT Feb 03 '15 at 19:43

1 Answer


You could try the DBSCAN density-based clustering algorithm, which is O(n log n) (guaranteed ONLY when using an indexing data structure such as a k-d tree, ball tree, etc.; otherwise it's O(n^2)). Events that are part of a bigger event should lie in areas of high density. It seems to be a great fit for your problem.

You might need to implement your own distance function in order to perform neighborhood queries (to find the nearest events). Also, it would be better to represent event times in POSIX-time format.

Here is an example.

Depending on the environment you use, DBSCAN implementations are available in Java (ELKI), Python (scikit-learn), and R (fpc).
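
For illustration, here is a minimal scikit-learn sketch; the (start, end) representation with a Chebyshev metric is just one way to encode "start times and end times both within theta", and the sample data and eps value are made up:

```python
# Minimal DBSCAN sketch with scikit-learn (illustrative only, not tuned).
import numpy as np
from sklearn.cluster import DBSCAN

theta = 3600  # "close enough" threshold in seconds (1 hour)

# events as (start, duration) in POSIX seconds; end = start + duration
events = np.array([
    [1422955800, 1800],
    [1422956100, 1800],
    [1422970000, 3600],
])
X = np.column_stack((events[:, 0], events[:, 0] + events[:, 1]))  # (start, end)

# Chebyshev distance = max(|start difference|, |end difference|), so
# eps = theta means both start and end must be within theta
db = DBSCAN(eps=theta, metric="chebyshev", min_samples=2).fit(X)
print(db.labels_)  # -1 marks noise; other labels are cluster ids
```

Note that DBSCAN chains neighbors together, so a dense sequence of events can form a cluster whose overall span is larger than theta; if that is not acceptable, a plain threshold-based grouping like the one sketched in the question may match the requirement more closely.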

  • @lurii Thanks for the suggestion. I'm still a little confused: is it correct that DBSCAN will need a distance function that I have to implement based on the algorithm I described above? I'm still very new to DBSCAN with indexing data structures, could you point me to where I can learn more about it? I'm planning to do this project in Python. Btw, I read about OPTICS while looking up DBSCAN; is OPTICS better than DBSCAN in this case? – LKT Feb 05 '15 at 02:00
  • It should be noted that implementation quality varies. The `fpc` version is incredibly slow, for example. The ELKI version appears to be fastest. (ELKI also has OPTICS, so you can try both.) – Has QUIT--Anony-Mousse May 26 '15 at 16:54