2

First of all I'm quite a Java beginner, so I'm not sure if this is even possible! Basically I have a huge (3+million) data source of relational data (i.e. A is friends with B+C+D, B is friends with D+G+Z (but not A - i.e. unmutual) etc.) and I want to find every cycle within this (not necessarily connected) directed graph.

I've found the thread Finding all cycles in graph, which has pointed me to Donald Johnson's (elementary) cycle-finding algorithm which, superficially at least, looks like it'll do what I'm after (I'm going to try when I'm back at work on Tuesday - thought it wouldn't hurt to ask in the meanwhile!).

I had a quick scan through the code of the Java implementation of Johnson's algorithm (in that thread) and it looks like a matrix of relations is the first step, so I guess my questions are:

a) Is Java capable of handling a 3+million*3+million matrix? (was planning on representing A-friends-with-B by a binary sparse matrix)

b) Do I need to find every connected subgraph as my first problem, or will cycle-finding algorithms handle disjoint data?

c) Is this actually an appropriate solution for the problem? My understanding of "elementary" cycles is that in the graph below, rather than picking out A-B-C-D-E-F it'll pick out A-B-F, B-C-D etc. but that's not the end of the world given the task.

    E
   / \
  D---F
 / \ / \
C---B---A

d) If necessary, I can simplify the problem by enforcing mutuality in relations - i.e. A-friends-with-B <==> B-friends-with-A, and if really necessary I can maybe cut down the data size, but realistically it is always going to be around the 1mil mark.

z) Is this a P or NP task?! Am I biting off more than I can chew?

Thanks all, any help appreciated! Andy

Community
  • 1
  • 1
Andy
  • 21
  • 2
  • 1
    If you want to find _every_ cycle, then it is certainly not P, since for a complete graph you have more than n! cycles. On the other hand, if you just want to count the cycles (without outputting them), then it is P (and therefore also NP - P is a subset of NP). – Tomer Vromen May 30 '10 at 11:09
  • So do you want to find every subset of vertices where everyone in the subset is friends with everyone else in that subset? Because that problem is called Maximal Clique: http://en.wikipedia.org/wiki/Maximal_clique#Definitions – j_random_hacker May 30 '10 at 11:14
  • 1
    @Tomer: are you sure, that the problem of counting elementary circles in undirected graphs is in P? – jpalecek May 30 '10 at 11:31
  • @jpalecek: No, actually. I may have been confused with counting _all_ cycles of length at most k. – Tomer Vromen May 30 '10 at 11:55
  • @Andy: From your point c) in the question it seems like you can live without actually finding all cycles. So the finding all cycles seems to be a _solution_ to your actual underlying problem, rather than being the problem itself. Maybe you can describe what the problem is (if at all possible). Perhaps people here can find alternatives you hadn't considered... –  May 30 '10 at 14:06
  • @Moron: The full context of the problem is part of a banking fraud solution. I have access to transaction records, and am looking for rings of people transferring money between eachother. Have already subset those pairs of accounts making interesting/significant transfers (whose thresholds can be tightened, hence point D). I wasn't worried about point C because the elementary cycles could always be ran through the same algorithm again to get the 'parent' cycles... – Andy May 30 '10 at 16:14
  • @Andy: Given the context, maybe you might consider looging at circulations instead: http://courses.csail.mit.edu/6.854/06/scribe/s12-minCostFlowAlg.pdf – jpalecek May 30 '10 at 16:19

3 Answers3

2

What you're doing is similar to a very well studied problem in data mining, known as association rule mining or more generally frequent itemset mining. What you can find with frequent itemset mining, is a little bit more specific than what you're doing, but also more useful.

We'll go with closed frequent itemset mining, what this will do is find all groups of friends, where everyone is friends with each other.

I'm going to say this right now, that Java can't do what you want it to do. It can't load that much memory and it's not efficient enough to process that data in any reasonable amount of time, you're going to need to use C/C++. I'd suggest using LCM which is a closed frequent itemset miner, but you're also going to need set the support pretty high, because of the amount of data that you have.

Another thing you might want to consider is reading up on Large Graph Mining, which is also a fairly large area of research, but Java isn't going to cut it. Also you're not going to be able to find all the cycles in your data (unless it is incredibly sparse), there's going to be far too many of them. They're also going to be overlapping and not very meaningful, what you're going to maybe possibly be able to find is several of the largest cycles.

JSchlather
  • 1,564
  • 2
  • 13
  • 22
0

c) Is this actually an appropriate solution for the problem? My understanding of "elementary" cycles is that in the graph below, rather than picking out A-B-C-D-E-F it'll pick out A-B-F, B-C-D etc. but that's not the end of the world given the task.

    E
   / \
  D---F
 / \ / \
C---B---A

No. "Elementary" in the sense of Donald Johnson's paper means simple, that is, no node appears twice in the circle. That means the algorithm won't pick AFBCDBA, but will pick ABCDEF.

d) If necessary, I can simplify the problem by enforcing mutuality in relations - i.e. A-friends-with-B <==> B-friends-with-A, and if really necessary I can maybe cut down the data size, but realistically it is always going to be around the 1mil mark.

Undirected graphs have vector space of (non-simple) cycles (which has a nice basis, etc.), but I don't know if it'll help you.

z) Is this a P or NP task?

It is an enumeration problem, which, by itself, cannot be in P nor NP.

jpalecek
  • 47,058
  • 7
  • 102
  • 144
0

Finding every cycle does not sound reasonable. There will be exponential number of cycles, all overlapping each other.

As for P or NP, the slowest part is actually enumerating each cycle (because there can be so many of them). If there are no cycles, a linear algorithm exists.

Maybe you actually want to divide the graph in biconnected components? See http://en.wikipedia.org/wiki/Biconnected_component

Also it is almost NEVER reasonable to store graphs in matrices. It sounds nice in theory, but does not scale in practice, Use adjacency lists instead ( http://en.wikipedia.org/wiki/Adjacency_list )

Viesturs
  • 1,691
  • 4
  • 21
  • 25