3

I am writing a piece of code which models the evolution of a social network. The idea is that each person is assigned to a node and relationships between people (edges on the network) are given a weight of +1 or -1 depending on whether the relationship is friendly or unfriendly.

Using this simple model you can say that a triad of three people is either "balanced" or "unbalanced" depending on whether the product of the edges of the triad is positive or negative.

So finally what I am trying to do is implement an ising type model. I.e. Random edges are flipped and the new relationship is kept if the new network has more balanced triangels (a lower energy) than the network before the flip, if that is not the case then the new relationship is only kept with a certain probability.

Ok so finally onto my question: I have written the following code, however the dataset I have contains ~120k triads, as a result it will take 4 days to run!

Could anyone offer any tips on how I might optimise the code?

Thanks.

#Importing required librarys

try:
    import matplotlib.pyplot as plt
except:
    raise

import networkx as nx
import csv
import random
import math

def prod(iterable):
    p= 1
    for n in iterable:
        p *= n
    return p


def Sum(iterable):
    p= 0
    for n in iterable:
        p += n[3]
    return p


def CalcTriads(n):  
    firstgen=G.neighbors(n)
    Edges=[]
    Triads=[]

    for i in firstgen:
        Edges.append(G.edges(i))

    for i in xrange(len(Edges)):
        for j in range(len(Edges[i])):# For node n go through the list of edges (j) for the neighboring nodes (i) 
            if set([Edges[i][j][1]]).issubset(firstgen):# If the second node on the edge is also a neighbor of n (its in firstgen) then keep the edge.
                t=[n,Edges[i][j][0],Edges[i][j][1]]
                t.sort()
                Triads.append(t)# Add found nodes to Triads.

    new_Triads = []# Delete duplicate triads.
    for elem in Triads:
        if elem not in new_Triads:
            new_Triads.append(elem)
    Triads = new_Triads 

    for i in xrange(len(Triads)):# Go through list of all Triads finding the weights of their edges using G[node1][node2]. Multiply the three weights and append value to each triad.
            a=G[Triads[i][0]][Triads[i][1]].values()
            b=G[Triads[i][1]][Triads[i][2]].values()
            c=G[Triads[i][2]][Triads[i][0]].values()
            Q=prod(a+b+c)
            Triads[i].append(Q)

    return Triads



###### Import sorted edge data ######       
li=[]
with open('Sorted Data.csv', 'rU') as f:
    reader = csv.reader(f)
    for row in reader:
        li.append([float(row[0]),float(row[1]),float(row[2])])
G=nx.Graph()
G.add_weighted_edges_from(li)


for i in xrange(800000):
    e = random.choice(li)   # Choose random edge

    TriNei=[]
    a=CalcTriads(e[0]) # Find triads of first node in the chosen edge 
    for i in xrange(0,len(a)):
        if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
            TriNei.append(a[i])         
    preH=-Sum(TriNei) # Save the "energy" of all the triads of which the edge is a member


    e[2]=-1*e[2]# Flip the weight of the random edge and create a new graph with the flipped edge   
    G.clear()
    G.add_weighted_edges_from(li)


    TriNei=[]
    a=CalcTriads(e[0])  
    for i in xrange(0,len(a)):
        if set([e[1]]).issubset(a[i]):
            TriNei.append(a[i])
    postH=-Sum(TriNei)# Calculate the post flip "energy".   

    if postH<preH:# If the post flip energy is lower then the pre flip energy keep the change
        continue

    elif random.random() < 0.92: # If the post flip energy is higher then only keep the change with some small probability. (0.92 is an approximate placeholder for exp(-DeltaH)/exp(1) at the moment)
        e[2]=-1*e[2]
iRoygbiv
  • 865
  • 2
  • 7
  • 21
  • 2
    run a profiler (http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) and find where the code spend most of the executing time - so you can improve it. – darlinton Jul 17 '11 at 18:24

4 Answers4

5

The following suggestions won't boost your performance that much because they are not on the algorithmic level, i.e. not very specific to your problem. However, they are generic suggestions for slight performance improvements:

Unless you are using Python 3, change

for i in range(800000):

to

for i in xrange(800000):

The latter one just iterates numbers from 0 to 800000, the first one creates a huge list of numbers and then iterates that list. Do something similar for the other loops using range.

Also, change

j=random.choice(range(len(li))) 
e=li[j] # Choose random edge

to

e = random.choice(li)

and use e instead of li[j] subsequently. If you really need a index number, use random.randint(0, len(li)-1).

Oben Sonne
  • 9,893
  • 2
  • 40
  • 61
4

There are syntactic changes you can make to speed things up, such as replacing your Sum and Prod functions with the built-in equivalents sum(x[3] for x in iterable) and reduce(operator.mul, iterable) - it is generally faster to use builtin functions or generator expressions than explicit loops.

As far as I can tell the line:

    if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)

is testing if a float is in a list of floats. Replacing it with if e[1] in a[i]: will remove the overhead of creating two set objects for each comparison.

Incidentally, you do not need to loop through the index values of an array, if you are only going to use that index to access the elements. e.g. replace

for i in range(0,len(a)):
    if set([e[1]]).issubset(a[i]): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
    TriNei.append(a[i])    

with

for x in a:
    if set([e[1]]).issubset(x): # Keep triads which contain the whole edge (i.e. both nodes on the edge)
    TriNei.append(x)    

However I suspect that changes like this will not make a big difference to the overall runtime. To do that you either need to use a different algorithm or switch to a faster language. You could try running it in pypy - for some cases it can be significantly faster than CPython. You could also try cython, which will compile your code to C and can sometimes give a big performance gain especially if you annotate your code with cython type information. I think the biggest improvement may come from changing the algorithm to one that does less work, but I don't have any suggestions for that.

BTW, why loop 800000 times? What is the significance of that number?

Also, please use meaningful names for your variables. Using single character names or shrtAbbrv does not speed the code up at all, and makes it very hard to follow what it is doing.

Dave Kirby
  • 25,806
  • 5
  • 67
  • 84
  • Great, thank you very much. 800,000 is a ball park figure. It just happens to be the number that the guys working on this before me used. 300,000 or so may do the trick also. I'll have to check with my supervisor! – iRoygbiv Jul 17 '11 at 20:00
  • Could you not break out of the loop if the network is deemed good enough? e.g. if your random change has not improved the "energy" for the last X times round the loop? You would have to experiment with the value of X to stop you hitting a local maxima, but I would suspect that for reasonable sized networks your would reach a maxima well before 800,000 iterations. – Dave Kirby Jul 17 '11 at 20:12
2

There are quite a few things you can improve here. Start by profiling your program using a tool like cProfile. This will tell you where most of the program's time is being spent and thus where optimization is likely to be most helpful. As a hint, you don't need to generate all the triads at every iteration of the program.

You also need to fix your indentation before you can expect a decent answer.

Regardless, this question might be better suited to Code Review.

Community
  • 1
  • 1
RoundTower
  • 899
  • 5
  • 9
1

I'm not sure I understand exactly what you are aiming for, but there are at least two changes that might help. You probably don't need to destroy and create the graph every time in the loop since all you are doing is flipping one edge weight sign. And the computation to find the triangles can be improved.

Here is some code that generates a complete graph with random weights, picks a random edge in a loop, finds the triads and flips the edge weight...

import random 
import networkx as nx

# complete graph with random 1/-1 as weight
G=nx.complete_graph(5)
for u,v,d in G.edges(data=True):
    d['weight']=random.randrange(-1,2,2) # -1 or 1

edges=G.edges()
for i in range(10):
    u,v = random.choice(edges) # random edge
    nbrs = set(G[u]) & set(G[v]) - set([u,v]) # nodes in traids
    triads = [(u,v,n) for n in nbrs]
    print "triads",triads
    for u,v,w in triads:
        print (u,v,G[u][v]['weight']),(u,w,G[u][w]['weight']),(v,w,G[v][w]['weight'])
    G[u][v]['weight']*=-1
Aric
  • 24,511
  • 5
  • 78
  • 77
  • ah, that's very useful. Thanks a lot. All I am aiming for is faster/more efficient code. This is the first thing I've done in python.- as you can probably tell :) – iRoygbiv Jul 18 '11 at 10:00