6

I have been trying to get over my fear of Cython (fear because I literally know NOTHING about c, or c++)

I have a function which takes 2 arguments, a set (we'll call it testSet), and a list of sets (we'll call that targetSets). The function then iterates through targetSets, and computes the length of the intersection with testSet, adding that value to a list, which is then returned.

Now, this isn't by itself that slow, but the problem is I need to do simulations of the testSet (and a large number at that, ~ 10,000), and the targetSet is about 10,000 sets long.

So for a small number of simulations to test, the pure python implementation was taking ~50 secs.

I tried making a cython function, and it worked and it's now running at ~16 secs.

If there is anything else that I could do to the cython function that anyone could think of that would be great (python 2.7 btw)

Here is my Cython implementation in overlapFunc.pyx

def computeOverlap(set testSet, list targetSets):
    cdef list obsOverlaps  = []
    cdef int i, N
    cdef set overlap
    N = len(targetSets)
    for i in range(N):
        overlap = testSet & targetSets[i]
        if len(overlap) <= 1:
            obsOverlaps.append(0)
        else:
            obsOverlaps.append(len(overlap))
    return obsOverlaps

and the setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

ext_modules = [Extension("overlapFunc", 
                         ["overlapFunc.pyx"])]

setup(
      name = 'computeOverlap function',
      cmdclass = {'build_ext': build_ext},
      ext_modules = ext_modules
      )

and some code to build some random sets for testing and to time the function. test.py

import numpy as np
from overlapFunc import computeOverlap
import time

def simRandomSet(n):
    for i in range(n):
        simSet= set(np.random.randint(low=1, high=100, size=50))
        yield simSet


if __name__ == '__main__':
    np.random.seed(23032014)
    targetSet = [set(np.random.randint(low=1, high=100, size=50)) for i in range(10000)]

    simulatedTestSets = simRandomSet(200)
    start = time.time()
    for i in simulatedTestSets:
        obsOverlaps = computeOverlap(i, targetSet)
    print time.time()-start

I tried changing the def at the start of the computerOverlap function, as in:

cdef list computeOverlap(set testSet, list targetSets):

but I get the following warning message when I run the setup.py script:

'__pyx_f_11overlapFunc_computeOverlap' defined but not used [-Wunused-function]

and then when I run something that tries to use the function I get an import Error:

    from overlapFunc import computeOverlap
ImportError: cannot import name computeOverlap

Thanks in advance for your help,

Cheers,

Davy

Davy Kavanagh
  • 4,809
  • 9
  • 35
  • 50
  • Aside: your `computeOverlap` function seems to have a bug; you're looping over `p` but indexing using `i` which you don't increment. – DSM Mar 23 '14 at 03:31
  • Your `if` test is redundant. The whole function could be written as `return [len(testSet & target) for target in targetSets]`. I believe it would also run faster, though I don't know enough about Cython to be completely sure. – user2357112 Mar 23 '14 at 10:10
  • Sorry for the bad example code. It was late and I started playing around with things after I started writing the question. Sloppy! – Davy Kavanagh Mar 23 '14 at 10:11
  • Wait, is that supposed to say `<= 1` or `< 1`? Do you mean to report length-1 intersections as length-0? – user2357112 Mar 23 '14 at 10:39
  • It is definitely supposed to say `< 1`. As I said, was tired, and very sloppy. Apologies. – Davy Kavanagh Mar 23 '14 at 10:44
  • Nope, nope it isn't. `>1`. Jesus. What's wrong with me this weekend – Davy Kavanagh Mar 23 '14 at 14:13
  • Have you thought about putting this code on codegolf.stackexchange.com, it looks like the kind of thing people on there find fun. As an example you could invert your representation as your domain is really small 1..100, take 100 lists and in each one store the sets the number is within. The intersection testing then becomes walking through and counting the hits, so an array of 100 counters could take care of that. Doesn't have much to do with Cython, but it looks like it could be sped up even more than that using bit manipulation tricks for the set membership. – Andrew May 01 '14 at 19:48

1 Answers1

2

In the following line, the extension module name and the filename does not match actual filename.

ext_modules = [Extension("computeOverlapWithGeneList", 
                         ["computeOverlapWithGeneList.pyx"])]

Replace it with:

ext_modules = [Extension("overlapFunc",
                         ["overlapFunc.pyx"])]
falsetru
  • 357,413
  • 63
  • 732
  • 636
  • I tried following the asciinema thingy (cool btw, haven't seen it before), and I changed the `ext_modules` line, but it didn't produce any different results. I've edited the question, as there were many mistakes. It was late. Sorry, and thanks. – Davy Kavanagh Mar 23 '14 at 10:37
  • @DavyKavanagh, Does it produce the `overlapFunc.so`? – falsetru Mar 23 '14 at 10:54
  • Yes, everything runs fine, but it's still as slow as before. – Davy Kavanagh Mar 23 '14 at 11:04
  • There is very little for Cython to optimize here. Most of the work is in testSet & targetSet[i] calls, which is taken care of by Python (actually implemented in C). Cython can anly remve the overhead from the Python interpreter here, which is already tiny. A good idea is to profile your code first. It will tell you where the bottlenecks are. Cython cannot make anything in Python run faster. Cython can integrate your Python code with C, and C can sometimes (but not always) be a lot faster than Python. But if you don't use any C in your Cython code, ypu will hardly gain anything from Cython. – Sturla Molden Mar 28 '14 at 16:35
  • Cython is actually very hard, because you must know Python and C, and be able to use C and Python simultaneously in a Python-like language. Cython might seems easy at first, but unless you know C properly you will just shoot yourself in the foot. Cython is not a static Python compiler that just magically speeds up your Python code. – Sturla Molden Mar 28 '14 at 16:43