The major problem I see here is that you're writing everything out into a file. There's no point in doing this. The large output file you create is very redundant, and loading it back in when you do your analysis isn't helpful.
After you've loaded the file initially, every window you're interested in looking at is a[x:x+100]
for some x
. You shouldn't need to actually generate those windows explicitly at all: there shouldn't be any benefit to doing so. Go through, and generate those matrices from each window of a directly.
If you really do need the whole thing, generate it as a numpy array. I additionally, if I'm not using any degenerate base codes, convert the sequence into uint8s using 0,1,2,3 for A, C, G, T. This can help to speed up things, especially if you need to take complements at any point, which can be done as simple fiddling around with bits.
Numpy can generate the array quite efficiently using stride_tricks
, as noted in this blog post:
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
b = numpy.array([x for x in a], dtype=numpy.character)
rolling_window(b,100)
Or, converting to ints:
def rolling_window(a, window):
shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
strides = a.strides + (a.strides[-1],)
return numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
handle = open("U00096.2.fas.txt", "r")
a = handle.readlines()[1:]
a = ''.join([x.strip() for x in a])
conv = {'a': 0, 'c': 1, 'g': 2, 't': 3}
b = numpy.array([conv[x] for x in a], dtype=numpy.uint8)
rolling_window(b,100)
This code is around ten times faster than yours on my machine.