
So I have two very large files which represent each position in the human genome. Both of these files are binary masks for a certain type of "score" for each position in the genome, and I am interested in getting a new mask where both scores are "1", i.e. the intersection of the two masks.

For example:

File 1:          00100010101
File 2:          11111110001
Desired output:  00100010001

In Python, it is really fast to read these big files (they contain between 50 and 250 million characters) into strings. However, I can't just `&` the strings together. I CAN do something like

bin(int('0001',2) & int('1111', 2))

but is there a more direct way that doesn't require that I pad in the extra 0's and convert back to a string in the end?

isosceleswheel (edited by Kevin)

1 Answer


I think the conversion to builtin integer types for the binary-and operation is likely to make it much faster than working character by character (because Python's int is written in C rather than Python). I'd suggest working on each line of your input files, rather than the whole multi-million-character strings at once. The binary-and operation doesn't require any carrying, so there's no issue working with each line separately.

To avoid awkward string operations to pad the result out to the right length, you can use the str.format method to convert your integer to a binary string of the right length in one go. Here's an implementation that writes the output out to a new file:

import itertools

# Stream both input files in parallel, line by line.
# (Python 2: itertools.izip is lazy, unlike the builtin zip.)
with open(filename1) as in1, open(filename2) as in2, open(filename3, "w") as out:
    for line1, line2 in itertools.izip(in1, in2):
        # Parse each line as a binary number, AND them, then format the result
        # back to a zero-padded binary string of the original width.
        # len(line1) - 1 excludes the trailing newline from the width.
        out.write("{0:0{1}b}\n".format(long(line1, 2) & long(line2, 2), len(line1) - 1))

I'm using one of the neat features of the string formatting mini-language: a nested replacement field that passes a desired length for the converted number as a second argument. If you can rely upon the lines always having exactly 50 binary digits (including the last line of each file), you could hard-code the length with {:050b} rather than computing it from the length of an input line.
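To make the formatting trick concrete, here's a minimal sketch using the example strings from the question (written for Python 3, where int handles arbitrarily large values):

```python
a = "00100010101"
b = "11111110001"

# The nested replacement field {0:0{1}b} formats argument 0 in binary,
# zero-padded to the width given by argument 1.
anded = "{0:0{1}b}".format(int(a, 2) & int(b, 2), len(a))
print(anded)  # → 00100010001
```

The zero-padding matters because leading zeros are lost in the integer representation: without the explicit width, the result here would come back as the 9-character string "100010001".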

Blckknght
  • Isn't this likely to exceed the maximum value allowed for `int`s? On my system that is 33 bits so I think I would have to use `long` instead. – Stuart Dec 09 '15 at 20:34
  • Ah, for Python 2 you're probably right. I was writing for Python 3 where there's just one unified `int` type that can handle integers of any size. You probably also want to use `itertools.izip` instead of the builtin `zip` to avoid reading all the lines of the files into memory up front (Python 3's `zip` is a generator). – Blckknght Dec 09 '15 at 20:35
  • Actually, you don't need to use `long` explicitly, the `int` constructor will return one if necessary. I've edited to use `izip` though. – Blckknght Dec 09 '15 at 20:38
  • I'm _guessing_ that processing in chunks that keep you within the max int (e.g. 32) will probably be more efficient. – mgilson Dec 09 '15 at 20:41
  • @mgilson: That's certainly possible. My (possibly naive) assumption is that the logic of `int`/`long` is probably at least as fast as any code I could write in Python to split up the input more finely. – Blckknght Dec 09 '15 at 20:52
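For what it's worth, mgilson's chunking suggestion from the comments could be sketched like this (a hypothetical helper, assuming both strings have equal length and a chunk size small enough to stay within a machine-sized int):

```python
def and_binary_strings(a, b, chunk=32):
    """AND two equal-length binary strings in fixed-size chunks."""
    out = []
    for i in range(0, len(a), chunk):
        piece_a, piece_b = a[i:i + chunk], b[i:i + chunk]
        anded = int(piece_a, 2) & int(piece_b, 2)
        # Re-pad each chunk to its original width so leading zeros survive.
        out.append("{0:0{1}b}".format(anded, len(piece_a)))
    return "".join(out)

print(and_binary_strings("00100010101", "11111110001", chunk=4))  # → 00100010001
```

Whether this beats a single big-int conversion is worth profiling either way; the per-chunk Python-level loop overhead may well outweigh the cost of the arbitrary-precision arithmetic.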