3

What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?

e.g. find all positions of the "GCAGCTGAAAACA" sequence in the human genome http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770MB)

*the alphabet consists of 4 symbols ('G', 'C', 'T', 'A') represented using 2 bits: 'G':00, 'A':01, 'T':10, 'C':11

*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters

*by fastest I mean not including any pre-processing/indexing time

*the file is going to be loaded into memory after pre-processing; basically there will be billions of short strings to search for in one larger string, all in memory

*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.

*KMP wouldn't work well as the alphabet is small

*C code and x86 machine code would both be interesting.

Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html

Related:

Fastest way to scan for bit pattern in a stream of bits

Algorithm help! Fast algorithm in searching for a string with its partner

http://www.arstdesign.com/articles/fastsearch.html

http://en.wikipedia.org/wiki/Bitap_algorithm

alex
  • (aside from the arbitrary restriction, usually indicative of homework) - Search for Boyer-Moore, KMP – Mitch Wheat Jan 16 '12 at 05:15
  • 6
    Why is it necessary to use bitwise operators? – Michael Mior Jan 16 '12 at 05:18
  • Is the data recorded using 2-bits per character? If not, there is no benefit to the observation that the alphabet consists of just 4 characters (or, at least, I can't think of any benefit). – Jonathan Leffler Jan 16 '12 at 05:43
  • 1
    Hmm... if this file is packed into elements that are 2 bits long, there ought to be some way of optimizing the search so that 32/64-bit compares can be used to eliminate most of the search positions, leaving only a few to be checked further by bitwise compare. Some preprocessing of one or both files? Not sure. – Martin James Jan 16 '12 at 05:50
  • Did you misuse the assembly tag or did you forget to explain what this has to do with assembly language? – Alexey Frunze Jan 16 '12 at 06:48
  • 1
    How about a 16-bit lookup table, indexed by 16-bit values from the search-string, that returns a bit-position where a match with the first 16-bits of the scan-string has started, (if any). If this returns 0-15, you know how many bits to shift the next 16-bits to continue to verify the match. – Martin James Jan 16 '12 at 10:21
  • @alex The fastest way to find a substring is not to convert your alphabet. Keep everything aligned and ASCII. – Dave Jan 16 '12 at 13:56
  • @Dave - you may be right, but I'd have to see timings to be convinced. Pre-converting to ASCII blows up the file size by a factor of four - you may need 24GB of RAM (to do it quickly for more than one search). – Martin James Jan 16 '12 at 18:00

6 Answers

5

If you're just looking through a file, you're pretty much guaranteed to be I/O-bound. A large buffer (~16K) and strstr() should be all you need. If the file is encoded in ASCII, search just for "gcagctgaaaaca". If it is actually encoded in bits, permute the possible byte-aligned encodings of the query (with 2-bit symbols there are four, one per alignment; lop off the partial first byte) and use memmem() plus a small overlapping bit check at the ends.

I'll note here that glibc's strstr() and memmem() already search in linear time (current versions use the Two-Way algorithm rather than a naive scan), so test that performance first. It may surprise you.
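
A minimal sketch of that shifted-pattern approach, assuming the 2-bit packing from the question (G=00, A=01, T=10, C=11, four bases per byte, first base in the high bits); the helper names are mine, and the boundary-bit verification is left as a comment:

    #define _GNU_SOURCE                     /* memmem() is a GNU extension */
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* base -> 2-bit code from the question: G=00, A=01, T=10, C=11 */
    static uint8_t code(char b) {
        switch (b) { case 'G': return 0; case 'A': return 1;
                     case 'T': return 2; default:  return 3; /* 'C' */ }
    }

    /* Pack query q at 2-bit offset `shift` (0..3), keeping only whole,
     * fully-determined bytes; the partial head/tail bytes must be
     * verified by masked compares around each candidate hit. */
    static size_t pack_shifted(const char *q, size_t n, int shift,
                               uint8_t *out) {
        size_t nbytes = 0;
        uint8_t acc = 0;
        int filled = 2 * shift, first = 1;  /* `shift` pad bases precede q */
        for (size_t i = 0; i < n; i++) {
            acc = (uint8_t)((acc << 2) | code(q[i]));
            if ((filled += 2) == 8) {
                if (!(first && shift))      /* drop the padded head byte */
                    out[nbytes++] = acc;
                first = 0; acc = 0; filled = 0;
            }
        }
        return nbytes;                      /* trailing partial byte dropped */
    }

    int main(void) {
        const char query[] = "GCAGCTGAAAACA";
        uint8_t genome[1024] = {0};         /* stand-in for the unpacked .2bit data */
        size_t glen = sizeof genome;

        for (int s = 0; s < 4; s++) {       /* one pattern per byte alignment */
            uint8_t pat[64];
            size_t plen = pack_shifted(query, sizeof query - 1, s, pat);
            for (const uint8_t *p = genome;
                 (p = memmem(p, glen - (size_t)(p - genome), pat, plen));
                 p++)
                /* candidate: verify the 2*s clipped head bits and the
                 * trailing bits here with masked compares */
                printf("candidate at byte %zu, shift %d\n",
                       (size_t)(p - genome), s);
        }
        return 0;
    }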

Dave
3

If you first encode/compress the DNA string with a lossless coding method (e.g. Huffman or exponential-Golomb coding), you get a ranked probability table ("coding tree") for DNA tokens made of various combinations of nucleotides (e.g. A, AA, CA).

What this means is that, once you compress your DNA:

  1. You'll probably be using fewer bits to store GCAGCTGAAAACA and other subsequences than with the "unencoded" approach of always using two bits per base.
  2. You can walk through the coding tree or table to build an encoded search string, which will usually be shorter than the unencoded search string (see the sketch after this list).
  3. You can apply the same family of exact search algorithms (e.g. Boyer-Moore) to locate this shorter, encoded search string.
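
To make step 2 concrete, here is a toy version of building the encoded search string. The code table below is made up purely for illustration; a real one would come out of the Huffman/Golomb build and would also contain multi-base tokens like "AA":

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: a made-up prefix code for single bases.
     * code holds the bit pattern, len its length in bits (MSB first). */
    typedef struct { uint32_t code; int len; } sym_code;

    static sym_code table(char b) {
        switch (b) {
        case 'A': return (sym_code){0x0, 1};   /* 0   (most frequent) */
        case 'T': return (sym_code){0x2, 2};   /* 10  */
        case 'G': return (sym_code){0x6, 3};   /* 110 */
        default:  return (sym_code){0x7, 3};   /* 111 ('C') */
        }
    }

    /* Append each base's code to a zeroed bit buffer; returns the bit
     * count.  The resulting bit string is what the exact-search pass
     * (step 3) would then scan for inside the compressed sequence. */
    static size_t encode_query(const char *q, uint8_t *buf) {
        size_t bits = 0;
        for (; *q; q++) {
            sym_code s = table(*q);
            for (int i = s.len - 1; i >= 0; i--, bits++)
                if ((s.code >> i) & 1)
                    buf[bits >> 3] |= (uint8_t)(0x80 >> (bits & 7));
        }
        return bits;
    }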

As for a parallelized approach, split the encoded target string up into N chunks and run the search algorithm on each chunk, using the shortened, encoded search string. By keeping track of the bit offsets of each chunk, you should be able to generate match positions.
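
A sketch of that chunked scan with POSIX threads; for clarity it compares plain bytes with memcmp() rather than the encoded bit stream, but the overlap logic (extend each slice by the pattern length minus one, so boundary-straddling matches are still found) carries over to bit offsets:

    #include <pthread.h>
    #include <string.h>
    #include <stdio.h>

    #define NTHREADS 4

    typedef struct {
        const char *hay; size_t hay_len;   /* whole target */
        const char *pat; size_t pat_len;
        size_t begin, end;                 /* this thread's slice of start positions */
    } job_t;

    static void *scan_chunk(void *arg) {
        job_t *j = arg;
        /* read past the slice end by pat_len-1 bytes so a match that
         * starts in this slice but crosses the boundary is still found */
        size_t stop = j->end + j->pat_len - 1;
        if (stop > j->hay_len) stop = j->hay_len;
        for (size_t i = j->begin; i + j->pat_len <= stop; i++)
            if (memcmp(j->hay + i, j->pat, j->pat_len) == 0)
                printf("match at offset %zu\n", i);
        return NULL;
    }

    int main(void) {
        const char hay[] = "ACGTGCAGCTGAAAACAGGGCAGCTGAAAACA";
        const char pat[] = "GCAGCTGAAAACA";
        size_t n = sizeof hay - 1, m = sizeof pat - 1;
        size_t per = (n + NTHREADS - 1) / NTHREADS;
        pthread_t tid[NTHREADS];
        job_t job[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            size_t b = (size_t)t * per, e = b + per > n ? n : b + per;
            job[t] = (job_t){ hay, n, pat, m, b, e };
            pthread_create(&tid[t], NULL, scan_chunk, &job[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }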

Overall, this compression approach would be useful if you plan on doing millions of searches on sequence data that won't change. You'd be searching fewer bits — potentially many fewer, in aggregate.

Alex Reynolds
2

Boyer-Moore is a technique for searching for substrings in plain strings. The basic idea is that if your substring is, say, 10 characters long, you can look at the character at position 9 in the string being searched. If that character is not part of your search string, you can simply start the search after it. (If that character is in your string, the Boyer-Moore algorithm uses a look-up table to skip the optimal number of characters forward.)

It might be possible to reuse this idea for your packed representation of the genome string. After all, there are only 256 possible byte values, so you can safely pre-calculate the skip table.
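
A minimal Boyer-Moore-Horspool sketch (the simpler skip-table variant of Boyer-Moore) over the packed byte array; note it only finds byte-aligned matches, so in practice you would run it once per 2-bit shift of the packed pattern:

    #include <stddef.h>
    #include <stdint.h>

    /* Horspool scan over raw bytes: returns the offset of the first
     * byte-aligned match of pat in hay, or -1 if none. */
    static long horspool(const uint8_t *hay, size_t n,
                         const uint8_t *pat, size_t m) {
        size_t skip[256];
        if (m == 0 || m > n) return -1;
        for (size_t i = 0; i < 256; i++) skip[i] = m;          /* default shift */
        for (size_t i = 0; i + 1 < m; i++) skip[pat[i]] = m - 1 - i;

        for (size_t pos = 0; pos + m <= n; pos += skip[hay[pos + m - 1]]) {
            size_t i = m;
            while (i > 0 && hay[pos + i - 1] == pat[i - 1]) i--;
            if (i == 0) return (long)pos;                      /* match */
        }
        return -1;
    }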

Lindydancer
1

The benefit of encoding the alphabet into bit fields is compactness: one byte holds the equivalent of four characters. This is similar to some of the optimizations Google performs when searching for words.

This suggests four parallel searches, each with the (transformed) search string offset by one more character (two bits). A quick-and-dirty approach might be to look for the first or second byte of the search string, then match the rest byte by byte, checking the partial bytes before and after and masking off the ends as necessary. The first scan is handily done with the x86 instruction scasb; subsequent byte matches can build on the register values with cmpb.
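
In C, the scasb step corresponds to memchr(). A hedged sketch of one alignment's scan follows; tail_mask (my name) keeps only the pattern's meaningful bits in its last, possibly partial byte:

    #include <string.h>
    #include <stdint.h>

    /* Find the first match of one byte-aligned encoding of the pattern:
     * locate pat[0] with memchr() (the scasb step), confirm the middle
     * bytes, then compare the last byte under tail_mask (e.g. 0xC0 if
     * only one 2-bit base remains in it).  Returns the offset or -1. */
    static long scan_one_shift(const uint8_t *hay, size_t n,
                               const uint8_t *pat, size_t m,
                               uint8_t tail_mask) {
        for (const uint8_t *p = hay;
             (p = memchr(p, pat[0], n - (size_t)(p - hay))) != NULL;
             p++) {
            size_t off = (size_t)(p - hay);
            if (off + m > n) break;
            if (memcmp(p, pat, m - 1) == 0 &&
                ((p[m - 1] ^ pat[m - 1]) & tail_mask) == 0)
                return (long)off;
        }
        return -1;
    }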

wallyk
1

You could create a state machine. In this topic, Fast algorithm to extract thousands of simple patterns out of large amounts of text, I used [f]lex to create the state machine for me. It would require some hackery to use the 4-letter (two-bit) alphabet, but it can be done using the same tables as generated by [f]lex. (You could even create your own fgetc()-like function which extracts two bits at a time from the input stream and keeps the other six bits for subsequent calls. Pushback will be a bit harder, but not undoable.)
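
A sketch of that fgetc()-like reader (single-stream only, since it keeps its leftover bits in static storage; pushback is omitted), assuming four bases per byte with the first base in the high-order bits:

    #include <stdio.h>

    /* Return one 2-bit symbol (0..3) per call, buffering the unread
     * bits of the current byte between calls; EOF at end of input. */
    static int fget2bits(FILE *fp) {
        static int byte, nleft = 0;        /* bits still unread in `byte` */
        if (nleft == 0) {
            byte = fgetc(fp);
            if (byte == EOF) return EOF;
            nleft = 8;
        }
        nleft -= 2;
        return (byte >> nleft) & 0x3;
    }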

BTW: I seriously doubt there is any gain in compressing the data to two bits per nucleotide, but that is a different matter.

wildplasser
1

Okay, given your parameters, the problem isn't that hard; it's just not one you'd approach like a traditional string-search problem. It more resembles a database table-join problem, where the tables are much larger than RAM.

  • select a good rolling hash function, such as buzhash (see the sketch after this list). If you have billions of strings, you're looking for a hash with 64-bit values.

  • create a hash table keyed by the hash of each 127-element search string. The table in memory only needs to store (hash, string-id) pairs, not the whole strings.

  • scan your very large target string, computing the rolling hash and looking up each value of the hash in your table. Whenever there's a match, write the (string-id, target-offset) pair to a stream, possibly a file.

  • reread your target string and the pair stream, loading search strings as needed to compare them against the target at each offset.
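
A buzhash-style rolling hash over the 2-bit symbols, as referenced in the first step. The window length matches the question's 127-character queries, and the four table constants here are placeholders for properly random 64-bit values:

    #include <stdint.h>
    #include <stddef.h>

    #define WIN 127                        /* query length from the question */

    static inline uint64_t rotl64(uint64_t x, unsigned r) {
        return (x << (r & 63)) | (x >> ((64 - r) & 63));
    }

    /* One 64-bit constant per symbol; a real buzhash would use
     * genuinely random values fixed at startup. */
    static const uint64_t T[4] = {
        0x9E3779B97F4A7C15ull, 0xC2B2AE3D27D4EB4Full,
        0x165667B19E3779F9ull, 0x27D4EB2F165667C5ull,
    };

    /* Hash of the first window of 2-bit symbols (each sym[i] in 0..3). */
    static uint64_t buz_init(const uint8_t *sym) {
        uint64_t h = 0;
        for (size_t i = 0; i < WIN; i++)
            h = rotl64(h, 1) ^ T[sym[i]];
        return h;
    }

    /* Slide the window one symbol in O(1): remove `out`, append `in`. */
    static uint64_t buz_roll(uint64_t h, uint8_t out, uint8_t in) {
        return rotl64(h, 1) ^ rotl64(T[out], WIN & 63) ^ T[in];
    }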

I am assuming that loading all pattern strings into memory at once is prohibitive. There are ways to segment the hash table into something that is larger than RAM but not a traditional random-access hash file; if you're interested, search for "hybrid hash" and "grace hash", which are more common in the database world.

I don't know if it's worth your while, but your pair stream gives you the perfect predictive input for managing a cache of pattern strings: Belady's classic VM page-replacement algorithm.

Mischa