What is the fastest (parallel?) way to find a substring in a very long string using bitwise operators?
For example: find all positions of the sequence "GCAGCTGAAAACA" in the human genome, http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/hg18.2bit (770 MB).
*the alphabet consists of 4 symbols ('G', 'C', 'T', 'A'), each represented using 2 bits: 'G':00, 'A':01, 'T':10, 'C':11
*you can assume the query string (the shorter one) is fixed in length, e.g. 127 characters
*by "fastest" I mean the search time only, not including any pre-processing/indexing time
*the file will be loaded into memory after pre-processing; there will then be billions of short strings to search for in the larger string, all in memory
*bitwise because I'm looking for the simplest, fastest way to search for a bit pattern in a large bit array and stay as close as possible to the silicon.
*KMP wouldn't work well as the alphabet is small
*C code or x86 machine code would both be interesting.
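To make the setup concrete, here is a minimal baseline sketch in C of the kind of search I mean. It is not a tuned solution, just a reference point: it packs bases with the 2-bit codes listed above, then slides a 64-bit window over the packed text one base at a time and compares it to the packed query under a mask. The helper names (`base_code`, `pack_2bit`, `get_base`, `find_all`) are made up for illustration, the byte layout (first base in the high bits of each byte) is an assumption rather than something taken from the .2bit file, and the single-word window limits the query to 32 bases, so a 127-base query would need a multi-word window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 2-bit codes as listed in the question: G=00, A=01, T=10, C=11. */
static int base_code(char c) {
    switch (c) {
    case 'G': return 0;
    case 'A': return 1;
    case 'T': return 2;
    case 'C': return 3;
    default:  return -1;   /* not a valid base */
    }
}

/* Hypothetical helper: pack n bases into out, 4 bases per byte,
 * first base in the two highest bits (assumed layout).
 * out must hold (n + 3) / 4 bytes. Returns 0, or -1 on an invalid base. */
static int pack_2bit(const char *s, size_t n, uint8_t *out) {
    memset(out, 0, (n + 3) / 4);
    for (size_t i = 0; i < n; i++) {
        int c = base_code(s[i]);
        if (c < 0) return -1;
        out[i / 4] |= (uint8_t)(c << (6 - 2 * (i % 4)));
    }
    return 0;
}

/* Read the 2-bit code of base i from a buffer produced by pack_2bit. */
static unsigned get_base(const uint8_t *packed, size_t i) {
    return (packed[i / 4] >> (6 - 2 * (i % 4))) & 3u;
}

/* Find every occurrence (overlaps included) of pat (m bases, 1..32)
 * in a packed text of n bases. Stores up to max_hits start positions
 * in hits and returns the total number of matches found. */
static size_t find_all(const uint8_t *text, size_t n,
                       const char *pat, size_t m,
                       size_t *hits, size_t max_hits) {
    assert(hits != NULL || max_hits == 0);
    if (m == 0 || m > 32 || n < m) return 0;

    /* Mask keeping only the low 2*m bits of the window. */
    uint64_t mask = (m == 32) ? ~(uint64_t)0
                              : (((uint64_t)1 << (2 * m)) - 1);

    /* Pack the query into one 64-bit word. */
    uint64_t want = 0;
    for (size_t i = 0; i < m; i++) {
        int c = base_code(pat[i]);
        if (c < 0) return 0;
        want = (want << 2) | (uint64_t)c;
    }

    uint64_t window = 0;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        /* Shift in the next base; the mask drops the oldest one. */
        window = ((window << 2) | get_base(text, i)) & mask;
        if (i + 1 >= m && window == want) {
            if (count < max_hits) hits[count] = i + 1 - m;
            count++;
        }
    }
    return count;
}
```

Each step costs one shift, one OR, one AND and one compare, but it still advances a single base per iteration and calls `get_base` every time. What I am after is something that does better than this, for example by comparing whole words or using SIMD.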
Input format description (.2bit): http://jcomeau.freeshell.org/www/genome/2bitformat.html
Related:
Fastest way to scan for bit pattern in a stream of bits
Algorithm help! Fast algorithm in searching for a string with its partner