Compared to billions, 2^18, or 256k, suddenly seems small. The good news is that it means your whole histogram can be stored in about 1 MB of data. A simple approach is to convert each letter to a 2-bit representation, assuming your file only contains AGCT and none of the RYMK... shorthands and wildcards.
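Just to make the size arithmetic concrete, here is a quick throwaway snippet (the uint32_t counters match what the full program below uses):

#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // 9 letters x 2 bits each = 18 bits, so 2^18 = 262,144 possible 9-mers
    const std::size_t bins = std::size_t(1) << 18;
    std::vector<std::uint32_t> histogram(bins);
    std::cout << histogram.size() * sizeof(std::uint32_t) << " bytes\n"; // 1,048,576 = 1 MiB
    return 0;
}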
This is what the code below does. It packs each 9-byte window of text into an 18-bit value and increments the corresponding histogram bin. To speed up conversion a bit, it reads 4 bytes at a time and uses a lookup table to convert 4 glyphs at once.
I don't know exactly how fast this will run, but it should be reasonable. I haven't tested it, but I know it compiles, at least under gcc. There is no printout, but there is a helper function to unpack the packed binary format back to text.
It should give you at least a good starting point.
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <string>
#include <cstdint>
#include <stdexcept>
#include <exception>
namespace dna {
// helpers to convert nucleotides to packed binary form
enum Nucleotide : uint8_t { A, G, C, T };
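// simple_lut[i][c] holds the 2-bit code of glyph c, pre-shifted to
// position i (bits 2*i and 2*i+1). Every other entry stays 0, so any
// unexpected glyph silently maps to A.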
uint8_t simple_lut[4][256] = {};
void init_simple_lut()
{
for (size_t i = 0 ; i < 4; ++i)
{
simple_lut[i]['A'] = A << (i * 2);
simple_lut[i]['C'] = C << (i * 2);
simple_lut[i]['G'] = G << (i * 2);
simple_lut[i]['T'] = T << (i * 2);
}
}
uint32_t pack4(const char(&seq)[4])
{
return simple_lut[0][seq[0]]
+ simple_lut[1][seq[1]]
+ simple_lut[2][seq[2]]
+ simple_lut[3][seq[3]];
}
// you can use this to convert a histogram
// index back to text.
std::string hist_index2string(uint32_t n)
{
std::string result;
result.reserve(9);
for (size_t i = 0; i < 9; ++i, n >>= 2)
{
switch (n & 0x03)
{
case A: result.insert(result.begin(), 'A'); break;
case C: result.insert(result.begin(), 'C'); break;
case G: result.insert(result.begin(), 'G'); break;
case T: result.insert(result.begin(), 'T'); break;
default:
throw std::runtime_error{ "totally unexpected error while unpacking index !!" };
}
}
return result;
}
}
int main(int argc, char* argv[])
{
if (argc < 2)
{
std::cerr << "Usage: prog_name <input_file> <output_file>\n";
return 3;
}
using dna::pack4;
dna::init_simple_lut();
std::vector<uint32_t> result;
try
{
result.resize(1 << 18);
std::ifstream ifs(argv[1]);
if (!ifs)
{
throw std::runtime_error{ "could not open input file" };
}
// read 4 bytes at a time, convert them to packed bits representation,
// then shift the codes in, 2 by 2, into our 18-bit accumulator and
// increment the corresponding bin after each new glyph.
const uint32_t MASK{ (1 << 18) - 1 };  // 9 glyphs x 2 bits
const std::streamsize RB{ 4 };
uint32_t accu{};
uint32_t packed{};
// prime the engine with the first 8 glyphs; the 9th, read in the
// main loop, completes the first 9-mer. Shift them in oldest-first,
// exactly as the main loop does below.
char buffer[4];
for (int block = 0; block < 2; ++block)
{
ifs.read(buffer, RB);
if (ifs.gcount() != RB)
{
throw std::runtime_error{ "input file is too short" };
}
packed = pack4(buffer);
for (size_t i = 0; i < 4; ++i)
{
accu = (accu << 2) | (packed & 0x03);
packed >>= 2;
}
}
ifs.read(buffer, RB);
while (ifs.gcount() != 0)
{
packed = pack4(buffer);
for (size_t i = 0; i < (size_t)ifs.gcount(); ++i)
{
accu <<= 2;
accu |= packed & 0x03;
packed >>= 2;
accu &= MASK;
++result[accu];
}
ifs.read(buffer, RB);
}
ifs.close();
// histogram is compiled. store the data to another file...
// you can create a secondary table of { index, count } pairs
// (256k entries, about 2 MB with 32-bit counts) and use
// partial_sort to extract the first 1000, as sketched below.
}
catch(std::exception& e)
{
std::cerr << "Error \'" << e.what() << "\' while reading file.\n";
return 3;
}
return 0;
}
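To make that last comment concrete, here is a minimal sketch of the { index, count } table and the partial_sort extraction. It is meant to be dropped into the same file as the code above (it reuses dna::hist_index2string and the headers already included); Entry, print_top and TOP_N are names I made up.

struct Entry { uint32_t index; uint32_t count; };

// build the { index, count } table from the finished histogram and print
// the TOP_N most frequent 9-mers as text, using dna::hist_index2string.
void print_top(const std::vector<uint32_t>& histogram)
{
    const size_t TOP_N = 1000;
    std::vector<Entry> table;
    table.reserve(histogram.size());
    for (size_t i = 0; i < histogram.size(); ++i)
        table.push_back({ static_cast<uint32_t>(i), histogram[i] });
    const size_t n = std::min(TOP_N, table.size());
    // partial_sort only orders the first n entries, which is all we need
    std::partial_sort(table.begin(), table.begin() + n, table.end(),
        [](const Entry& a, const Entry& b) { return a.count > b.count; });
    for (size_t i = 0; i < n; ++i)
        std::cout << dna::hist_index2string(table[i].index)
                  << '\t' << table[i].count << '\n';
}

Calling print_top(result) right after ifs.close() would list the top 1000 on stdout; writing them to argv[2] instead is a one-line change.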
This algorithm can be adapted to run on multiple threads by opening the file in multiple streams, with the proper share configuration, and running the loop on separate slices of the file. Care must be taken at the seams between slices: a 9-mer can span up to 8 bytes of the preceding slice, so the slices must overlap by 8 bytes (or the seams must be processed separately at the end).
If running in parallel, the inner loop is so short that it may be a good idea to provide each thread with its own histogram and merge the results at the end; otherwise, the locking overhead would slow things down quite a bit.
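If it helps, here is a minimal sketch of that per-thread-histogram idea. To keep it short it scans an in-memory std::string instead of several file streams, and code_of, count_slice and parallel_histogram are names I made up. Each worker only counts the 9-mers that end inside its slice, so nothing at a seam is lost or counted twice, and the partial histograms are summed once at the end with no locking.

#include <cstdint>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// 2-bit code of a glyph, matching the Nucleotide enum above (A=0, G=1, C=2, T=3);
// anything unexpected maps to A, like in the lookup table.
static uint32_t code_of(char c)
{
    switch (c) { case 'G': return 1; case 'C': return 2; case 'T': return 3; default: return 0; }
}

// count every 9-mer whose last glyph index i satisfies first <= i < last
static void count_slice(const std::string& seq, size_t first, size_t last,
                        std::vector<uint32_t>& hist)
{
    const uint32_t MASK = (1u << 18) - 1;
    uint32_t accu = 0;
    for (size_t i = first - 8; i < first; ++i)  // prime with the 8 glyphs before the slice
        accu = ((accu << 2) | code_of(seq[i])) & MASK;
    for (size_t i = first; i < last; ++i)
    {
        accu = ((accu << 2) | code_of(seq[i])) & MASK;
        ++hist[accu];                           // private histogram: no locking needed
    }
}

std::vector<uint32_t> parallel_histogram(const std::string& seq, size_t n_threads)
{
    std::vector<std::vector<uint32_t>> parts(n_threads, std::vector<uint32_t>(1 << 18));
    std::vector<std::thread> workers;
    const size_t total = seq.size() - 8;        // number of 9-mers (assumes seq.size() >= 9)
    for (size_t t = 0; t < n_threads; ++t)
    {
        const size_t first = 8 + t * total / n_threads;
        const size_t last = 8 + (t + 1) * total / n_threads;
        workers.emplace_back(count_slice, std::cref(seq), first, last, std::ref(parts[t]));
    }
    for (auto& w : workers)
        w.join();
    std::vector<uint32_t> hist(1 << 18);        // merge the partial histograms once
    for (const auto& part : parts)
        for (size_t i = 0; i < hist.size(); ++i)
            hist[i] += part[i];
    return hist;
}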
[EDIT] Silly me, I had the packed binary lookup wrong.
[EDIT2] Replaced the packed LUT with a faster version.