-2

I searched for ways to count how often a string of letters appears in a .txt file and found (among others) this thread: Count the number of times each word occurs in a file

which deals with the problem by counting words (which are separated by spaces).

However, I need to do something slightly different: I have a .txt file containing billions of letters without any formatting (no spaces, no puntuation, no line breaks, no hard returns, etc.), just a loooooong line of the letters a, g, t and c (i.e: a DNA sequence ;)).

Now I want to write a program that goes through the entire sequence and count how often each possible continuous sequence of 9 letters appears in that file.

Yes, there are 4^9 possible combinations of 9-letter 'words' made up of the characters A, G, T and C, but I only want to output the top 1000.

Since there are no spaces or anything, I would have to go through the file letter-by-letter and examine all the 9-letter 'words' that appear, i.e.:

ATAGAGCTAGATCCCTAGCTAGTGACTA

contains the sequences:

ATAGAGCTA, TAGAGCTAG, AGAGCTAGA, etc.

I hope you know what I mean, it's hard for me to describe the same in English, since it is not my native language.

Best regards and thank you all in advance!

unlimit
  • 3,672
  • 2
  • 26
  • 34
rokyo
  • 35
  • 4

2 Answers2

2

Compared to billions, 2^18, or 256k seems suddenly small. The good news is that it means your histogram can be stored in about 1 MB of data. A simple approach would be to convert each letter to a 2-bit representation, assuming your file only contains AGCT, and none of the RYMK... shorthands and wildcards.

This is what this 'esquisse' does. It packs the 9 bytes of text into an 18 bit value and increments the corresponding histogram bin. To speed up[ conversion a bit, it reads 4 bytes and uses a lookup table to convert 4 glyphs at a time.

I don't know how fast this will run, but it should be reasonable. I haven't tested it, but I know it compiles, at least under gcc. There is no printout, but there is a helper function to unpack sequence packed binary format back to text.

It should give you at least a good starting point

#include <vector>
#include <array>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <exception>

namespace dna {

    // helpers to convert nucleotides to packed binary form
    enum Nucleotide : uint8_t { A, G, C, T };

    uint8_t simple_lut[4][256] = {};

    void init_simple_lut()
    {
        for (size_t i = 0 ; i < 4; ++i)
        {
            simple_lut[i]['A'] = A << (i * 2);
            simple_lut[i]['C'] = C << (i * 2);
            simple_lut[i]['G'] = G << (i * 2);
            simple_lut[i]['T'] = T << (i * 2);
        }
    }


    uint32_t pack4(const char(&seq)[4])  
    {
        return simple_lut[0][seq[0]] 
             + simple_lut[1][seq[1]]
             + simple_lut[2][seq[2]]
             + simple_lut[3][seq[3]];
    }

    // you can use this to convert the historghrtam 
    // index back to text.
    std::string hist_index2string(uint32_t n)
    {
        std::string result;
        result.reserve(9);
        for (size_t i = 0; i < 9; ++i, n >>= 2)
        {
            switch (n & 0x03)
            {
            case A: result.insert(result.begin(), 'A'); break;
            case C: result.insert(result.begin(), 'C'); break;
            case G: result.insert(result.begin(), 'G'); break;
            case T: result.insert(result.begin(), 'T'); break;
            default:
                throw std::runtime_error{ "totally unexpected error while unpacking index !!" };
            }
        }
        return result;
    }
}

int main(int argc, const char**argv, const char**)
{
    if (argc < 2)
    {
        std::cerr << "Usage: prog_name <input_file> <output_file>\n";
        return 3;
    }

    using dna::pack4;
    dna::init_simple_lut();
    std::vector<uint32_t> result;

    try
    {
        result.resize(1 << 18);
        std::ifstream ifs(argv[1]);

        // read 4 bytes at a time, convert to packed bits representation
        // then rotate in bits 2 by 2 in our 18 bits buffer.
        // increment coresponding bin by 1
        const uint32_t MASK{ (1 << 19) - 1 };
        const std::streamsize RB{ 4 };
        uint32_t accu{};
        uint32_t packed{};

        // we need to load at least 9 bytes to 'prime' the engine
        char buffer[4];
        ifs.read(buffer, RB);
        accu = pack4(buffer) << 8;
        ifs.read(buffer, RB);
        accu |= pack4(buffer);
        if (ifs.gcount() != RB)
        {
            throw std::runtime_error{ " input file is too short "};
        }

        ifs.read(buffer, RB);
        while (ifs.gcount() != 0)
        {
            packed = pack4(buffer);

            for (size_t i = 0; i < (size_t)ifs.gcount(); ++i)
            {
                accu <<= 2;
                accu |= packed & 0x03;
                packed >>= 2;

                accu &= MASK;
                ++result[accu];
            }
            ifs.read(buffer.pch, RB);
        }
        ifs.close();

        // histogram is compiled. store data to another file...
        // you can crate a secondary table of { index and count } 
        // it's only 5MB long and use partial_sort to extract the first 
        // 1000.
    }
    catch(std::exception& e)
    {
        std::cerr << "Error \'" << e.what() <<  "\' while reading file.\n";
        return 3;
    }

   return 0;
}

This algorithm can be adapted to run on multiple threads, by opening the file in multiple streams with the proper share configuration, and running the loop on bits of the file. Care must be taken for the 16 bytes seams at the end of the process.

If running in parallel, the inner loop is so short that it may be a good idea to provide each thread with its own histogram and merge the results at the end, otherwise, the locking overhead would slow things quite a bit.

[EDIT] Silly me I had the packed binary lookup wrong.

[EDIT2] replaced the packed lut with a faster version.

Michaël Roy
  • 6,338
  • 1
  • 15
  • 19
  • Thanks a lot! This is probably way too advanced for me but it works great! ;) – rokyo Aug 21 '17 at 10:30
  • The code basically creates a truly unique hash value as fast as it can and uses that to generate the histogram. For billions of inputs, you will want to adapt for parallel processing. I'm glad to hear it works, because I hadn't tested it. Let me know if you have problems with extracting the top 1000. – Michaël Roy Aug 21 '17 at 13:41
1

This works for you,

#include <iostream>
#include <fstream>
#include <string>

using namespace std;
    int main () 
    {
    string line;
    int sum=0;
    ifstream inData ;
    inData.open("countletters.txt");
    while(!inData.eof())
    {
    getline(inData,line);
        int numofChars= line.length();
        for (unsigned int n = 0; n<line.length();n++)
        { 
        if (line.at(n) == ' ')
        {
        numofChars--;
        }
        }
    sum=numofChars+sum;
    }
    cout << "Number of characters: "<< sum << endl;
        return 0 ;
    }
Sanjay Kumaar
  • 690
  • 7
  • 17
  • You may want to change `int numofChars= line.length();` to `std::string::size_type numofChars= line.length();` since `std::string::length()` may overflows `int`. – muXXmit2X Aug 18 '17 at 12:15
  • i am using `using namespace std`. – Sanjay Kumaar Aug 18 '17 at 12:17
  • That is correct however has nothing to do with my comment. `std::string::size_type` is most likely `std::size_t` which is usually a 64 bit unsigned integer data type. `int` however is (usually) only 32 bit big and also signed. Therefore your program may result in an integer overflow. – muXXmit2X Aug 18 '17 at 12:18
  • ohk..but next time carefully comment anywhere.@muXXmit2X – Sanjay Kumaar Aug 18 '17 at 12:21
  • 1
    @SanjayKumaar muXXmit2X told you that an int variable may overflow if you try to store the length of a large string and recommended you to use a variable of type string::size_type instead of int to prevent that kind of behavior. – theVoid Aug 18 '17 at 12:26
  • 2
    Thank you for your reply! But I think it only counts the total number of characters in the file, correct? I need to find sequences of 9 letters (i.e. ATGCCGATC) and how often they occur in the file. For example, since it is DNA (the mouse genome, to be exact), the sequence AAAAAAAAA will be rare but the sequence CAGGTAAGT (which is a conserved splice donor site) appears very often in a genome. ;) – rokyo Aug 18 '17 at 12:29