0

I am building an example of De Bruijn Assembly for assembling a genome (or any string) by getting each possible word of n-length of the string, then finding the correct path through the reads by comparing the end-pieces of each node. which accepts as arguments a sequence and a size for each read of the sequence, it will first get collect all of the reads into an array of size [kmer_size][3], the [3] indexes 0=full read 1=all but far right char of read 2= all but far left char of read.

the portion that assembles the reads works as expected, it is separated into a function and those reads are printed correctly.

I then create a unordered_map using char* as keys and another map as value, that map is keyed by char* and valued by int.

what should happen is it should check to see if the section of the read excluding the leftmost char matches the same section of each other read, if they match, take the right-excluding part of the matching read and create a new entry in the internal map that is keyed by the left-excluding part of the read you are testing and increment the value of that element by 1.

if you look at the output you will see that when i, in a separate loop, print the contents of the nested maps, there are duplicate entries in both the outer and inner map. the char* keys that have the same string values are not putting items into the same bucket they are instead creating a new bucket with the same name. I assume that this is because char* is actually a string value but an address and they are pointing to different addresses.

How would I modify this code to allow my maps to have only 1 bucket for each string

#include<stdio.h>
#include<string.h>
#include<iostream>
#include<bits/stdc++.h> 
#include<unordered_map>

using namespace std;

void extractReads(char* kmers[][3], int num_kmers, int kmer_size, char* seq);

int main(int nargs, char* args[]){
    if(nargs!=3){
        cout<<"INVALID ARGUMENTS"<<endl;
        cout<<"dba <kmer_size> <sequence>"<<endl;
    }
    char* seq = args[2];
    int kmer_size = atoi(args[1]);
    int num_kmers = strlen(seq)-(kmer_size -1);
    char* kmers[num_kmers][3];
    unordered_map<char*, unordered_map<char*, int> > nodes;

    extractReads(kmers, num_kmers, kmer_size, seq);

    for(int i=0; i< num_kmers; i++)
    {
        for(int j=0; j<num_kmers; j++)
        {
            if(strcmp(kmers[i][2], kmers[j][2]) == 0 )
            {
                // cout<<" match"<<endl;
                nodes[kmers[i][2]][kmers[j][1]]++;
            }

        }
    }

    for(auto node: nodes)
    {
        cout<<node.first<<endl;
        for (auto n: node.second)
        {
            cout<<" "<<n.first<<" "<<n.second<<endl;
        }
    }

    return 0;
}



void extractReads(char* kmers[][3], int num_kmers, int kmer_size, char* seq)
{
    cout<<"READS"<<endl<<"==========="<<endl;
    for (int i=0; i<num_kmers; i++){
        kmers[i][0] = (char*) malloc(kmer_size);
        kmers[i][1] = (char*) malloc(kmer_size-1);
        kmers[i][2] = (char*) malloc(kmer_size-1);
        strncpy(kmers[i][0], seq+i, kmer_size);
        strncpy(kmers[i][1], kmers[i][0], kmer_size-1);
        strncpy(kmers[i][2], kmers[i][0]+1, kmer_size-1);
        cout<<kmers[i][0]<<" : "<<kmers[i][1]<<" "<<kmers[i][2]<<endl;
    }    
    cout<<"==========="<<endl;

}
  • `char* kmers[num_kmers][3];` -- This is not valid C++. Arrays in C++ must have their sizes denoted by constant expressions, not runtime variables such as `num_kmers`. Use `std::vector`. Also, if you want to use strings as keys, use `std::string` as the key, not `char *`. Using `std::string` would also remove the need for `malloc`. – PaulMcKenzie Feb 01 '19 at 02:35
  • Unrelated: The fact that you have other library headers along with `#include` suggest you do not understand what `#include` does. The actual goal of that header is pretty neat, but you're misusing. It, along with everything else in `bits`, should never be included directly. Some reading on that: [Why should I not #include ?](https://stackoverflow.com/questions/31816095/why-should-i-not-include-bits-stdc-h) – user4581301 Feb 01 '19 at 03:01
  • Related: see [C++ unordered_map with char* as key](https://stackoverflow.com/questions/20649864/), [Using const char* as key for map/unordered_map](https://stackoverflow.com/questions/50608392/), and [C++ unordered_map with char* key produces unexpected behavior](https://stackoverflow.com/questions/17506759/) – Remy Lebeau Feb 01 '19 at 06:57

1 Answers1

2

Your code has many problems (as comments to the question suggest), I'll list them at the end of the answer since they are unrelated to the core of the question.

The problematic line, as you suspected, is the line:

unordered_map<char*, unordered_map<char*, int> > nodes

As you said

this is because char* is actually a string value but an address and they are pointing to different addresses.

In other words, your strings (kmers) are compared as pointers. If two char * objects are allocated with two different malloc calls, then they have a different address. The unordered_map compares only the address, not the set of characters that reside at the address.

The solution is to start using C++ strings rather than C zero-terminated strings:

std::unordered_map<std::string, std::unordered_map<std::string, int> > nodes

This will fix other problems that your code has:

  1. Your code has a memory leak. It allocates memory with malloc and never frees it. Using std::string cures the problem.
  2. kmers tend to be relatively short strings (most of them below 12 letters). std::string is optimized exactly for this case and avoids heap memory altogether for these strings. The code will run much faster with std::string by avoiding unnecessary heap allocations.

Another option, which is less desirable, is to provide your own Hash an KeyEqual functions:

class cstr_hash
{
   public:
      std::size_t operator()(const char *cstr) const
      {
          std::size_t hash = 5381;
          for ( ; *cstr != '\0' ; ++cstr)
             hash = (hash * 33) + *cstr;
          return hash;
      }
};
class cstr_eq
{
   public:
     using value_type = const char*;
     bool operator()(const char* a, const char *b) const
     { return strcmp(a, b) == 0; }
};

and then use the map:

 std::unordered_map<const char *, int, cstr_hash, cstr_eq> nodes;

But this approach is not advisable since it makes it harder to avoid memory leaks and does not optimize short strings like std::string does.


Some other unrelated issues your code has:
 char* kmers[num_kmers][3];

This is not C++. Most compilers support VLA (variable length array), but it is not part of the standard. Better use std::vector<std::string>.

Memory leaks. You allocate strings with malloc, and never free them. Better use std::string and pass it around so that malloc is no longer used in the code.

unordered_map is usually less efficient that std::map for containers with less than 10,000 elements. With genome data there is some chance you are hitting the case where std::unordered_map is worth it, but I'd test this (especially for the inner container).

Another problem is using std::endl, which can make your code run 2-10 times slower. You should use '\n' instead of endl. What endl does is flush the output at the end of the line. The additional system call makes a big difference in many cases, with respect to performance. Of course, if this is just debugging code then it does not matter.

Michael Veksler
  • 8,217
  • 1
  • 20
  • 33