I am building an example of De Bruijn assembly, which assembles a genome (or any string) by collecting every word of length n in the string and then finding the correct path through those reads by comparing the end pieces of each node. The program takes a read size and a sequence as arguments. It first collects all of the reads into an array of size [num_kmers][3], where the second index means: 0 = the full read, 1 = the read without its rightmost char, 2 = the read without its leftmost char.
The portion that collects the reads works as expected; it is separated into its own function, and the reads are printed correctly.
I then create an unordered_map that uses char* as keys and another unordered_map as values; the inner map is keyed by char* and valued by int.
What should happen is this: for each read, check whether the part that excludes the leftmost char matches the same part of every other read. If they match, take the right-excluding part of the matching read and use it as a key in the inner map that belongs to the left-excluding part of the read being tested, then increment that entry's value by 1.
If you look at the output, you will see that when I print the contents of the nested maps in a separate loop, there are duplicate entries in both the outer and inner maps. char* keys that hold the same string value do not end up in the same entry; each one creates a new entry with the same name. I assume that this is because a char* is not actually a string value but an address, and the keys point to different addresses.
How would I modify this code so that my maps have only one entry for each distinct string?
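#include <cstdlib>
#include <cstring>
#include <iostream>
#include <unordered_map>
using namespace std;

void extractReads(char* kmers[][3], int num_kmers, int kmer_size, char* seq);

int main(int nargs, char* args[]){
    if(nargs!=3){
        cout<<"INVALID ARGUMENTS"<<endl;
        cout<<"dba <kmer_size> <sequence>"<<endl;
        return 1; // stop here instead of reading arguments that do not exist
    }
    char* seq = args[2];
    int kmer_size = atoi(args[1]);
    int num_kmers = (int)strlen(seq)-(kmer_size -1);
    if(kmer_size<2 || num_kmers<1){
        cout<<"kmer_size must be between 2 and the sequence length"<<endl;
        return 1;
    }
    // kmers[i][0] = full read, [1] = read without its rightmost char,
    // [2] = read without its leftmost char.
    // Note: a runtime-sized array like this is a compiler extension, not standard C++.
    char* kmers[num_kmers][3];
    // Keyed by char*: this is the map that ends up with duplicate entries.
    unordered_map<char*, unordered_map<char*, int> > nodes;
    extractReads(kmers, num_kmers, kmer_size, seq);
    for(int i=0; i< num_kmers; i++)
    {
        for(int j=0; j<num_kmers; j++)
        {
            // Compare the left-excluding parts of read i and read j by content.
            if(strcmp(kmers[i][2], kmers[j][2]) == 0 )
            {
                // cout<<" match"<<endl;
                nodes[kmers[i][2]][kmers[j][1]]++;
            }
        }
    }
    // Print every outer key followed by its inner map.
    for(auto node: nodes)
    {
        cout<<node.first<<endl;
        for (auto n: node.second)
        {
            cout<<" "<<n.first<<" "<<n.second<<endl;
        }
    }
    return 0;
}

void extractReads(char* kmers[][3], int num_kmers, int kmer_size, char* seq)
{
    cout<<"READS"<<endl<<"==========="<<endl;
    for (int i=0; i<num_kmers; i++){
        // Allocate one extra byte per buffer for the terminating '\0';
        // strncpy does not add one when the source is longer than the count.
        kmers[i][0] = (char*) malloc(kmer_size+1);
        kmers[i][1] = (char*) malloc(kmer_size);
        kmers[i][2] = (char*) malloc(kmer_size);
        strncpy(kmers[i][0], seq+i, kmer_size);
        kmers[i][0][kmer_size] = '\0';
        strncpy(kmers[i][1], kmers[i][0], kmer_size-1);
        kmers[i][1][kmer_size-1] = '\0';
        strncpy(kmers[i][2], kmers[i][0]+1, kmer_size-1);
        kmers[i][2][kmer_size-1] = '\0';
        cout<<kmers[i][0]<<" : "<<kmers[i][1]<<" "<<kmers[i][2]<<endl;
    }
    cout<<"==========="<<endl;
}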