Background: I'm a rookie in C++.
Input file: 1 million lines like this one:
FCC5G2YACXX:5:1101:1224:2059#NNNNNNNN 97 genome 96003934 24 118M4D11M = 96004135 0 GCA....ACG P\..GW^EO AS:i:-28 XN:i:0 XM:i:2 XO:i:1 XG:i:4 NM:i:6 MD:Z:54G53T9^TACA11 YT:Z:UP
Expected output:
96003934 98.31
Explanation of the output:
Column 4: 96003934
Column 18: MD:Z:54G53T9^TACA11
match = 54+53+9 = 116
mismatch = count_letter(54G53T9) = 2
id = 116*100 / (116+2) = 98.30508474576272
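To make the arithmetic above concrete, here is a minimal sketch of the same rule as a standalone C++ function. The name `md_identity` is hypothetical (it is not part of my program), and it assumes the field always looks like `MD:Z:...`, with only the part before the first `^` being counted.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original program): percent identity
// computed from an MD field such as "MD:Z:54G53T9^TACA11".
// Only the text between the last ':' and the first '^' is considered.
double md_identity(const std::string& md_field) {
    std::size_t start = md_field.find_last_of(':');
    start = (start == std::string::npos) ? 0 : start + 1;
    std::size_t end = md_field.find('^', start);
    if (end == std::string::npos) end = md_field.size();

    long matches = 0;     // sum of the numeric runs: 54 + 53 + 9
    long mismatches = 0;  // number of non-digit characters: G, T
    long run = 0;         // numeric run currently being read
    for (std::size_t i = start; i < end; ++i) {
        unsigned char c = static_cast<unsigned char>(md_field[i]);
        if (std::isdigit(c)) {
            run = run * 10 + (c - '0');
        } else {
            matches += run;  // a letter closes the current run
            run = 0;
            ++mismatches;
        }
    }
    matches += run;  // add the last run

    long total = matches + mismatches;
    return total == 0 ? 0.0 : 100.0 * matches / total;
}
```

For the sample line, `md_identity("MD:Z:54G53T9^TACA11")` gives 116 matches and 2 mismatches, which is the 98.305... shown above.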
awk script
awk '{
    split($18, v, /[\^:]/);
    nmatch = split(v[3], vmatch, /[^0-9]/);
    cmatch = 0;
    for (i = 1; i <= nmatch; i++) cmatch += vmatch[i];
    printf("%s"OFS"%.2f\n", $4, cmatch*100/(cmatch+nmatch-1));
}' file.sam
C++ program (I thought it would be faster):
#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <iomanip>
using namespace std;

int main(){
    string line;
    while(getline(cin, line)){
        istringstream iss(line);
        vector<string> columns;
        copy(istream_iterator<string>(iss),   // Split line by spaces
             istream_iterator<string>(),
             back_inserter(columns));

        // I extract information from column 18
        int start = columns[17].find_last_of(':');
        int end = columns[17].find_first_of('^');
        string smatch = columns[17].substr(start+1, end-start-1);
        // I get for example "54G53T9"

        replace(smatch.begin(), smatch.end(), 'A', ' ');
        replace(smatch.begin(), smatch.end(), 'C', ' ');
        replace(smatch.begin(), smatch.end(), 'G', ' ');
        replace(smatch.begin(), smatch.end(), 'T', ' ');
        // I get for example "54 53 9"

        istringstream iss_sum(smatch);
        int n=0, sum=0, count=0;
        while(iss_sum >> n){
            sum += n;
            count++;
        }

        cout << columns[3] << ' ' << fixed << setprecision(2)
             << (float)sum*100 / (sum+count-1) << endl;
    }
}
Benchmark
With 1 million lines of input:
- awk: 0m6.102s
- C++: 0m15.814s
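For reference, this is a sketch of a leaner variant of the same loop that might be worth timing. I have not benchmarked it, so no speedup is claimed. It rests on the assumption that the per-line `istringstream`, the `vector<string>` of all columns, the `endl` flush on every line and the default synchronization with C stdio are the expensive parts. The names `nth_field`, `process` and `run_on_stdio` are made up for illustration. Unlike the program above, it skips lines with fewer than 18 fields and counts every non-digit character as a mismatch instead of using the number of numeric fields minus one.

```cpp
#include <cctype>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <string>

// Hypothetical helper: locates the n-th (1-based) whitespace-separated field
// of `line` as the index range [begin, end). Returns false if there are
// fewer than n fields. No strings are copied.
bool nth_field(const std::string& line, int n,
               std::size_t& begin, std::size_t& end) {
    std::size_t i = 0;
    for (int field = 1; ; ++field) {
        while (i < line.size() && std::isspace(static_cast<unsigned char>(line[i]))) ++i;
        if (i == line.size()) return false;
        std::size_t j = i;
        while (j < line.size() && !std::isspace(static_cast<unsigned char>(line[j]))) ++j;
        if (field == n) { begin = i; end = j; return true; }
        i = j;
    }
}

// Reads SAM-like lines from `in` and writes "<column 4> <identity>" to `out`.
void process(std::istream& in, std::ostream& out) {
    out << std::fixed << std::setprecision(2);  // set the format once
    std::string line;
    while (std::getline(in, line)) {
        std::size_t pb, pe, mb, me;
        if (!nth_field(line, 4, pb, pe) || !nth_field(line, 18, mb, me))
            continue;  // assumption: lines that are too short are skipped

        // Start right after the last ':' of column 18, stop at the first '^'.
        std::size_t start = mb;
        for (std::size_t k = mb; k < me; ++k)
            if (line[k] == ':') start = k + 1;
        std::size_t stop = me;
        for (std::size_t k = start; k < me; ++k)
            if (line[k] == '^') { stop = k; break; }

        long matches = 0, mismatches = 0, run = 0;
        for (std::size_t k = start; k < stop; ++k) {
            unsigned char c = static_cast<unsigned char>(line[k]);
            if (std::isdigit(c)) {
                run = run * 10 + (c - '0');
            } else {
                matches += run;  // a letter closes the current numeric run
                run = 0;
                ++mismatches;
            }
        }
        matches += run;

        long total = matches + mismatches;
        double identity = total == 0 ? 0.0 : 100.0 * matches / total;
        out.write(line.data() + pb, static_cast<std::streamsize>(pe - pb));
        out << ' ' << identity << '\n';  // '\n' instead of endl: no flush per line
    }
}

// Entry point for standard input/output with stdio synchronization disabled.
void run_on_stdio() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(nullptr);
    process(std::cin, std::cout);
}
```

Here `main` would only call `run_on_stdio()`. Taking streams as parameters in `process` is just so the loop can also be fed from a string while testing.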
Question
What am I doing wrong that makes the C++ program so slow? Can I improve it, and if so, how? Should I write it in C instead?
Thanks in advance.