-1

Background: I'm rockie in C++

Input file: 1 millon lines like to

FCC5G2YACXX:5:1101:1224:2059#NNNNNNNN   97  genome  96003934    24  118M4D11M   =   96004135    0   GCA....ACG  P\..GW^EO   AS:i:-28    XN:i:0  XM:i:2  XO:i:1  XG:i:4  NM:i:6  MD:Z:54G53T9^TACA11 YT:Z:UP

Output expected

96003934 98.31

Explanation output

Column 4: 96003934

Column 18: MD:Z:54G53T9^TACA11

match = 54+53+9 = 116

mismatch = count_letter(54G53T9) = 2

id = 116*100 / (116+2) = 98.30508474576272

awk script

awk '{
    split($18,v,/[\^:]/); 
    nmatch = split(v[3],vmatch, /[^0-9]/); 
    cmatch=0; 
    for(i=1; i<=nmatch; i++) cmatch+=vmatch[i]; 
    printf("%s"OFS"%.2f\n", $4, cmatch*100/(cmatch+nmatch-1));
}' file.sam

C++, I thought would be faster

#include <iostream>
#include <string>
#include <vector>
#include <sstream>
#include <algorithm>
#include <iterator>
#include <iomanip>

using namespace std;

int main(){
  string line;
  while(getline(cin, line)){
    istringstream iss(line);
    vector<string> columns;
    copy(istream_iterator<string>(iss),    //Split line by spaces
         istream_iterator<string>(),
         back_inserter(columns));
    //I extract information from column 18
    int start = columns[17].find_last_of(':');
    int end = columns[17].find_first_of('^');
    string smatch = columns[17].substr(start+1, end-start-1);
    // I get for example "54G53T9"
    replace( smatch.begin(), smatch.end(), 'A', ' ');
    replace( smatch.begin(), smatch.end(), 'C', ' ');
    replace( smatch.begin(), smatch.end(), 'G', ' ');
    replace( smatch.begin(), smatch.end(), 'T', ' ');
    // I get for example "54 53 9"
    istringstream iss_sum(smatch);
    int n=0, sum=0, count=0;
    while(iss_sum >> n){
      sum += n;
      count++;
    }
    cout << columns[3] << ' ' << fixed << setprecision(2) 
         << (float)sum*100 / (sum+count-1) << endl;
  }
}

Benchmark

with 1 millon of lines in input ....

  • awk: 0m6.102s
  • C++: 0m15.814s

Question

what am I doing wrong so that C++ works slowly ? ..... can I improve C++ program? if yes, how? ..... should I write in C ? ....

thank in advance

Jose Ricardo Bustos M.
  • 8,016
  • 6
  • 40
  • 62
  • 2
    You expect your program that you wrote in probably a couple days to be faster than a 38 year old piece of software? – Ryan Sep 23 '15 at 22:37
  • @self ja ja ja, if maybe that's my problem ..... I thought, I could be a problem with my handling of `C++` – Jose Ricardo Bustos M. Sep 23 '15 at 22:40
  • 3
    Regarding the C++ code, there are a few variables declared inside the while loop (like the vector), this is constant memory allocation. If you can move some out of the loop and reuse them then it may speed things up. – proycon Sep 23 '15 at 22:41
  • maybe prejudice with scripted and compilate .... – Jose Ricardo Bustos M. Sep 23 '15 at 22:42
  • 5
    C++ is not automagically faster. I think you've got some performance bottlenecks in your code here, like how you call replace four times, each time scanning the entire string, and a whole lot of other repeated scanning and general flailing. Why not load in a good regular expression library? If you need absolute speed, you should write a state machine that iterates character by character. – tadman Sep 23 '15 at 22:42
  • @tadman thanks a lot .... "state machine" is out of my background, I do not know how to start ..... I belive, I copy a lot of string (user2333757 response) and it's expensive ..... jokes aside, thanks for guiding me – Jose Ricardo Bustos M. Sep 23 '15 at 22:56
  • c code usually run faster then c++. and your code is not so efficient. For example: you have 4 replaces, you should try to use 1 loop instead, with `cur_pos = find_first_of("ACGT", cur_pos);` with all the options. – SHR Sep 23 '15 at 22:59
  • The first thing to do is to move the first `istringstream` outside of the loop (reset and clear it inside the loop each time). Then, stop using the second instringstream. It is one of the slowest ways to do int-string conversion. See [here](http://stackoverflow.com/questions/16826422/c-most-efficient-way-to-convert-string-to-int-faster-than-atoi) or [here](http://stackoverflow.com/questions/194465/how-to-parse-a-string-to-an-int-in-c). `cout` is probably slower than `printf`. Make sure you use `-O3`. Also, the ACGT replacement seems redundant, you could leave that out and just read ints – M.M Sep 23 '15 at 23:24
  • where is the ACGT bit in the awk program? – M.M Sep 23 '15 at 23:25
  • @M.M `nmatch = split(v[3],vmatch, /[^0-9]/); ` split with regular expression `/[^0-9]/` – Jose Ricardo Bustos M. Sep 23 '15 at 23:28
  • @M.M thanks for advices .... in answer my solution improve – Jose Ricardo Bustos M. Sep 24 '15 at 00:14
  • performance questions getting much love ;-) <3 thanks a lot everybody – Jose Ricardo Bustos M. Sep 24 '15 at 02:54
  • 2
    Too much memory allocations. They are slow. Try to use existing memory buffers, do not allocate at each step of the loop. – Denis Cheremisov Jul 22 '17 at 20:49

3 Answers3

5

C++ iostreams don't really provide a good way of checking that a column exists in some input, but otherwise ignoring it. C++ iostreams have an ignore, but it doesn't fit this particular case very well, so it probably won't help.

That being the case, I'd at least consider using scanf instead, possibly something on this general order:

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <string>
#include <iostream>
#include <numeric>

int main() {
    char column4[256];
    char column17[256];

    while (2 == scanf("%*s %*s %*s %255s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %255s %*s", column4, column17)) {
        char *beg = strrchr(column17, ':') + 1;
        char *end = strchr(column17, '^');

        *end = '\0';

        int nums[5];

        int count = sscanf(beg, "%d%*[A-Z]%d%*[A-Z]%d%*[A-Z]%d%*[A-Z]%d", nums, nums + 1, nums + 2, nums + 3, nums + 4);


        int sum = std::accumulate(nums, nums + count, 0);

        double result = (sum*100.0) / (sum + count-1);
        printf("%s %2.2f\n", column4, result);
    }
}

For the moment, this assumes (perhaps incorrectly, but I had to guess at something) that the 17th (or, I count it as 18th, but whatever) column can be ignored from the beginning up to the last colon (:). Then there's some arbitrary number of repetitions of number, then letter, another number, another letter, and so on (presumed for the moment to start and end with numbers). For the moment, I've allowed for up to 5 numbers, but allowing more would be trivial. Allowing for more variation in the pattern might take a little more work (depending on what sort of variation can happen.

For a little more speed improvement, you could use a somewhat larger input buffer, something like this:

setvbuf(stdin, NULL, _IOFBF, 65536);

You need/want to do this before reading anything, so it would go before the while loop. Exactly how much good this will do (if any) seems to vary, but it's easy enough to do that it's worth at least trying it to see if it makes any difference.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • thanks a lot ..... your time with 1 millon of line is 0m1.578s ..... I also improve my code (view answers) ..... I belive `scanf` does a big difference – Jose Ricardo Bustos M. Sep 24 '15 at 00:12
  • Note: 1.5s with first algorithm – Jose Ricardo Bustos M. Sep 24 '15 at 00:16
  • 0m1.671s second algorithm – Jose Ricardo Bustos M. Sep 24 '15 at 00:19
  • @JoseRicardoBustosM.: I have added one more bit that might give a little more improvement--though in all honesty, if you're already in the 1.x second range at processing a file of ~200 megabytes, I'd be a little surprised to see a really huge improvement (but I'm seeing another ~20% or so, which is probably worthwhile, considering it's only one more line of code). – Jerry Coffin Sep 24 '15 at 00:48
  • with `setvbuf(stdin, NULL, _IOFBF, 65536);` exits improvement, this is little difference .... may be, I still do not use it well ..... I'll read about it .... once again thank a lot – Jose Ricardo Bustos M. Sep 24 '15 at 01:21
2

My code improve, time 8s

#include <iostream>
#include <string>
#include <iomanip>
#include <stdio.h>

using namespace std;

int main(){
  string record, col4;
  int sum, count, c_garbage, i, n;
  char garbage;
  while(cin >> record){
    if(i%19 == 3) col4 = record;
    else if(i%19 == 16){
      sum = 0;
      count = 0;
      c_garbage = 0;
      while(1){
        if(c_garbage == 2){
          cin >> n;
          sum += n;
          count++;
        }
        cin >> garbage;
        if(garbage==':') c_garbage++;
        else if(garbage=='^') break;
      }
      printf("%s %2.2f\n", col4.c_str(), (float)sum*100 / (sum+count-1));
    }
    i++;
  }
}
Jose Ricardo Bustos M.
  • 8,016
  • 6
  • 40
  • 62
0

I always suggest using a profiler when trying to answer performance questions.

Doing a quick review of your code... it's likely poor performance with the string handling in your C++ code.

You're doing a ton of string copying...

getline()
iss()
copy()
substr()

You then do four replaces on smatch.

  • This is basically a long-form comment and doesn't actually propose a solution. Helpful, but there are two other answers that are more comprehensive. – tadman Sep 24 '15 at 16:21