
I want to take a substring and convert it to a long int for further processing. I need to do this for a large number of strings. Currently I use .substr(), as shown in the following test example.

// Example program
#include <cstdlib>   // std::atol
#include <iostream>
#include <string>

int main()
{
  std::string content = "123421341234432231112343212343";
  // Copy the first 18 characters into a temporary string, then convert it.
  unsigned long int sub = std::atol(content.substr(0,18).c_str());
  std::cout << "sub: " << sub << '\n';
  return 0;
}

I want to know the fastest way to do this. It's not always .substr(0,18); it can be any substring of length 18 (or the remaining length if fewer than 18 characters are left), i.e. .substr(i,18).

Edit: There are roughly 30 million strings. As for "fast": I think copying a substring and then converting it to a long int is a slow process, and I want something faster than the .substr() method. To be honest, I want it to be as fast as it can be.

The strings come from a FASTA file, which I read one at a time; I remove the unwanted content with boost::split() and store the wanted content. I then need to make several passes, each extracting different substrings of the string for further processing.

AwaitedOne
  • When you say "large number" what do you mean by that? When you say "fastest way" what do you mean by that? Faster than *what*? What is your base-line "fastness"? How do you measure and profile? Why do you think the code you show isn't the "fastest"? Please elaborate! – Some programmer dude Feb 21 '17 at 11:56
  • @Someprogrammerdude Thanks, kindly see the edit. – AwaitedOne Feb 21 '17 at 12:07
  • You could skip creating std::strings and use `char*`s. If the substring you want is always (0,18) you could use a destructive method on a writable buffer, say: `char content[] = "whatever"; content[18] = 0;` to cut off the end of the string (assuming the buffer holds more than 18 characters). – Steeve Feb 21 '17 at 12:16
  • How do you get the strings? Where do they come from? Can't you just "read" the 18 first characters so you don't have to get a sub-string? And have you *tested* with your method? Measured it? At least do that first so you have a baseline to compare against. Then you can start experimenting and see how your experiments compare to the baseline. Or perhaps the baseline might actually be adequate? – Some programmer dude Feb 21 '17 at 12:21
  • Are the values in the substring guaranteed to be decimal digits? Are the substrings guaranteed to be in range of a long int? Can the numbers be negative? If the answers are yes, yes, no, then a function like `long substr_to_long( const std::string& str, size_t begin, size_t end)` is quite easy to write, and would avoid any copying. – Martin Bonner supports Monica Feb 21 '17 at 12:21
  • Regarding a part of the comment by @MartinBonner: The size of `long` is not fixed, and on some platforms it will be too small to handle such large numbers. For example, on Windows using the Visual Studio C++ compiler, a `long` is 32 bits even on 64-bit systems. Use either `long long`, which is guaranteed to be at least 64 bits, or the explicit fixed-width type `uint64_t`. – Some programmer dude Feb 21 '17 at 12:26
  • @Someprogrammerdude : Good point. I naturally assume that when one starts optimizing, one is prepared to sacrifice portability to achieve acceptable performance - but given the OP doesn't appear to have made any measurements to back up his gut instinct that the naive approach will not be fast enough, that may not be a good assumption. – Martin Bonner supports Monica Feb 21 '17 at 12:36
  • @Someprogrammerdude I have made some more edits. I did a rough measurement while running the program: it takes approx. 80 milliseconds for 700 strings of length 35. – AwaitedOne Feb 21 '17 at 12:44
  • That does indeed seem like a lot of time, especially if extrapolated to around 30 million strings. But then any solution will probably be very time-consuming, even with an optimized version of the custom `atol` function suggested by Martin Bonner. Have you thought about parallelization? Threads? – Some programmer dude Feb 21 '17 at 12:48
  • @MartinBonner Thanks, the strings are always decimal digits. The values always fit in a long int and are never negative. – AwaitedOne Feb 21 '17 at 12:49
  • @Someprogrammerdude Actually, parallelization and threads are on my mind, but I wanted to get the single-threaded code as fast as possible first. – AwaitedOne Feb 21 '17 at 12:53
  • I recommend using a type similar to C++17's `std::string_view` or `gsl::cstring_span`, which have trivial `substr` methods that simply return a new object pointing at a different part of the same string. The caveat is that those views are not null-terminated, and therefore cannot be used with `atol`. `strtol` has the same limitation, so a length-aware conversion such as C++17's `std::from_chars` is the alternative. – KABoissonneault Feb 21 '17 at 14:36
  • @KABoissonneault Thanks. Does std::string_view allow creating a new substring for conversion? I think it does not make a copy. – AwaitedOne Feb 21 '17 at 14:40
  • @MartinBonner As I am learning C++, could you provide an example of what you suggested? – AwaitedOne Feb 22 '17 at 09:04
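
For illustration, here is a minimal sketch of the kind of helper suggested in the comments above. It is not anyone's posted code: the name `substr_to_u64` and its (begin, length) signature are made up for this example, it returns `std::uint64_t` rather than `long` because of the portability point raised above, and it assumes the characters are decimal digits with no sign, as the asker confirmed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: parse the decimal digits in str[begin, begin + len)
// without creating a temporary std::string.
// Assumptions: digits only, no sign, and the value fits in 64 bits
// (true for up to 19 digits; the question uses 18).
std::uint64_t substr_to_u64(const std::string& str,
                            std::size_t begin,
                            std::size_t len = 18)
{
    assert(begin <= str.size());
    // Clamp to the remaining length, like .substr(begin, len) does.
    const std::size_t remaining = str.size() - begin;
    const std::size_t end = begin + (len < remaining ? len : remaining);

    std::uint64_t value = 0;
    for (std::size_t i = begin; i < end; ++i) {
        assert(str[i] >= '0' && str[i] <= '9');
        value = value * 10 + static_cast<std::uint64_t>(str[i] - '0');
    }
    return value;
}
```

With the string from the question, `substr_to_u64(content, 0)` plays the role of `atol(content.substr(0,18).c_str())`, and `substr_to_u64(content, i)` replaces `.substr(i,18)`. Whether this is actually faster should be measured against the baseline.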

1 Answer

Get substring and convert to long integer in a fastest way

... is almost certainly the wrong question.

With the caveats that you should always measure first, and should know what performance you actually need, and that you haven't really given us enough information to help with those:

creating strings and substrings in your current form is likely to be much more expensive than the integer conversion, so you're worrying about the wrong thing. Profiling first would have shown this.

So (after profiling and assuming I guessed correctly), start by eliminating the copying and dynamic allocation: stop using std::string and substr entirely. Work directly on the raw buffer.
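
As a rough sketch of what working on the raw buffer could look like: the code below walks a character buffer that is assumed to hold newline-separated records of decimal digits, and converts a window of each record in place. The names `parse_digits` and `parse_records` are hypothetical, and the buffer layout is an assumption, since the question does not say how the data sits in memory after reading.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Convert the decimal digits in [first, last) to an integer.
// Assumes digits only, no sign, and at most 19 digits.
inline std::uint64_t parse_digits(const char* first, const char* last)
{
    std::uint64_t value = 0;
    for (; first != last; ++first) {
        assert(*first >= '0' && *first <= '9');
        value = value * 10 + static_cast<std::uint64_t>(*first - '0');
    }
    return value;
}

// Hypothetical driver: for every '\n'-separated record in buf, convert up to
// `width` digits starting at `offset`. Records shorter than or equal to
// `offset` are skipped. No std::string and no per-record allocation.
std::vector<std::uint64_t> parse_records(const char* buf, std::size_t size,
                                         std::size_t offset,
                                         std::size_t width = 18)
{
    std::vector<std::uint64_t> out;
    const char* p = buf;
    const char* const end = buf + size;
    while (p < end) {
        // Find the end of the current record.
        const void* nl = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        const char* line_end = nl ? static_cast<const char*>(nl) : end;
        const std::size_t line_len = static_cast<std::size_t>(line_end - p);

        if (offset < line_len) {
            const char* first = p + offset;
            const std::size_t avail = line_len - offset;
            out.push_back(parse_digits(first, first + (width < avail ? width : avail)));
        }
        // Step past the newline, or stop at the end of the buffer.
        p = (line_end == end) ? end : line_end + 1;
    }
    return out;
}
```

Each additional pass over a different window is then just another call with a different `offset`, reading the same bytes again instead of creating new strings.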

Useless
  • Actually, I am not worried about the time taken for the integer conversion, but about the time taken to get the substrings. – AwaitedOne Feb 21 '17 at 14:32
  • I don't know what a raw buffer is; it would be great if you could provide a small example. – AwaitedOne Feb 21 '17 at 14:57
  • 1. Choose an I/O system to use (`iostream`, `cstdio` or something else). 2. Look at what access it provides to the input buffer (`rdbuf` if you chose `ifstream`, or a char array owned by you if you just called `fread`). 3. Use that as directly as possible without copying things into strings, and then copying bits of those strings to other strings. – Useless Feb 21 '17 at 15:13