any one know how to convert a huge char array to float, very huge array, performance better than the atof/strtod/sscanf

Question

I got a char array, a huge array char p[n] read from a txt like.

//1.txt
194.919 -241.808 234.896
195.569 -246.179 234.482
194.919 -241.808 234.896
...

foo(char *p, float x, float y, float z) {

}

I tried to use atof, strtod, but they are real time consuming when the array is too huge, because they will call the strlen(). and the sscanf is also very slow....

I debug into the code and find that both atof() and strtod call the strlen() in the visual studio, we can check the crt code.

strtod() call:
        answer = _fltin2( &answerstruct, ptr, (int)strlen(ptr), 0, 0, _loc_update.GetLocaleT());


atof() call:
        return( *(double *)&(_fltin2( &fltstruct, nptr, (int)strlen(nptr), 0, 0, _loc_update.GetLocaleT())->dval) );

I also try to use strtok, but we should not change any data in the 1.txt.

so any one have the best way to convert all these to float x, y, z.

Visual studio 2008 + WIN7

double __cdecl _atof_l( REG1 const char *nptr, _locale_t plocinfo ) { ...... /* let _fltin routine do the rest of the work */ return( *(double *)&(_fltin2( &fltstruct, nptr, (int)strlen(nptr), 0, 0, _loc_update.GetLocaleT())->dval) ); } — user236515, Jan 09 '10 at 15:51
Actually, r9r9r9 is correct. strtod *does* call strlen on the pointer that is passed. Just single-stepping through a strtod call quickly brings this up. It's not a one-time initialization type deal either; happens on every call to strtod. Inexcusable. — Tarydon, Jan 09 '10 at 16:38
@r9r9r9: I know it's not useful for your project, but I just stepped through the VC2010 runtime library and found that the new strtod does *not* call strlen. It runs much faster, especially with long strings. — Tarydon, Jan 09 '10 at 16:43
Ouch. As a workaround, you could perhaps copy "enough" (say 30) characters into a buffer of your own, NUL-terminate it, call strtod on that, work out how many chars were actually consumed by the call, move the remainder down to the start of the buffer, and refill the rest of the buffer. Nasty, but at least it's O(N) rather than the O(N^2) you get with a strlen on every call. Or manually split on whitespace (or newlines) and copy out one value (three values) at a time, if the file format is reliable enough for that. — Steve Jessop, Jan 09 '10 at 18:38

Lior Kogan · Answer 1 · 2010-01-09T18:50:50.620

Check out this code.

It can be further optimized if there's no need to support scientific representation, '+' sign, or leading tabs.

It doesn't use strlen, or any other standard library string routine.

// convert floating-point value in string represention to it's numerical value
// return false if NaN
// F is float/double
// T is char or wchar_t
// '1234.567' -> 1234.567
template <class F, class T> inline bool StrToDouble(const T* pczSrc, F& f)
{
    f= 0;

    if (!pczSrc)
        return false;

    while ((32 == *pczSrc) || (9 == *pczSrc))
        pczSrc++;

    bool bNegative= (_T('-') == *pczSrc);

    if ( (_T('-') == *pczSrc) || (_T('+') == *pczSrc) )
        pczSrc++;

    if ( (*pczSrc < _T('0')) || (*pczSrc > _T('9')) )
        return false;

    // todo: return false if number of digits is too large

    while ( (*pczSrc >= _T('0')) && (*pczSrc<=_T('9')) )
    {
        f= f*10. + (*pczSrc-_T('0'));
        pczSrc++;
    }

    if (_T('.') == *pczSrc)
    {
        pczSrc++;

        double e= 0.;
        double g= 1.;

        while ( (*pczSrc >= _T('0')) && (*pczSrc<=_T('9')) )
        {
            e= e*10. + (*pczSrc-_T('0'));
            g= g*10.                    ;
            pczSrc++;
        }

        f+= e/g;
    }

    if ( (_T('e') == *pczSrc) || (_T('E') == *pczSrc) ) // exponent, such in 7.32e-2
    {
        pczSrc++;

        bool bNegativeExp= (_T('-') == *pczSrc);

        if ( (_T('-') == *pczSrc) || (_T('+') == *pczSrc) )
            pczSrc++;

        int nExp= 0;
        while ( (*pczSrc >= _T('0')) && (*pczSrc <= _T('9')) )
        {
            nExp= nExp*10 + (*pczSrc-_T('0'));
            pczSrc++;
        }

        if (bNegativeExp)
            nExp= -nExp;

        // todo: return false if exponent / number of digits of exponent is too large

        f*= pow(10., nExp);
    }

    if (bNegative)
        f= -f;

    return true;
}

score 1 · Answer 2 · answered Jan 09 '10 at 16:32

If you can make additional assumptions about the format of the floating point values, parsing them yourself might increase performance.

Example code for parsing ' ' or '\n'-separated values without exponents and no input validation:

float parsef(const char **str)
{
    const char *cc = *str;

    _Bool neg = (*cc == '-');
    if(neg) ++cc;

    float value = 0, e = 1;

    for(; *cc != '.'; ++cc)
    {
        if(*cc == ' ' || *cc == '\n' || !*cc)
        {
            *str = cc;
            return neg ? -value : value;
        }

        value *= 10;
        value += *cc - '0';
    }

    for(++cc;; ++cc)
    {
        if(*cc == ' ' || *cc == '\n' || !*cc)
        {
            *str = cc;
            return neg ? -value : value;
        }

        e /= 10;
        value += (*cc - '0') * e;
    }
}

Example code:

const char *str = "42 -15.4\n23.001";
do printf("%f\n", parsef(&str));
while(*str++);

score 1 · Answer 3 · answered Jan 09 '10 at 17:16

Okay, how about doing the tokenization yourself and then calling strtod.

What I'm thinking is something like this:

char *current = ...;  // initialited to the head of your character array
while (*current != '\0')
{
    char buffer[64];
    unsigned int idx = 0;

    // copy over current number
    while (*current != '\0' && !isspace(*current))
    {
        buffer[idx++] = *current++;
    }
    buffer[idx] = '\0';

    // move forward to next number
    while (*current != '\0' && isspace(*current))
    {
        current++;
    }

    // use strtod to convert buffer   
}

Some issues with this is the tokenization is very simple. It will work for the format you posted, but if the format varies (another line uses : to separate the numbers), it won't work.

Another issue is that the code assumes all numbers have < 64 characters. If they are longer, you'll get a buffer overflow.

Also, the copying to a temporary buffer will add some overhead (but hopefully less then the overhead of constantly doing a strlen on the entire buffer). I know you said you can't change the original buffer, but can you do a temporary change (i.e. the buffer can change as as long as you return it to it's original state before you return):

char *current = ...;  // initialited to the head of your character array
while (*current != '\0')
{
    char *next_sep = current;
    while (*next_sep != '\0' && !isspace(*next_sep))
    {
        next_sep++;
    }

    // save the separator before overwriting it
    char tmp = *next_sep;
    *next_sep = '\0';

    // use strtod on current

   // Restore the separator.
   *next_sep = tmp;

    current = next_sep;

    // move forward to next number
    while (*current != '\0' && isspace(*current))
    {
        current++;
    }
}

This technique means no copying and no worries about buffer overflow. You do need to temporarily modify the buffer; hopefully that is

thanks, the problem is we should not change the input data, and if we copy one backup, it will still slow. if we allow to change the data, strtok() will be a good way. — user236515, Jan 09 '10 at 17:20
Copying a small portion of the buffer for each float to convert shouldn't be too slow - as long as you only copy each character a fixed number of times, it'll overall be an O(N) operation (instead of the O(N^2) behaviour you're getting with `strtod`). Copying a few characters is a tiny overhead compared to the actual floating point conversion. — caf, Jan 11 '10 at 01:13

score 0 · Answer 4 · answered Jan 09 '10 at 15:38

0

As long as you are not using a particularly bad standard library (impossible these times, they are all good) it's not possible to do it faster than atof.

answered Jan 09 '10 at 15:38

Andreas Bonini

44,018
30
122
156

if i want to use the atof(), i must impore the process of strlen(), that means i must fill the char array with lot of '\0', but we should not change the data. – user236515 Jan 09 '10 at 15:44
As the comments to the question now reveal: turns out MSVC has an "impossible" bad standard library. Interpreting his comment here: r9r9r9 gets crummy performance unless he can nul-terminate a chunk of his array at a time. Which he can't do, because it's not modifiable. – Steve Jessop Jan 09 '10 at 18:32

jason · Answer 5 · 2010-01-09T15:48:21.253

0

Use strtod. It almost certainly does not call strlen. Why would it need to know the length of the input? It merely runs past leading whitespace, then consumes as many characters as possible that make sense for a floating point literal, and then returns a pointer just past that. You can see an example implementation Perhaps you're using it non-optimally? Here's a sample of how to use strtod:

#include <stdio.h>
#include <stdlib.h>
int main() {
    char *p = "1.txt 194.919 -241.808 234.896 195.569 -246.179 234.482 194.919 -241.808 234.896";
    char *end = p;
    char *q;
    double d;
    while(*end++ != ' '); // move past "1.txt"
    do {
        q = end; 
        d = strtod(q, &end);
        printf("%g\n", d);
    } while(*end != '\0');
}

This outputs:

on my machine.

edited Jan 09 '10 at 15:48

answered Jan 09 '10 at 15:40

jason

236,483
35
423
525

Of course I read his question. I think he is wrong about `strtod` calling `strlen` and believe that he might be using it incorrectly. Therefore I gave him a sample implementation as well as commenting about `strtod` not using `strlen`. – jason Jan 09 '10 at 15:45
may be i was lost, maybe the strtod() didn't call the strlen(), but it is still too slow for me the atof call the strlen()... – user236515 Jan 09 '10 at 15:50
First, the above is about as fast as you're going to get. Second, I doubt that `atof` calls `strlen`. – jason Jan 09 '10 at 15:54
thanks for the implementation code, i agree with you ,atof/strtod should not call the strlen, but they really did that, i debug the code and find that. – user236515 Jan 09 '10 at 16:20
@Jason: r9r9r9 is right. strtod does call strlen in the VC2008 runtime library – Tarydon Jan 09 '10 at 16:39
@Tarydon: Wow, that is just awful (and unnecessary). – jason Jan 09 '10 at 16:40
@r9r9r9: Okay, if you're sure that they use `strlen`, then I would suggest just writing your own `strtod`. Start with the implementation that I linked to already. – jason Jan 09 '10 at 16:41

score 0 · Answer 6 · answered Jan 09 '10 at 15:42

0

I don't see any reason why strod() should call strlen(). Of course it might, but nothing in its specification requires it and I'd be suprised if it did. And I'd say that strtod() about as fast as you'll get, short of writing some FPU processor-specific stuff yourself.

answered Jan 09 '10 at 15:42

yes, i don't think we should call the strlen(), so i think there is better way than the CRT. – user236515 Jan 09 '10 at 15:58
i means, i don't think the atof/strtod should call the strlen, but they really did, i debug the code the find that. so i think there is better way. – user236515 Jan 09 '10 at 16:04

score 0 · Answer 7 · answered Jan 09 '10 at 15:45

Why do you think atof, strtod use strlen? I've never implemented them, but I can't imagine why they'd need to know the length of the input string. It would be of no value to them. I'd use strtod as per Jason's answer. That's what it's for.

And yes, if you have a very large amount of text, it's going to take some time to convert. That's just the way it is.

score 0 · Answer 8 · answered Jan 09 '10 at 16:00

0

As others have said, I don't think you're going to do much better than the standard library calls. They have been around for a long time and are quite highly optimized (well, they should be, at least in good implementations).

That said, there are some things that aren't clear to me. Are you reading the whole file into memory and then converting the array to another array? If so, you might want to check that the system you are running on has enough memory to do that with swapping. If you are doing this, would it be possible to just convert one line at a time as you read them off disk instead of storing them?

You could consider multithreading your program. One thread to read and buffer lines off disk, and n threads to process the lines. Dr. Dobb's Journal published a great single-reader/single-writer lockless queue implementation you could use. I've used this in a similar app. My worker threads each have an input queue, and then reader thread reads data off disk and places them into these queues in round robin style.

answered Jan 09 '10 at 16:00

jbl

2,710
3
18
13

yes, i used the file mapping(CreateFileMapping) to get all the data into the memory, and there is enough memory. yes, multithreading is a way, but each thread still to handle this. – user236515 Jan 09 '10 at 16:07
It's wrong to assume you can't do better than standard library calls in this case. The library calls are small and general-purpose. If speed matters more than anything else, you can do dramatically better. – dmazzoni Jan 09 '10 at 18:25
@dmazzoni - Fair enough. There wasn't much info in the OP, so I assumed the general case in which the input is basically unknown. – jbl Jan 09 '10 at 20:22
EDIT: Oops, I see this has pretty much already been suggested. @r9r9r9 It might be worth testing a quick implementation that reads the file in line by line into a buffer with, say, fgets(), which will null terminate input for you. This would keep the data on disk unmodified. – jbl Jan 09 '10 at 20:24
@jbl: yes, i have tried thousand way, and find out that, fscanf is better that fget+sscanf, and use the strtok() is better that fscanf. – user236515 Jan 10 '10 at 03:46

score 0 · Answer 9 · answered Jan 09 '10 at 16:11

How about something like:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static float frac[] =
{
    0.000,
    0.001,
    0.002,
    ...               // fill in
    0.997,
    0.998,
    0.999,
};

static float exp[] =
{
    1e-38,
    1e-37,
    1e-36,
    ...               // fill in
    1e+36,
    1e+37,
    1e+38,
};

float cvt(char* p)
{
    char* d = strchr(p, '.');   // Find the decimal point.
    char* e = strchr(p, 'e');   // Find the exponent.
    if (e == NULL)
        e = strchr(p, 'E');

    float num = atoi(p);
    if (num > 0) {
        num += frac[atoi(d + 1)];
    } else {
        num -= frac[atoi(d + 1)];
    }
    if (e)
        num *= exp[atoi(e)];
    return num;
}

int main()
{
    char line[100];
    while(gets(line)) {
        printf("in %s, out %g\n", line, cvt(line));
    }
}

Should be good to three significant digits.

Edit: watch out for big mantissas.

Edit again: and negative exponents. :-(

Good point, you'd have to manipulate the fraction a little bit. — Richard Pennington, Jan 09 '10 at 16:22
If you want to roll your own, it's probably better to find a free (as in speech) version of atof.c online rather than start from scratch. — Steve Jessop, Jan 09 '10 at 18:35

score 0 · Answer 10 · answered Jan 10 '10 at 18:11

0

I doubt if strlen is costing you much.

If you can take advantage of your numbers falling in a relatively restricted range, then what I suggest is to parse it yourself, doing as little computation as possible, such as:

#define DIGIT(c) ((c)>='0' && (c)<='9')

BOOL parseNum(char* *p0, float *f){
  char* p = *p0;
  int n = 0, frac = 1;
  BOOL bNeg = FALSE;
  while(*p == ' ') p++;
  if (*p == '-'){p++; bNeg = TRUE;}
  if (!(DIGIT(*p) || *p=='.')) return FALSE;
  while(DIGIT(*p)){
    n = n * 10 + (*p++ - '0');
  }
  if (*p == '.'){
    p++;
    while(DIGIT(*p)){
      n = n * 10 + (*p++ - '0');
      frac *= 10;
    }
  }
  *f = (float)n/(float)frac;
  if (bNeg) *f = -*f;
  *p0 = p;
  return TRUE;
}

answered Jan 10 '10 at 18:11

Mike Dunlavey

40,059
14
91
135

`strlen` is costing him a lot, because it's being called on the entire remaining buffer for each floating point value in the buffer. This turns the whole operation into an O(N^2) deal. – caf Jan 11 '10 at 01:08
@caf: Yeah that seems kind of silly. A profiler or stackshot would pick that out real quick. – Mike Dunlavey Jan 11 '10 at 01:14
... FYI stackshots: http://stackoverflow.com/questions/406760/whats-your-most-controversial-programming-opinion/1562802#1562802 – Mike Dunlavey Jan 11 '10 at 01:17

any one know how to convert a huge char array to float, very huge array, performance better than the atof/strtod/sscanf

10 Answers10