4

I am reading huge data from a file as:

//abc.txt

10  12  14  15  129

-12 14 -18  -900 -1234

145 12 13
12

32 68 51 76 -59 -025 

- - - - etc

void fun(char *p, int x, int y, int z) {

}

I have tried using atoi and strtok, but they are very time-consuming when the array is huge, and sscanf is also very slow.

How can I improve the performance for huge data?

I am using strtok for parsing. I am looking for a fast method to parse each line.

I am reading the file line by line and then parsing each line as follows:

 char *ptr;
 ptr = strtok(str, " ");
 while (ptr != NULL)
 {
     int value1 = atoi(ptr);
     ptr = strtok(NULL, " ");
 }
  • Is there any fast way to parse the string into ints?
  • Is there any alternative approach that would be faster than the above code? I am using atoi to convert char * to int.
  • Can I use another, faster method to convert char * to int?
Rafał Rawicki
  • Did you try rolling out a hand-written lexer? – dirkgently Jun 18 '12 at 05:41
  • Or perhaps one generated by 'flex'. It knows tricks like building lookup tables. – phs Jun 18 '12 at 05:46
  • Read the data into a big buffer (1 MB) with fread, then parse the data with your own function (you can strip off all types of checking). But I recommend that you use parsing functions from stdlib to be safe. – nhahtdh Jun 18 '12 at 05:47
  • `I am reading huge data from a file as:` How huge? 1 gig? I/O is a killer; I suggest, instead of reading line by line, loading as much as you can into memory before you begin parsing it. – Samy Vilar Jun 18 '12 at 06:04
  • Can you mention the reason you want to read the data? I am asking because perhaps we could give a better solution. E.g., the last time I was reading such a large amount of data, I was doing some statistical calculations, so I used the R programming language. R was suitable for me because it has statistical functions for the calculations I wanted to do. So, please mention the task that you are trying to accomplish. – Chani Jun 18 '12 at 06:37
  • Are you certain that parsing is the bottleneck and you're not just I/O bound? What are you doing with the data as you parse it? [That's what people were asking last time you posted this.](http://stackoverflow.com/questions/11039935/) – Blastfurnace Jun 18 '12 at 06:50
  • Thanks Samy, I am reading almost 8o M. I will try reading it into memory. – Shailesh Kumar Gupta Jun 18 '12 at 10:50
  • Is 8oM supposed to be 80MBytes? That is not huge... – RedX Jun 18 '12 at 13:26
  • @nhahtdh: On Windows with default settings, you get virtually no benefit once your buffer is more than 4096 bytes. 1MB is overkill. – Mooing Duck Jun 18 '12 at 16:00
  • @MooingDuck: Not sure about that, since I have never tried to read data more than once. What situation are you talking about? Using fread as I mentioned? – nhahtdh Jun 19 '12 at 01:04
  • @RedX: The fread + own parsing function starts to have some effect from 1MB, and there will be quite some difference when it reaches 4-8MB onwards. At least on the online judge I used to participate in. – nhahtdh Jun 19 '12 at 01:18
  • @nhahtdh: I read that the method made little difference, and the big thing is the operating system's page size. – Mooing Duck Jun 19 '12 at 04:53
  • @MooingDuck: Do you mind citing the source? Not that I don't believe it, but just want to read the whole argument. – nhahtdh Jun 19 '12 at 04:59
  • @nhahtdh: http://stackoverflow.com/questions/3033771/file-io-with-streams-best-memory-buffer-size, and I found many many pages suggesting 1k-8k, none suggested higher than that. – Mooing Duck Jun 19 '12 at 06:32
  • @MooingDuck: From this article (near the end) http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly - you are probably right about the buffer size. I never tested with smaller buffer in my case when I do IO optimization. – nhahtdh Jun 19 '12 at 06:49

5 Answers

3

To convert an ASCII string to an integer value, you cannot get much faster than what atoi is doing, but you may be able to speed things up by implementing a conversion function that you use inline. The version below increments the pointer past the digits scanned, so it doesn't match atoi semantics, but it should help improve parser efficiency, as illustrated in the parsing loop that follows. (Error checking is obviously lacking, so add it if you need it.)

#include <ctype.h>

// Parse an optional leading '-' and a run of digits, advancing s past them.
static inline int my_parsing_atoi(const char *&s) {
    if (s) {
        bool neg = false;
        int val = 0;
        if (*s == '-') { neg = true; ++s; }
        for (; isdigit((unsigned char)*s); ++s) val = 10*val + (*s - '0');
        return neg ? -val : val;
    }
    return 0;
}

const char *p = input_line;
if (p) {
    p += strspn(p, " ");              // skip leading spaces
    while (*p) {
        int value1 = my_parsing_atoi(p);
        // ... use value1 ...
        p += strspn(p, " ");          // skip spaces between numbers
    }
}

Make sure you have profiled your code properly so that you know that your routine is compute bound and not I/O bound. Most of the time, you will be I/O bound, and the suggestions below are ways to mitigate it.

If you are using the C or C++ file reading routines, such as fread or fstream, you should be getting buffered reads, which should already be pretty efficient, but you can try to use underlying OS calls, such as POSIX read, to read the file in larger blocks at a time and speed up file reading.

To be really fancy, you can perform an asynchronous read of your file while you are processing it, either by using threads or by using aio_read. You can even use mmap, and that will remove some data copying overhead, but if the file is extremely large, you will need to manage the map so that you munmap the portions of the file that have been scanned and mmap in the new portion to be scanned.
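For concreteness, here is a minimal sketch of the block-reading idea using POSIX read; the 1 MB block size and the parse_buffer stub are assumptions for illustration, and a real parser would also have to handle numbers that straddle block boundaries:

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical parser stub: plug in something like my_parsing_atoi above.
static void parse_buffer(const char *data, ssize_t len) { (void)data; (void)len; }

int main() {
    static char buf[1 << 20];                    // 1 MB block (illustrative size)
    int fd = open("abc.txt", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)  // one system call per large block
        parse_buffer(buf, n);
    close(fd);
    return 0;
}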

I benchmarked my parse routine above and the OP's routine using code that looked like this:

#include <sys/times.h>
#include <iostream>
#include <vector>

clock_t before_real;
clock_t after_real;
struct tms before;
struct tms after;
std::vector<char *> numbers;
make_numbers(numbers);               // fills the vector with number strings
before_real = times(&before);
for (size_t i = 0; i < numbers.size(); ++i) {
    parse(numbers[i]);               // either routine under test
}
after_real = times(&after);
std::cout << "user: " << after.tms_utime - before.tms_utime
          << std::endl;
std::cout << "real: " << after_real - before_real
          << std::endl;

The difference between real and user is that real is wall-clock time, while user is the CPU time the process actually spent executing (so time during which the process was switched out is not counted against it).

My tests had my routine running almost twice as fast as the OP's routine (compiled with g++ -O3 on a 64 bit Linux system).

jxh
  • What you mentioned about atoi is quite true from my experience (I used strtol, but they should be more or less the same). However, I don't think the buffering in C/C++ is good enough for large files. I generally get better results with fread than with the C stdio scanning routines. – nhahtdh Jun 18 '12 at 06:33
  • @nhahtdh: Thanks for the comment. I was talking about `fread`, not the scanning variety. I'll clarify the post. Regards – jxh Jun 18 '12 at 06:39
  • Thanks sure i will check with with my_parsing_atoi() function – Shailesh Kumar Gupta Jun 18 '12 at 10:54
  • I have checked my_parsing_atoi() and strspn with GlowCode. It is taking more time than the above code using strtok and atoi. – Shailesh Kumar Gupta Jun 18 '12 at 11:44
  • @user1457091: I was checking for lines with arbitrary whitespace. If you know there will be nothing but spaces, remove `\t\n` from the `strspn` calls. My tests show the technique above to be twice as fast as `strtok` + `atoi`. – jxh Jun 18 '12 at 15:36
3

You are looking in the wrong place. It isn't the parsing that is the issue, unless you are doing something truly bizarre. On a modern multi-GHz CPU, the cycles needed per line are tiny. What kills performance is physical I/O: spinning disks deliver throughput on the order of tens of MB per second.

I also doubt that the issue is the physical read of the file as this will be efficiently cached in the file system cache.

No, as samy.vilar hints, the issue is almost certainly a virtual memory one:

...the array is too huge...

Use the system monitor/psinfo/top to look at your app. Almost certainly it is growing a large working set as it builds up an in-memory array and your OS is paging this to disk.

So forget reading as an issue. Your real issue is how to manipulate huge data sets in memory. The approaches here are various:

  • Don't. Batch up the data and manipulate batches (see the sketch below).
  • Use space-efficient storage (e.g. compact elements).
  • Allocate more memory resources.

There are many discussions around this on SO.
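As a minimal sketch of the batching option, assuming a hypothetical process_batch() consumer and an illustrative batch size:

#include <cstdio>
#include <vector>

// Hypothetical consumer: works on one bounded chunk of values at a time.
static void process_batch(const std::vector<int> &batch) { (void)batch; }

int main() {
    const std::size_t BATCH = 100000;   // illustrative size; keeps the working set small
    std::vector<int> batch;
    batch.reserve(BATCH);
    FILE *f = fopen("abc.txt", "r");
    if (!f) return 1;
    int v;
    while (fscanf(f, "%d", &v) == 1) {
        batch.push_back(v);
        if (batch.size() == BATCH) {    // process and discard each batch instead
            process_batch(batch);       // of building one huge in-memory array
            batch.clear();
        }
    }
    if (!batch.empty()) process_batch(batch);
    fclose(f);
    return 0;
}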

TerryE
  • Batching is the best advice in my experience. The HDD arm swinging between read and write positions kills big-data operations. – Tom Kerr Jun 18 '12 at 22:23
1

If your file is truly huge, then the IO is what's killing you, and not the parsing. Every time you read a line, you're executing a system call, which can be quite expensive.

A more efficient alternative may be to use memory-mapped file I/O. If you're working on a POSIX system such as Linux, you can use the mmap system call, which maps the file all at once and returns a pointer to its location in memory. The memory manager then takes care of reading and swapping the file in/out as you access the data through that pointer.

This would look something like this:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

int fd = open("abc.txt", O_RDONLY);
struct stat sb;
fstat(fd, &sb);                        /* sb.st_size is the file length */
char *ptr = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

but I would strongly advise you to read the man page and find the best options for yourself.

Pedro
  • @user1457091: Windows supports mapped files via the Windows API function [`CreateFileMapping`](http://msdn.microsoft.com/en-us/library/aa366537.aspx), which has an interface similar to the POSIX `mmap` function. – Pedro Jun 18 '12 at 12:01
  • @user1457091: Cool. Could you update your question with timing data for the different solutions proposed here once you've tried them? – Pedro Jun 18 '12 at 16:57
  • Till now the best approach is reading line by line and then calling the handle_numbers_in_buffer() function. I need to try reading the file with the CreateFileMapping function. – Shailesh Kumar Gupta Jun 19 '12 at 09:54
0
  1. If your file contains int numbers, you can use operator>>, but this is a C++-only solution. Something like:

    #include <fstream>

    std::fstream f("abc.txt");
    int value = 0;
    while (f >> value) {
        // use value
    }
    
  2. If you convert your file to contain a binary representation of the numbers, you will have more options for improving performance. Not only does it avoid parsing numbers from strings, but it also gives you other options for accessing your data (for example, using mmap), as sketched below.
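As a minimal sketch of that idea (the abc.bin file name and the one-off converter are assumptions for illustration): convert the text once, and later runs can bulk-load the raw ints with no parsing at all.

#include <cstdio>
#include <vector>

int main() {
    // One-off conversion: text -> binary (host byte order, so not portable).
    FILE *in = fopen("abc.txt", "r");
    FILE *out = fopen("abc.bin", "wb");
    if (!in || !out) return 1;
    int v;
    while (fscanf(in, "%d", &v) == 1)
        fwrite(&v, sizeof v, 1, out);
    fclose(in);
    fclose(out);

    // Fast path: bulk-load the raw ints without any number parsing.
    FILE *f = fopen("abc.bin", "rb");
    if (!f) return 1;
    std::vector<int> chunk(4096);
    std::size_t n;
    while ((n = fread(chunk.data(), sizeof(int), chunk.size(), f)) > 0) {
        // use chunk[0] .. chunk[n-1] here
    }
    fclose(f);
    return 0;
}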

BЈовић
  • I don't think this would be more efficient than the C standard functions. – Shailesh Kumar Gupta Jun 18 '12 at 11:50
  • @user1457091 *I don't think this would be more efficient than the C standard functions* ... because? – BЈовић Jun 18 '12 at 12:01
  • @BЈовић: (1) I've never heard of a C++ library where anyone measured otherwise. (2) Visual Studio's iostreams call printf and family underneath, after much locking and state overhead. (3) The standards committee mentioned at one point that it's _theoretically possible_ to make iostreams as fast as printf and family, but as of that point, no linker could do it. – Mooing Duck Jun 18 '12 at 16:08
0

First of all, a general recommendation is to always use profiling to check that it actually is the translation that is slow, and not something else, such as reading the file physically from disk.

You may be able to improve the performance by writing your own, minimal number-parsing function. strtok modifies the string, so it will not be optimally fast, and if you know that all the numbers are decimal integers, and you don't need any error checking, you can simplify the translation a bit.

Here is some code without strtok that may speed up the processing of a line, if it actually is the translation and not (for example) I/O that is the problem.

#include <ctype.h>

void handle_one_number(int number) {
    // ....
}

void handle_numbers_in_buffer(char *buffer) {
    while (1) {
        while (*buffer != '\0' && isspace(*buffer))   // skip whitespace
            ++buffer;
        if (*buffer == '\0')                          // end of buffer
            return;
        int negative = 0;
        if (*buffer == '-') {                         // optional sign
            negative = 1;
            ++buffer;
        }
        int number = 0;
        while (isdigit(*buffer)) {                    // accumulate digits
            number = number * 10 + *buffer - '0';
            ++buffer;
        }
        if (negative)
            number = -number;
        handle_one_number(number);
    }
}

I actually went and ran some benchmarks. I had expected the I/O to be dominant, but it turns out that (with the usual caveat about "on my particular system, with my particular compiler") the parsing of numbers takes quite a lot of time.

By changing from the strtok version to my code above, I managed to improve the time for the translation of 100 million numbers (with the text already in memory) from 5.2 seconds to around 1.1 seconds. When reading from a slow disk (Caviar Green) I measured an improvement from 5.9 seconds to 3.5 seconds. When reading from an SSD, I measured an improvement from 5.8 to 1.8 seconds.
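For illustration, a harness along these lines would reproduce the in-memory case; the details here are my assumption, and only handle_numbers_in_buffer comes from the code above:

#include <cstdio>
#include <vector>

void handle_numbers_in_buffer(char *buffer);  // defined earlier in this answer

int main() {
    FILE *f = fopen("abc.txt", "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);                    // find the file size
    long size = ftell(f);
    rewind(f);
    std::vector<char> buffer(size + 1);
    if (fread(buffer.data(), 1, size, f) != (std::size_t)size) return 1;
    buffer[size] = '\0';                      // NUL-terminate for the parser
    fclose(f);
    handle_numbers_in_buffer(buffer.data());  // translate everything in memory
    return 0;
}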

I also tried reading the file directly, using while (fscanf(f, "%d", ....) == 1) ...., but that turned out to be much slower (10 seconds), probably since fscanf is thread-safe, and more calls require more locking.

(GCC 4.5.2 on Ubuntu 11.04 with -O2 optimization, several executions of each version, flushing disk caches between runs, i7 processor.)

Thomas Padron-McCarthy