
Latest update: my classmate uses fread() to read about one third of the whole file into a string at a time, which avoids running out of memory. Then process that string and separate it into your data structure. Note that you have to handle one problem: the last few characters at the end of such a string may not form a complete number, so you need a way to detect this situation and carry those characters over to the front of the next string. Each number corresponds to a different variable in your data structure. Keep your data structure very simple, because inserting data into a data structure is slow; most of the time is spent on those insertions. Therefore the fastest way to process the data is: use fread() to read the file into a string, then separate that string into plain one-dimensional arrays. For example (just an example, not from my project), I have a text file like:

72     24      20
22     14      30
23     35      40
42     29      50
19     22      60
18     64      70
 .
 .
 .

Each row is one person's information. The first column is the person's age, the second column is his deposit, and the third is his wife's age. We then use fread() to read this text file into a string, and I use strtok() to separate it (you can use a faster way to split it). Don't use a record data structure to store the separated data! I mean, don't do this:

struct person
{
    int age;
    int deposit;
    int wife_age;
};
struct person *my_data_store;
my_data_store=malloc(sizeof(struct person)*length_of_this_array);
//then insert separated data into my_data_store

Don't store the data in an array of structs! The fastest way to store your data is in separate arrays, like this:

int *age;
int *deposit;
int *wife_age;

age=(int*)malloc(sizeof(int)*age_array_length);
deposit=(int*)malloc(sizeof(int)*deposit_array_length);
wife_age=(int*)malloc(sizeof(int)*wife_array_length);
// age_array_length, deposit_array_length and wife_array_length are all just the number of
// lines, which you can get with `wc -l` (you can invoke `wc -l` from your C program).
// Then insert the separated values into these arrays as strtok() produces them.
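Here is a minimal sketch of the chunked read with boundary carry-over described in the update above. The 4 MiB chunk size is arbitrary, and parse_lines() is only a placeholder for whatever splits the complete lines into the arrays above (strtok(), strtol(), or a hand-rolled parser).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE (4 * 1024 * 1024)      /* 4 MiB per fread() call (arbitrary) */

/* Placeholder: split `text` (complete lines only, NUL-terminated) into the
 * age/deposit/wife_age arrays, e.g. with strtok() or strtol(). */
static void parse_lines(char *text)
{
    (void)text;
}

static void read_in_chunks(FILE *fp)
{
    char  *buf   = malloc(CHUNK_SIZE + 1);
    size_t carry = 0;                     /* bytes kept from the previous chunk */

    for (;;) {
        size_t got = fread(buf + carry, 1, CHUNK_SIZE - carry, fp);
        size_t len = carry + got;
        if (len == 0)
            break;                        /* nothing left to process */
        buf[len] = '\0';

        if (got == 0) {                   /* EOF: whatever is left is the last line */
            parse_lines(buf);
            break;
        }

        /* Everything after the last newline may be a partial number. */
        char *last_nl = strrchr(buf, '\n');
        if (last_nl == NULL) {            /* assumes a line always fits in one chunk */
            carry = len;
            continue;
        }

        *last_nl = '\0';
        parse_lines(buf);                 /* parse only the complete lines */

        carry = len - (size_t)(last_nl + 1 - buf);
        memmove(buf, last_nl + 1, carry); /* move the partial tail to the front */
    }
    free(buf);
}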

The second update: the best way is to use fread() to read part of the file into a string, then separate that string into your data structure. By the way, don't use any standard library function that converts a string to an integer; that is too slow, e.g. fscanf() or atoi(). We should write our own function to convert a string into an integer. Not only that, we should design an even simpler data structure to store the data. By the way, my classmate can read this 1.7 GB file within 7 seconds, and there is a way to do that which is much better than using multiple threads. I haven't seen his code yet; after I see it, I will post a third update to explain how he did it. That will be two months from now, after our course finishes.
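A hand-written conversion of the kind meant here might look like the sketch below. It assumes non-negative decimal numbers separated by whitespace and does no error checking, which is exactly why it can be cheaper than atoi() or fscanf().

/* Sketch only: no sign handling, no overflow or error checking. */
static const char *parse_uint(const char *p, int *out)
{
    while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r')
        p++;                              /* skip separators */

    int v = 0;
    while (*p >= '0' && *p <= '9')
        v = v * 10 + (*p++ - '0');

    *out = v;
    return p;                             /* caller resumes parsing from here */
}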

Update: I used multithreading to solve this problem, and it works! Note: don't use clock() to measure elapsed time when using multiple threads; that's why I thought the execution time had increased.
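The reason is that clock() adds up CPU time across all threads, so with multithreading it can report a larger number even though the wall-clock time went down. A minimal sketch of wall-clock timing with POSIX clock_gettime():

#include <stdio.h>
#include <time.h>

/* Elapsed wall-clock time between two clock_gettime(CLOCK_MONOTONIC, ...) samples. */
static double wall_seconds(const struct timespec *t0, const struct timespec *t1)
{
    return (double)(t1->tv_sec - t0->tv_sec) + (t1->tv_nsec - t0->tv_nsec) / 1e9;
}

/* Usage:
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   ... read and parse the file, possibly on several threads ...
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   printf("%.3f s\n", wall_seconds(&t0, &t1));
 */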

One thing I want to clarify: scanning the file without storing the values into my structure takes about 20 seconds, and storing the values into my structure takes about 60 seconds. The "time of reading the file" I quote includes both: time of reading the file = scan the file + store the values into my structure. So, does anyone have suggestions for storing the values faster? (By the way, I don't have control over the input file; it is generated by our professor. I am trying to use multithreading to solve this problem; if it works, I will post the result.)

I have a file, its size is 1.7G. It looks like:

1   1427826
1   1427827
1   1750238
1   2
2   3
2   4
3   5
3   6
10  7
11  794106
.
.

and so on. The file has about ten million lines. Now I need to read this file and store the numbers in my data structure within 15 seconds. I have tried using fread() to read the whole file and then strtok() to separate each number, but it still takes 80 seconds. If I use fscanf(), it is even slower. How do I speed this up? Maybe we cannot get it under 15 seconds, but 80 seconds to read it is far too long. How can I read it as fast as possible?

Here is part of my reading code:

int Read_File(FILE *fd, int round)
{
    clock_t start_read = clock();

    int first = 0;
    int second = 0;

    fseek(fd, 0, SEEK_END);
    long int fileSize = ftell(fd);
    fseek(fd, 0, SEEK_SET);

    char *buffer = (char *)malloc(sizeof(char) * (fileSize + 1));
    long int newFileSize = fread(buffer, 1, fileSize, fd);
    buffer[newFileSize] = '\0';                      /* strtok() needs a NUL-terminated string */

    char *string_first = strtok(buffer, " \t\n");    /* first token of the file */
    char *string_second;

    while (string_first != NULL)
    {
        first=atoi(string_first);

        string_second=strtok(NULL," \t\n");

        second=atoi(string_second);

        string_first=strtok(NULL," \t\n");

        max_num= first > max_num ? first : max_num ;
        max_num= second > max_num ? second : max_num ;
        root_level=first/NUM_OF_EACH_LEVEL;

        leaf_addr=first%NUM_OF_EACH_LEVEL;

        if(root_addr[root_level][leaf_addr].node_value!=first)
        {
            root_addr[root_level][leaf_addr].node_value=first;
            root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
            root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor)); /* note: leaked, tail is overwritten with head below */

            root_addr[root_level][leaf_addr].g_credit[0]=1;

            root_addr[root_level][leaf_addr].head->neighbor_value=second;

           root_addr[root_level][leaf_addr].head->next=NULL;
            root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;
            root_addr[root_level][leaf_addr].degree=1;
        }
        else
        {
            //insert its new neighbor
            Neighbor *newNeighbor;
            newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));

            newNeighbor->neighbor_value=second;

            root_addr[root_level][leaf_addr].tail->next=newNeighbor;
            root_addr[root_level][leaf_addr].tail=newNeighbor;

            root_addr[root_level][leaf_addr].degree++;
        }
        root_level=second/NUM_OF_EACH_LEVEL;

        leaf_addr=second%NUM_OF_EACH_LEVEL;
        if(root_addr[root_level][leaf_addr].node_value!=second)
        {
            root_addr[root_level][leaf_addr].node_value=second;
            root_addr[root_level][leaf_addr].head=(Neighbor *)malloc(sizeof(Neighbor));
            root_addr[root_level][leaf_addr].tail=(Neighbor *)malloc(sizeof(Neighbor)); /* note: leaked, tail is overwritten with head below */

            root_addr[root_level][leaf_addr].head->neighbor_value=first;

            root_addr[root_level][leaf_addr].head->next=NULL;
            root_addr[root_level][leaf_addr].tail=root_addr[root_level][leaf_addr].head;

            root_addr[root_level][leaf_addr].degree=1;

            root_addr[root_level][leaf_addr].g_credit[0]=1;
        }
        else
        {
            //insert its new neighbor
            Neighbor *newNeighbor;
            newNeighbor=(Neighbor*)malloc(sizeof(Neighbor));

            newNeighbor->neighbor_value=first;

            root_addr[root_level][leaf_addr].tail->next=newNeighbor;
            root_addr[root_level][leaf_addr].tail=newNeighbor;

            root_addr[root_level][leaf_addr].degree++;
        }
    }
beasone
    Can your machine even read 1.7GB in 15 seconds? – Mysticial Apr 22 '15 at 00:13
  • Did you try reading the file line by line using `getline` ? – stdcall Apr 22 '15 at 00:13
  • "Can your machine even read 1.7GB in 15 seconds?", yes, it can. – beasone Apr 22 '15 at 00:15
  • "Did you try reading the file line by line using get line?" I haven't. Will that be faster? I will try. – beasone Apr 22 '15 at 00:15
  • 2
    *"If I use `fscanf()` it will be slower"* How do you know that? Did you try it? If so, how long did it take? – user3386109 Apr 22 '15 at 00:17
  • It seems getline() is a C++ function, but I am using C for my program. – beasone Apr 22 '15 at 00:17
  • It's not. It's a standard libc function. – stdcall Apr 22 '15 at 00:18
  • Instead of using `strtok`, try calling `strtol` repeatedly. It can return an end pointer, which you can use for the next call. – JWWalker Apr 22 '15 at 00:18
  • 2
    I'm curios, can you check how much time it takes to just read the entire file to RAM without parsing it ? – stdcall Apr 22 '15 at 00:18
  • I tried fscanf(), it is slower. It will take about 15 seconds longer than using freed() and strtok(). – beasone Apr 22 '15 at 00:18
  • I'm curios, can you check how much time it takes to just read the entire file to RAM without parsing it ?---------About 23 seconds or less without parsing number into my data structure. – beasone Apr 22 '15 at 00:19
  • But I need to parse these numbers into my array or data structure for calculating them after reading the file. – beasone Apr 22 '15 at 00:19
  • 19
    Well, if it takes 23 seconds to read the file to RAM without parsing, then you can't read + parse in less than 15 seconds, period. – Mooing Duck Apr 22 '15 at 00:27
  • `strtok` and `fscanf` should both be about as fast as you're going to get. If those aren't fast enough, then likely _nothing_ is fast enough. I suspect if we see your code though, we'll find a slowdown to remove. – Mooing Duck Apr 22 '15 at 00:29
  • I put part of my code online so that you can see that. – beasone Apr 22 '15 at 00:33
  • 5
    See http://stackoverflow.com/q/8123094/103167 Although the code is C++, the ideas (like memory-mapped I/O) work equally well in C. – Ben Voigt Apr 22 '15 at 00:42
  • @BenVoigt Do you mean use getline() to do this? – beasone Apr 22 '15 at 00:48
  • I see nothing in the above code that actually reads data into memory. How are you filling your malloc buffer? – Michael Dorgan Apr 22 '15 at 00:48
  • @beasone: No, I would not use `getline()` for this. – Ben Voigt Apr 22 '15 at 00:49
  • Sorry, I forgot to copy "long int newFileSize=fread(buffer,1,fileSize,fd);" – beasone Apr 22 '15 at 00:50
  • @beasone: Ahh, now your first mistake is clear. Read the comments below Arun's answer to understand why reading the entire file at once slows you down. – Ben Voigt Apr 22 '15 at 00:51
  • 1
    BTW, still nothing above moving data into your string buffers, as well as using unitialized memory for your first compares. This can't be your actual code can it? – Michael Dorgan Apr 22 '15 at 00:52
  • My original code is very long, so I only copy the key part of it. – beasone Apr 22 '15 at 00:52
  • 1
    ok, reading the data in chunks would allow threading to help (read a chunk, process while next chunk comes in, repeat), but if that initial fread() is taking 23 seconds, that is honestly the best you are going to be able to do. – Michael Dorgan Apr 22 '15 at 00:54
  • But the length of each line is different. I don't know what chunk length I should read each time. – beasone Apr 22 '15 at 00:56
  • 1
    Then read in MAGIC_NUMBER size - say 100k. Have another thread begin processing that data. Because your data is`\n` terminated, you should be able to detect when you reach the end of your buffer and wait for the next transfer before finishing that last line. – Michael Dorgan Apr 22 '15 at 00:58
  • @beasone: 4 megabytes at a time is a very good chunk size, I've found. It's not so large that it creates memory pressure, but plenty big enough to use bulk (fast) I/O operations. Plus, the OS will learn you are reading in this pattern and have the next chunk ready for you while you're processing (no need to start threads yourself) – Ben Voigt Apr 22 '15 at 00:58
  • BTW, not to start a holy war here, but C# does this kind of stuff pretty easily without memory allocation headache and other issues... – Michael Dorgan Apr 22 '15 at 01:00
  • 1
    A useful measure to determine if you're IO bound or CPU bound is to copy the file to a RAM disk and process it from there (Assuming you have plenty of RAM.). Compare the time with processing from local disk. If its about the same, then you're CPU bound, if not you're IO bound. It's usually only worth focusing on one of the two issues at a time. I'd expect that you're IO bound, and unless you solve that, no amount of mucking with mallocs etc is going to save you much time.. – Michael Anderson Apr 22 '15 at 04:21
  • If you have control of how you generate a file, you could compress it, for example using gzip/snappy/lz4. It will take ~ 100 times less space if not more and hence your IO reading time will dramatically improve. You'd waste a bit of time for decompression, however the gain in IO will be much-much larger. – oleksii Apr 22 '15 at 11:53

5 Answers


Some suggestions:

a) Consider converting (or pre-processing) the file into a binary format; with the aim to minimise the file size and also drastically reduce the cost of parsing. I don't know the ranges for your values, but various techniques (e.g. using one bit to tell if the number is small or large and storing the number as either a 7-bit integer or a 31-bit integer) could halve the file IO (and double the speed of reading the file from disk) and slash parsing costs down to almost nothing. Note: For maximum effect you'd modify whatever software created the file in the first place.

b) Reading the entire file into memory before you parse it is a mistake. It doubles the amount of RAM required (and the cost of allocating/freeing) and has disadvantages for CPU caches. Instead read a small amount of the file (e.g. 16 KiB) and process it, then read the next piece and process it, and so on; so that you're constantly reusing the same small buffer memory.

c) Use parallelism for file IO. It shouldn't be hard to read the next piece of the file while you're processing the previous piece of the file (either by using 2 threads or by using asynchronous IO).

d) Pre-allocate memory for the "neighbour" structures and remove most/all malloc() calls from your loop. The best possible case is to use a statically allocated array as a pool - e.g. Neighbor myPool[MAX_NEIGHBORS]; where malloc() can be replaced with &myPool[nextEntry++]; (a sketch follows this list). This reduces/removes the overhead of malloc() while also improving cache locality for the data itself.

e) Use parallelism for storing values. For example, you could have multiple threads where the first thread handles all the cases where root_level % NUM_THREADS == 0, the second thread handles all cases where root_level % NUM_THREADS == 1, etc.

With all of the above (assuming a modern 4-core CPU), I think you can get the total time (for reading and storing) down to less than 15 seconds.
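A minimal sketch of the pool from suggestion (d), using the names given there; it assumes the Neighbor type from the question, and MAX_NEIGHBORS is a placeholder that would have to be a safe upper bound on the number of list nodes (for this input, roughly twice the line count):

/* Assumes the Neighbor type from the question. */
#define MAX_NEIGHBORS 25000000            /* placeholder upper bound on list nodes */

static Neighbor myPool[MAX_NEIGHBORS];    /* one large static pool */
static size_t   nextEntry = 0;

/* Drop-in replacement for malloc(sizeof(Neighbor)) inside the parsing loop;
 * no bounds check here, to keep the sketch short. */
static Neighbor *alloc_neighbor(void)
{
    return &myPool[nextEntry++];
}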

Brendan
  • This is old, but interestingly, use 64KB - most caches are 64KB aligned. The performance boost will surprise you ;) – David Lannan Mar 11 '20 at 00:51

My suggestion would be to form a processing pipeline and thread it. Reading the file is an I/O bound task and parsing it is CPU bound. They can be done at the same time in parallel.
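As an illustration of the pipeline idea (not code from the answer): one reader thread fills one of two buffers with fread() while the main thread parses the other, then they swap. The 4 MiB chunk size and the parse callback are assumptions, and chunk-boundary handling and error checks are left out.

#include <pthread.h>
#include <stdio.h>

#define CHUNK_SIZE (4 * 1024 * 1024)

typedef struct {
    char   data[CHUNK_SIZE];
    size_t len;                           /* bytes read; 0 signals end of file */
} Chunk;

static Chunk           chunks[2];
static int             filled[2];         /* 1 while chunks[i] holds unparsed data */
static FILE            *input;
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

static void *reader(void *arg)            /* I/O-bound producer */
{
    (void)arg;
    for (int i = 0; ; i = 1 - i) {
        pthread_mutex_lock(&lock);
        while (filled[i])                 /* wait until the parser released this buffer */
            pthread_cond_wait(&ready, &lock);
        pthread_mutex_unlock(&lock);

        size_t n = fread(chunks[i].data, 1, CHUNK_SIZE, input);

        pthread_mutex_lock(&lock);
        chunks[i].len = n;
        filled[i] = 1;
        pthread_cond_broadcast(&ready);
        pthread_mutex_unlock(&lock);

        if (n == 0)
            return NULL;                  /* EOF handed to the parser via len == 0 */
    }
}

void pipeline(FILE *fp, void (*parse)(const char *buf, size_t len))
{
    input = fp;
    pthread_t tid;
    pthread_create(&tid, NULL, reader, NULL);

    for (int i = 0; ; i = 1 - i) {        /* CPU-bound consumer */
        pthread_mutex_lock(&lock);
        while (!filled[i])                /* wait for the reader to fill this buffer */
            pthread_cond_wait(&ready, &lock);
        pthread_mutex_unlock(&lock);

        if (chunks[i].len == 0)
            break;
        parse(chunks[i].data, chunks[i].len);

        pthread_mutex_lock(&lock);
        filled[i] = 0;                    /* hand the buffer back to the reader */
        pthread_cond_broadcast(&ready);
        pthread_mutex_unlock(&lock);
    }
    pthread_join(tid, NULL);
}

With this arrangement the fread() calls overlap with the parsing, so the total time tends toward the larger of the I/O time and the parse time rather than their sum.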

Holly
  • 1
    Do you mean to use multithread to process it? Like using pthread_create() to create another thread and then do something in that thread? – beasone Apr 22 '15 at 00:24
  • Exactly. You have producer-consumer issues and thread safety to worry about, but the code is easy to do so. – Holly Apr 22 '15 at 00:25
  • 13
    If you have control over the writing of the file then another option is to use a binary rather than a text file. That will significantly reduce the file size and hence the amount of time needed for reading it. And you wouldn't have to deal with expensive string operations to parse the data. – kaylum Apr 22 '15 at 00:29
  • I don't have control over the writing of the file. – beasone Apr 22 '15 at 00:39
  • 3
    In that case, parallelisation is your best bet. In addition to what Neal wrote I would suggest not just pipelineing the read and parse but also chunking the problem. Create multiple threads and assign them different parts of the file to read and parse in parallel. May need a step to combine your parse results depending on how you do it. – kaylum Apr 22 '15 at 00:59

There are several possibilities. You'll have to experiment.

  • Exploit what your OS gives you. If Windows, check out overlapped io. This lets your computation proceed with parsing one buffer full of data while the Windows kernel fills another. Then switch buffers and continue. This is related to what @Neal suggested, but has less overhead for buffering. Windows is depositing data directly in your buffer through the DMA channel. No copying. If Linux, check out memory mapped files. Here the OS is using the virtual memory hardware to do more-or-less what Windows does with overlapping.

  • Code your own integer conversion. This is likely to be a bit faster than making a clib call per integer.

Here's example code. You want to absolutely limit the number of comparisons.

// Process one input buffer.
*end_buf = ' '; // add a sentinel at the end of the buffer
for (char *p = buf; p < end_buf; p++) {
  // somewhat unsafe (but fast) reliance on unsigned wrapping
  unsigned val = *p - '0';
  if (val <= 9) {
    // Found start of integer; accumulate its remaining digits.
    for (;;) {
      unsigned digit_val = *(++p) - '0';  // advance first: the leading digit is already in val
      if (digit_val > 9) break;
      val = 10 * val + digit_val;
    }
    ... do something with val
  }
}
  • Don't call malloc once per record. You should allocate blocks of many structs at a time.

  • Experiment with buffer sizes.

  • Crank up compiler optimizations. This is the kind of code that benefits greatly from excellent code generation.

Gene
  • 1
    [Windows can use memory mapped files](https://msdn.microsoft.com/en-us/library/windows/desktop/aa366556%28v=vs.85%29.aspx) and [linux can also use asynchronous IO](http://man7.org/linux/man-pages/man7/aio.7.html) – ratchet freak Apr 22 '15 at 08:45

Yes, standard library conversion functions are surprisingly slow.

If portability is not a problem, I'd memory-map the file. Then, something like the following C99 code (untested) could be used to parse the entire memory map:

#include <stdlib.h>
#include <errno.h>

struct pair {
    unsigned long  key;
    unsigned long  value;
};

typedef struct {
    size_t       size; /* Maximum number of items */
    size_t       used; /* Number of items used */
    struct pair  item[];
} items;

/* Initial number of items to allocate for */
#ifndef ITEM_ALLOC_SIZE
#define ITEM_ALLOC_SIZE 8388608
#endif

/* Adjustment to new size (parameter is old number of items) */
#ifndef ITEM_REALLOC_SIZE
#define ITEM_REALLOC_SIZE(from) (((from) | 1048575) + 1048577)
#endif 

items *parse_items(const void *const data, const size_t length)
{
    const unsigned char       *ptr = (const unsigned char *)data;
    const unsigned char *const end = (const unsigned char *)data + length;
    items                  *result;
    size_t                  size = ITEM_ALLOC_SIZE;
    size_t                  used = 0;
    unsigned long           val1, val2;

    result = malloc(sizeof (items) + size * sizeof (struct pair));
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    while (ptr < end) {

        /* Skip newlines and whitespace. */
        while (ptr < end && (*ptr == '\0' || *ptr == '\t' ||
                             *ptr == '\n' || *ptr == '\v' ||
                             *ptr == '\f' || *ptr == '\r' ||
                             *ptr == ' '))
            ptr++;

        /* End of data? */
        if (ptr >= end)
            break;

        /* Parse first number. */
        if (*ptr >= '0' && *ptr <= '9')
            val1 = *(ptr++) - '0';
        else {
            free(result);
            errno = ECOMM; /* Bad data! */
            return NULL;
        }
        while (ptr < end && *ptr >= '0' && *ptr <= '9') {
            const unsigned long old = val1;
            val1 = 10UL * val1 + (*(ptr++) - '0');
            if (val1 < old) {
                free(result);
                errno = EDOM; /* Overflow! */
                return NULL;
            }
        }

        /* Skip whitespace. */
        while (ptr < end && (*ptr == '\t' || *ptr == '\v' ||
                             *ptr == '\f' || *ptr == ' '))
            ptr++;
        if (ptr >= end) {
            free(result);
            errno = ECOMM; /* Bad data! */
            return NULL;
        }

        /* Parse second number. */
        if (*ptr >= '0' && *ptr <= '9')
            val2 = *(ptr++) - '0';
        else {
            free(result);
            errno = ECOMM; /* Bad data! */
            return NULL;
        }
        while (ptr < end && *ptr >= '0' && *ptr <= '9') {
            const unsigned long old = val2;
            val2 = 10UL * val2 + (*(ptr++) - '0');
            if (val2 < old) {
                free(result);
                errno = EDOM; /* Overflow! */
                return NULL;
            }
        }
        if (ptr < end) {
            /* Error unless whitespace or newline. */
            if (*ptr != '\0' && *ptr != '\t' && *ptr != '\n' &&
                *ptr != '\v' && *ptr != '\f' && *ptr != '\r' &&
                *ptr != ' ') {
                free(result);
                errno = ECOMM; /* Bad data! */
                return NULL;
            }

            /* Skip the rest of this line. */
            while (ptr < end && *ptr != '\n' && *ptr != '\r')
                ptr++;
        }

        /* Need to grow result? */
        if (used >= size) {
            items *const old = result;

            size = ITEM_REALLOC_SIZE(used);
            result = realloc(result, sizeof (items) + size * sizeof (struct pair));
            if (!result) {
                free(old);
                errno = ENOMEM;
                return NULL;
            }
        }

        result->item[used].key = val1;
        result->item[used].value = val2;
        used++;
    }

    /* Note: we could reallocate result here,
     *       if memory use is an issue.
    */

    result->size = size;
    result->used = used;

    errno = 0;
    return result;
}

I've used a similar approach to load molecular data for visualization. Such data contains floating-point values, but precision is typically only about seven significant digits, no multiprecision math needed. A custom routine to parse such data beats the standard functions by at least an order of magnitude in speed.

At least the Linux kernel is pretty good at observing memory/file access patterns; using madvise() also helps.
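For completeness, here is roughly how the memory map itself might be set up on Linux so that parse_items() above receives its pointer and length; the path handling is illustrative and most error checking is abbreviated:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

items *load_items(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    fstat(fd, &st);

    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    madvise(data, st.st_size, MADV_SEQUENTIAL);   /* hint: front-to-back access */

    items *result = parse_items(data, st.st_size);

    munmap(data, st.st_size);
    close(fd);
    return result;
}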

If you cannot use a memory map, then the parsing function would be a bit different: it would append to an existing result, and if the final line in the buffer is partial, it would indicate so (and the number of chars not parsed), so that the caller can memmove() the buffer, read more data, and continue parsing. (Use 16-byte aligned addresses for reading new data, to maximize copy speeds. You don't necessarily need to move the unread data to the exact beginning of the buffer, you see; just keep the current position in the buffered data.)

Questions?

Nominal Animal

First, what's your disk hardware? A single SATA drive is likely to be topped out at 100 MB/sec. And probably more like 50-70 MB/sec. If you're already moving data off the drive(s) as fast as you can, all the software tuning you do is going to be wasted.

If your hardware CAN support reading faster, then: first, your read pattern - reading the whole file into memory once - is the perfect use case for direct IO. Open your file using open( "/file/name", O_RDONLY | O_DIRECT );. Read into page-aligned buffers (see the man page for valloc()) in page-sized chunks. Using direct IO causes your data to bypass the double buffering in the kernel page cache, which is useless when you're reading that much data that fast and not re-reading the same data pages over and over.

If you're running on a true high-performance file system, you can read asynchronously and likely faster with lio_listio() or aio_read(). Or you can just use multiple threads to read - and use pread() so you don't have to waste time seeking - because when you read using multiple threads, a seek on an open file affects all threads trying to read from it.

And do not try to read fast into a newly-malloc'd chunk of memory - memset() it first. Because truly fast disk systems can pump data into the CPU faster than the virtual memory manager can create virtual pages for a process.
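A rough sketch of that direct-I/O setup (Linux-flavoured; O_DIRECT alignment requirements vary by filesystem, the 4 MiB buffer size is an assumption, and error handling is omitted):

#define _GNU_SOURCE                        /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGNMENT  4096                    /* typical page size */
#define BUF_SIZE   (4 * 1024 * 1024)       /* multiple of the page size */

void read_direct(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);

    void *buf = NULL;
    posix_memalign(&buf, ALIGNMENT, BUF_SIZE);   /* O_DIRECT needs aligned buffers */
    memset(buf, 0, BUF_SIZE);                    /* touch the pages before the first read */

    ssize_t got;
    while ((got = read(fd, buf, BUF_SIZE)) > 0) {
        /* parse `got` bytes from buf here */
    }

    free(buf);
    close(fd);
}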

Andrew Henle