
Hey guys, I'm attempting to read in workersinfo.txt and store it in a two-dimensional char array. The file is around 4,000,000 lines with around 100 characters per line. I want to store each file line in the array. Unfortunately, I get exit code 139 (a segmentation fault). I'm aware I have to use malloc() and free(), but I've tried a couple of things and haven't been able to make them work. Eventually I have to sort the array by ID number, but I'm stuck on declaring the array. The file looks something like this:

First Name, Last Name,Age, ID
Carlos,Lopez,,10568
Brad, Patterson,,20586
Zack, Morris,42,05689

This is my code so far:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    FILE *ptr_file;
    char workers[4000000][1000];

    ptr_file = fopen("workersinfo.txt", "r");
    if (!ptr_file)
        perror("Error");

    int i = 0;
    while (fgets(workers[i], 1000, ptr_file) != NULL) {
        i++;
    }

    int n;
    for (n = 0; n < 4000000; n++) {
        printf("%s", workers[n]);
    }

    fclose(ptr_file);
    return 0;
}
  • change `char workers[4000000][1000];` to `static char workers[4000000][1000];`. And make sure you are compiling for a 64-bit target and your system has well over 4 GB of RAM – M.M Mar 11 '16 at 01:58
  • Possible duplicate of [Segmentation fault on large array sizes](http://stackoverflow.com/questions/1847789/segmentation-fault-on-large-array-sizes) – zneak Mar 11 '16 at 02:00
  • Why do you allocate as much as 1000 bytes when each line is around 100 characters? For supporting Unicode characters represented by many bytes? – MikeCAT Mar 11 '16 at 02:05
  • What's the actual file size? And how much RAM does your computer have? – user3386109 Mar 11 '16 at 02:09
  • Yes, I should change the second dimension to 100 instead. The file size is 160 MB and I have 6 GB of RAM – David Mar 11 '16 at 02:26
  • You are trying to allocate an array of 4,000,000 arrays of 1,000 characters each. That's 4 GB of space! I can't conceive of a simple application that would need anywhere near that kind of memory. You really should take the time to learn how to use `malloc` and `free` - it will save you a heckuva lot of trouble! – templatetypedef Mar 11 '16 at 02:36
  • I changed it to 100 characters each, so that should be under 2 GB of memory. Either way, yes, I'm trying to figure out how to use malloc and free – David Mar 11 '16 at 02:52
  • Also, since I'm attempting to sort by ID, should I make a structure instead? – David Mar 11 '16 at 03:02
  • It would probably be more reasonable to read and process the file in chunks, rather than reading up to 400 MB of data in one go – Mac O'Brien Mar 11 '16 at 03:11

1 Answer


Stack memory is limited. As you pointed out in your question, you MUST use `malloc` to allocate such a big (need I say HUGE) chunk of memory, as the stack cannot contain it.
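
For illustration, here's a minimal sketch of that heap allocation (a pointer to an array of rows, so the `workers[i]` indexing from your code keeps working; the 100-byte row size follows the discussion in the comments):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
  // ~400 MB on the heap: 4,000,000 rows of 100 bytes each
  char (*workers)[100] = malloc(4000000 * sizeof *workers);
  if (!workers) {
    perror("malloc");
    return 1;
  }
  // ... read the file with fgets(workers[i], 100, fp) and use the data ...
  free(workers);  // release the whole block when done
  return 0;
}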

You can use `ulimit` to review the limits of your system (usually including the stack size limit).

On my Mac, the limit is 8 MB. After running `ulimit -a` I get:

...
stack size              (kbytes, -s) 8192
...

Or, test the limit using:

struct rlimit rlim;  // from <sys/resource.h>
getrlimit(RLIMIT_STACK, &rlim);
// rlim.rlim_cur holds the current (soft) stack size limit
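
A self-contained version of that check might look like this (assuming a POSIX system):

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
  struct rlimit rlim;
  if (getrlimit(RLIMIT_STACK, &rlim) != 0) {
    perror("getrlimit");
    return 1;
  }
  // rlim_cur is the soft limit; rlim_max is the hard ceiling
  printf("stack soft limit: %llu bytes\n", (unsigned long long)rlim.rlim_cur);
  return 0;
}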

I truly recommend you process each database entry separately.

As mentioned in the comments, giving the array static storage duration would, in most implementations, circumvent the stack.
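
That fix (from M.M's comment, combined with the 100-byte rows suggested later) is a one-line change:

// static storage duration - the ~400 MB array is not placed on the stack
static char workers[4000000][100];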

Still, IMHO, allocating 400 MB of memory (or 4 GB, depending on which part of your question I look at) is bad form unless totally required - especially for a single function.


Follow-up Q1: How to deal with each DB entry separately

I hope I'm not doing your homework or anything... but I doubt your homework would include an assignment to load 400 MB of data into the computer's memory... so... to answer the question in your comment:

The following sketch of single-entry processing isn't perfect - it's limited to 1 KB of data per entry (which I thought to be more than enough for such simple data).

Also, I didn't allow for UTF-8 encoding or anything like that (I assumed English would be used).

As you can see from the code, we read each line separately and perform error checks to make sure the data is valid.

To sort the file by ID, you might consider either reading two lines at a time and sorting them in place (this would be a slow sort), or creating a sorted node tree with the ID data and the position of each line in the file (get the position before reading the line). Once you have sorted the tree, you can sort the data...

... The binary tree might get a bit big, though. Did you look up sorting algorithms?

#include <stdio.h>

// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
  char* last_name;        // a pointer to the last name
  char* age;              // a pointer to the age - could probably be an int
  char* id;               // a pointer to the ID
  char first_name[1024];  // the actual buffer...
  // I unified the first name and the buffer since the first name is first.
};

// each time, read only a single line, perform an error check for overflow
// and return the parsed data.
//
// returns 1 on success or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
  if (!fgets(line->first_name, 1024, fp))
    return 0;
  // parse data and review for possible overflow.

  // first, zero out data
  int pos = 0;
  line->age = NULL;
  line->id = NULL;
  line->last_name = NULL;

  // read each byte, looking for the EOL marker and the ',' separators
  while (pos < 1024 && line->first_name[pos] != '\0') {
    if (line->first_name[pos] == ',') {
      // we encountered a divider. we should handle it.

      // if the ID field's location is already known, we have an excess comma.
      if (line->id) {
        fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
        return 0;
      }
      // replace the comma with 0 (separate the strings)
      line->first_name[pos] = 0;
      if (line->age)
        line->id = line->first_name + pos + 1;
      else if (line->last_name)
        line->age = line->first_name + pos + 1;
      else
        line->last_name = line->first_name + pos + 1;
    } else if (line->first_name[pos] == '\n') {
      // we encountered a terminator. we should handle it.
      if (line->id) {
        // if we have the id string's position (the start marker), this is a
        // valid entry and we should process the data.
        line->first_name[pos] = 0;
        return 1;
      } else {
        // we reached an EOL without enough ',' separators, this is an invalid
        // line.
        fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
        return 0;
      }
    }
    pos++;
  }
  // we ran out of buffer (or hit the end of the data) without an EOL marker...
  fprintf(stderr,
          "Parsing error, invalid data (data overflow or data too large).\n");
  return 0;
}

// the main program
int main(int argc, char const* argv[]) {
  // open file
  FILE* ptr_file;
  ptr_file = fopen("workersinfo.txt", "r");
  if (!ptr_file) {
    perror("File Error");
    return 1;
  }

  struct DBEntry line;

  while (read_db_line(ptr_file, &line)) {
    // do what you want with the data... print it?
    printf(
        "First name:\t%s\n"
        "Last name:\t%s\n"
        "Age:\t\t%s\n"
        "ID:\t\t%s\n"
        "--------\n",
        line.first_name, line.last_name, line.age, line.id);
  }

  // close file
  fclose(ptr_file);
  return 0;
}

Follow-up Q2: Sorting an array of 400 MB-4 GB of data

IMHO, 400 MB is already touching on the issues related to big data. For example, a bubble sort on your database would be agonizing performance-wise - it's O(n²), on the order of 10^13 comparisons for 4 million lines (unless it's a one-time task, where performance might not matter).

Creating an array of DBEntry objects will eventually give you a larger memory footprint than the actual data...

This will not be the optimal way to sort large data.

The correct approach will depend on your sorting algorithm. Wikipedia has a decent primer on sorting algorithms.

Since we are handling a large amount of data, there are a few things to consider:

  1. It would make sense to partition the work, so different threads/processes sort a different section of the data.

  2. We will need to minimize IO to the hard drive (as it will slow the sorting significantly and prevent parallel processing on the same machine/disk).

One possible approach is to create a heap for a heap sort, but only storing a priority value and the entry's original position in the file.
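
As a rough sketch of that idea (a binary min-heap keyed on the numeric ID, storing only the file position alongside it - the names here are illustrative and not part of the demo further down):

#include <stdio.h>
#include <stdlib.h>

// one heap entry: the computed sort value and the line's position in the file
struct HeapEntry {
  long value;
  fpos_t position;
};

struct MinHeap {
  struct HeapEntry* data;
  size_t count, capacity;
};

// push an entry onto the heap, sifting up to keep the min-heap property
static void heap_push(struct MinHeap* h, struct HeapEntry e) {
  if (h->count == h->capacity) {  // grow the backing array as needed
    h->capacity = h->capacity ? h->capacity * 2 : 1024;
    h->data = realloc(h->data, h->capacity * sizeof *h->data);
    if (!h->data)
      exit(1);  // out of memory - just bail out in this sketch
  }
  size_t i = h->count++;
  while (i > 0 && h->data[(i - 1) / 2].value > e.value) {
    h->data[i] = h->data[(i - 1) / 2];  // pull the larger parent down
    i = (i - 1) / 2;
  }
  h->data[i] = e;
}

// pop the smallest entry, sifting the last element down from the root
static struct HeapEntry heap_pop(struct MinHeap* h) {
  struct HeapEntry top = h->data[0];
  struct HeapEntry last = h->data[--h->count];
  size_t i = 0;
  for (;;) {
    size_t child = 2 * i + 1;
    if (child >= h->count)
      break;
    if (child + 1 < h->count && h->data[child + 1].value < h->data[child].value)
      child++;  // pick the smaller of the two children
    if (h->data[child].value >= last.value)
      break;
    h->data[i] = h->data[child];  // pull the smaller child up
    i = child;
  }
  if (h->count)
    h->data[i] = last;
  return top;
}

Pushing every (ID, position) pair and then popping until the heap is empty yields the line positions in ID order.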

Another option would be to employ a divide-and-conquer algorithm, such as quicksort - again, only sorting a computed sort value and the entry's position in the original file.
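
A minimal sketch of that variant, using the standard library's `qsort` (again, the names are illustrative):

#include <stdio.h>
#include <stdlib.h>

// one record per line: the numeric ID and where the line starts in the file
struct SortKey {
  long value;
  fpos_t position;
};

// comparison callback for qsort - orders records by their ID value
static int cmp_keys(const void* a, const void* b) {
  const struct SortKey* ka = a;
  const struct SortKey* kb = b;
  return (ka->value > kb->value) - (ka->value < kb->value);
}

// after collecting `key_count` records into a `keys` array, one call sorts it:
//   qsort(keys, key_count, sizeof(struct SortKey), cmp_keys);

Rewriting the file is then a matter of walking the sorted array and copying each line by its stored position, much like write_line does in the demo below.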

Either way, writing a decent sorting method will be a complicated task, probably involving threading, forking, tempfiles or other techniques.

Here's a simplified demo code... it is far from optimized, but it demonstrates the idea of a sorted node list (an insertion sort, really) that holds the sorting value and the position of the data in the file.

Be aware that using this code will be both relatively slow (although not that slow) and memory intensive...

On the other hand, it will require about 24 bytes per entry. For 4 million entries, that's 96 MB - somewhat better than 400 MB and definitely better than the 4 GB.

#include <stdlib.h>
#include <stdio.h>

// assuming this is the file structure:
//
//    First Name, Last Name,Age, ID
//    Carlos,Lopez,,10568
//    Brad, Patterson,,20586
//    Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
  char* last_name;        // a pointer to the last name
  char* age;              // a pointer to the age - could probably be an int
  char* id;               // a pointer to the ID
  char first_name[1024];  // the actual buffer...
  // I unified the first name and the buffer since the first name is first.
};

// this might be a sorting node for a sorted linked list:
struct SortNode {
  struct SortNode* next;  // a pointer to the next node
  fpos_t position;        // the DB entry's position in the file
  long value;             // The computed sorting value
}* top_sorting_node = NULL;

// this function will free all the memory used by the global sorting list
void clear_sort_heap(void) {
  struct SortNode* node;
  // as long as there is a first node...
  while ((node = top_sorting_node)) {
    // step forward.
    top_sorting_node = top_sorting_node->next;
    // free the original first node's memory
    free(node);
  }
}
// each time, read only a single line, perform an error check for overflow
// and return the parsed data.
//
// returns 0 on success, or -1 when there is no more data to read.
// parsing errors print a message, free the sorting list and exit.
int read_db_line(FILE* fp, struct DBEntry* line) {
  if (!fgets(line->first_name, 1024, fp))
    return -1;
  // parse data and review for possible overflow.

  // first, zero out data
  int pos = 0;
  line->age = NULL;
  line->id = NULL;
  line->last_name = NULL;

  // read each byte, looking for the EOL marker and the ',' separators
  while (pos < 1024 && line->first_name[pos] != '\0') {
    if (line->first_name[pos] == ',') {
      // we encountered a divider. we should handle it.

      // if the ID field's location is already known, we have an excess comma.
      if (line->id) {
        fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
        clear_sort_heap();
        exit(2);
      }
      // replace the comma with 0 (separate the strings)
      line->first_name[pos] = 0;
      if (line->age)
        line->id = line->first_name + pos + 1;
      else if (line->last_name)
        line->age = line->first_name + pos + 1;
      else
        line->last_name = line->first_name + pos + 1;
    } else if (line->first_name[pos] == '\n') {
      // we encountered a terminator. we should handle it.
      if (line->id) {
        // if we have the id string's position (the start marker), this is a
        // valid entry and we should process the data.
        line->first_name[pos] = 0;
        return 0;
      } else {
        // we reached an EOL without enough ',' separators, this is an invalid
        // line.
        fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
        clear_sort_heap();
        exit(1);
      }
    }
    pos++;
  }
  // we ran out of buffer (or hit the end of the data) without an EOL marker...
  fprintf(stderr,
          "Parsing error, invalid data (data overflow or data too large).\n");
  clear_sort_heap();
  exit(3);
}

// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
  // allocate the memory for the node - use calloc to get zeroed-out data
  struct SortNode* node = calloc(1, sizeof(*node));
  if (!node) {
    perror("Memory Error");
    clear_sort_heap();
    exit(4);
  }
  // store the position in the file
  fgetpos(fp, &node->position);
  // use a stack allocated DBEntry for processing
  struct DBEntry line;
  // check that a line was read (read_db_line returns -1 when the file ends)
  if (read_db_line(fp, &line)) {
    // free the node's memory
    free(node);
    // return no data (0)
    return 0;
  }
  // compute sorting value - I'll assume all IDs are numbers up to long size.
  sscanf(line.id, "%ld", &node->value);

  // an insertion sort into the sorted list:

  // This is a questionable sort algorithm... or a questionable implementation
  // (it's O(n^2) overall).
  // Also, I'll be using pointers to pointers, so it might be a headache to read
  // (it's a headache to write, too...) ;-)
  struct SortNode** tmp = &top_sorting_node;
  // move along the list until we encounter a node with a larger value,
  // OR until the list is finished.
  while (*tmp && (*tmp)->value <= node->value)
    tmp = &((*tmp)->next);
  // update the node's `next` value.
  node->next = *tmp;
  // inject the new node into the tree at the position we found
  *tmp = node;
  // return 1 (data was read and sorted)
  return 1;
}

// writes the next line in the sorted order, copying it from `from` to `to`
int write_line(FILE* to, FILE* from) {
  struct SortNode* node = top_sorting_node;
  if (!node)   // are we done? top_sorting_node == NULL ?
    return 0;  // return 0 - no data to write
  // step top_sorting_node forward
  top_sorting_node = top_sorting_node->next;
  // move to the line's position in the source file
  fsetpos(from, &node->position);
  free(node);  // the node was popped off the list, so free it here
  // copy the line from one file to the other
  char* buffer = NULL;
  size_t buff_size = 0;
  ssize_t length = getline(&buffer, &buff_size, from);
  if (length <= 0) {
    perror("Line Copy Error - Couldn't read data");
    free(buffer);
    return 0;
  }
  fwrite(buffer, 1, length, to);
  free(buffer);  // getline allocates memory that we're in charge of freeing.
  return 1;
}

// the main program
int main(int argc, char const* argv[]) {
  // open file
  FILE *fp_read, *fp_write;
  fp_read = fopen("workersinfo.txt", "r");
  fp_write = fopen("sorted_workersinfo.txt", "w+");
  if (!fp_read) {
    perror("File Error");
    goto cleanup;
  }
  if (!fp_write) {
    perror("File Error");
    goto cleanup;
  }

  printf("\nSorting");
  while (sort_line(fp_read))
    printf(".");
  // write all sorted data to a new file
  printf("\n\nWriting sorted data");
  while (write_line(fp_write, fp_read))
    printf(".");
// clean up - close the files (if open) and make sure the sorting list is cleared
cleanup:
  printf("\n");
  if (fp_read)
    fclose(fp_read);
  if (fp_write)
    fclose(fp_write);
  clear_sort_heap();
  return 0;
}
Myst
  • What do you mean by processing each database entry separately? – David Mar 11 '16 at 02:43
  • @Isaac - I edited my answer. You can read through the example code to get the gist of it, I guess. – Myst Mar 11 '16 at 03:30
  • Thank you so much – David Mar 11 '16 at 04:31
  • I realize we look at each line individually and store the remaining part of the string in the next struct variable after each comma. After a line, I want to store the parsed data into the respective array variable. Would I have to create a line1 of type DBEntry, line2 of type DBEntry and so on? For example: `struct DBEntry line`, `struct DBEntry line1`, `struct DBEntry line2` for all 4,000,000 lines? – David Mar 11 '16 at 23:16
  • @Isaac, I edited the answer for your added question about storing the data in an array (an array is bad, a [binary tree](https://en.wikipedia.org/wiki/Binary_tree) is good)... You can also see the example code and the use of `malloc` and `free`. – Myst Mar 12 '16 at 01:14
  • I will look into sorting algorithms right now. And ultimately, I'm looking at partitioning the data with fork() and exec() – David Mar 12 '16 at 01:23
  • Good luck @Isaac! P.S. I'm curious to know how long my demo sort takes to sort all the data. If you ever measure, let me know :-) – Myst Mar 12 '16 at 01:31
  • I totally will measure this weekend and will let you know! – David Mar 12 '16 at 01:57