The Stack memory is limited. As you pointed out in your question, you MUST use malloc to allocate such a big (need I say HUGE) chunk of memory, as the stack cannot contain it.
you can use ulimit
to review the limits of your system (usually including the stack size limit).
On my Mac, the limit is 8Mb. After running ulimit -a
I get:
...
stack size (kbytes, -s) 8192
...
Or, test the limit using:
struct rlimit slim;
getrlimit(RLIMIT_STACK, &rlim);
rlim.rlim_cur // the stack limit
I truly recommend you process each database entry separately.
As mentioned in the comments, assigning the memory as static memory would, in most implementations, circumvent the stack.
Still, IMHO, allocating 400MB of memory (or 4GB, depending which part of your question I look at), is bad form unless totally required - especially for a single function.
Follow-up Q1: How to deal with each DB entry separately
I hope I'm not doing your homework or anything... but I doubt your homework would include an assignment to load 400Mb of data to the computer's memory... so... to answer the question in your comment:
The following sketch of single entry processing isn't perfect - it's limited to 1Kb of data per entry (which I thought to be more then enough for such simple data).
Also, I didn't allow for UTF-8 encoding or anything like that (I followed the assumption that English would be used).
As you can see from the code, we read each line separately and perform error checks to check that the data is valid.
To sort the file by ID, you might consider either running two lines at a time (this would be a slow sort) and sorting them, or creating a sorted node
tree with the ID data and the position of the line in the file (get the position before reading the line). Once you sorted the binary tree, you can sort the data...
... The binary tree might get a bit big. did you look up sorting algorithms?
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the name - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 1 on sucesss or 0 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return 0;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' seperators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a devider. we should handle it.
// if the ID feild's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
return 0;
}
// replace the comma with 0 (seperate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's possition (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 1;
} else {
// we reached an EOL without enough ',' seperators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
return 0;
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE* ptr_file;
ptr_file = fopen("workersinfo.txt", "r");
if (!ptr_file)
perror("File Error");
struct DBEntry line;
while (read_db_line(ptr_file, &line)) {
// do what you want with the data... print it?
printf(
"First name:\t%s\n"
"Last name:\t%s\n"
"Age:\t\t%s\n"
"ID:\t\t%s\n"
"--------\n",
line.first_name, line.last_name, line.age, line.id);
}
// close file
fclose(ptr_file);
return 0;
}
Followup Q2: Sorting array for 400MB-4GB of data
IMHO, 400MB is already touching on the issues related to big data. For example, implementing a bubble sort on your database should be agonizing as far as performance goes (unless it's a single time task, where performance might not matter).
Creating an Array of DBEntry
objects will eventually get you a larger memory foot-print then the actual data..
This will not be the optimal way to sort large data.
The correct approach will depend on your sorting algorithm. Wikipedia has a decent primer on sorting algorythms.
Since we are handling a large amount of data, there are a few things to consider:
It would make sense to partition the work, so different threads/processes sort a different section of the data.
We will need to minimize IO to the hard drive (as it will slow the sorting significantly and prevent parallel processing on the same machine/disk).
One possible approach is to create a heap for a heap sort, but only storing a priority value and storing the original position in the file.
Another option would probably be to employ a divide and conquer algorithm, such as quicksort, again, only sorting a computed sort value and the entry's position in the original file.
Either way, writing a decent sorting method will be a complicated task, probably involving threading, forking, tempfiles or other techniques.
Here's a simplified demo code... it is far from optimized, but it demonstrates the idea of the binary sort-tree that holds the sorting value and the position of the data in the file.
Be aware that using this code will be both relatively slow (although not that slow) and memory intensive...
On the other hand, it will require about 24 bytes per entry. For 4 million entries, it's 96MB, somewhat better then 400Mb and definitely better then the 4GB.
#include <stdlib.h>
#include <stdio.h>
// assuming this is the file structure:
//
// First Name, Last Name,Age, ID
// Carlos,Lopez,,10568
// Brad, Patterson,,20586
// Zack, Morris,42,05689
//
// Then this might be your data structure per line:
struct DBEntry {
char* last_name; // a pointer to the last name
char* age; // a pointer to the name - could probably be an int
char* id; // a pointer to the ID
char first_name[1024]; // the actual buffer...
// I unified the first name and the buffer since the first name is first.
};
// this might be a sorting node for a sorted bin-tree:
struct SortNode {
struct SortNode* next; // a pointer to the next node
fpos_t position; // the DB entry's position in the file
long value; // The computed sorting value
}* top_sorting_node = NULL;
// this function will free all the memory used by the global Sorting tree
void clear_sort_heap(void) {
struct SortNode* node;
// as long as there is a first node...
while ((node = top_sorting_node)) {
// step forward.
top_sorting_node = top_sorting_node->next;
// free the original first node's memory
free(node);
}
}
// each time you read only a single line, perform an error check for overflow
// and return the parsed data.
//
// return 0 on sucesss or 1 on failure.
int read_db_line(FILE* fp, struct DBEntry* line) {
if (!fgets(line->first_name, 1024, fp))
return -1;
// parse data and review for possible overflow.
// first, zero out data
int pos = 0;
line->age = NULL;
line->id = NULL;
line->last_name = NULL;
// read each byte, looking for the EOL marker and the ',' seperators
while (pos < 1024) {
if (line->first_name[pos] == ',') {
// we encountered a devider. we should handle it.
// if the ID feild's location is already known, we have an excess comma.
if (line->id) {
fprintf(stderr, "Parsing error, invalid data - too many fields.\n");
clear_sort_heap();
exit(2);
}
// replace the comma with 0 (seperate the strings)
line->first_name[pos] = 0;
if (line->age)
line->id = line->first_name + pos + 1;
else if (line->last_name)
line->age = line->first_name + pos + 1;
else
line->last_name = line->first_name + pos + 1;
} else if (line->first_name[pos] == '\n') {
// we encountered a terminator. we should handle it.
if (line->id) {
// if we have the id string's possition (the start marker), this is a
// valid entry and we should process the data.
line->first_name[pos] = 0;
return 0;
} else {
// we reached an EOL without enough ',' seperators, this is an invalid
// line.
fprintf(stderr, "Parsing error, invalid data - not enough fields.\n");
clear_sort_heap();
exit(1);
}
}
pos++;
}
// we ran through all the data but there was no EOL marker...
fprintf(stderr,
"Parsing error, invalid data (data overflow or data too large).\n");
return 0;
}
// read and sort a single line from the database.
// return 0 if there was no data to sort. return 1 if data was read and sorted.
int sort_line(FILE* fp) {
// allocate the memory for the node - use calloc for zero-out data
struct SortNode* node = calloc(sizeof(*node), 1);
// store the position on file
fgetpos(fp, &node->position);
// use a stack allocated DBEntry for processing
struct DBEntry line;
// check that the read succeeded (read_db_line will return -1 on error)
if (read_db_line(fp, &line)) {
// free the node's memory
free(node);
// return no data (0)
return 0;
}
// compute sorting value - I'll assume all IDs are numbers up to long size.
sscanf(line.id, "%ld", &node->value);
// heap sort?
// This is a questionable sort algorythm... or a questionable implementation.
// Also, I'll be using pointers to pointers, so it might be a headache to read
// (it's a headache to write, too...) ;-)
struct SortNode** tmp = &top_sorting_node;
// move up the list until we encounter something we're smaller then us,
// OR untill the list is finished.
while (*tmp && (*tmp)->value <= node->value)
tmp = &((*tmp)->next);
// update the node's `next` value.
node->next = *tmp;
// inject the new node into the tree at the position we found
*tmp = node;
// return 1 (data was read and sorted)
return 1;
}
// writes the next line in the sorting
int write_line(FILE* to, FILE* from) {
struct SortNode* node = top_sorting_node;
if (!node) // are we done? top_sorting_node == NULL ?
return 0; // return 0 - no data to write
// step top_sorting_node forward
top_sorting_node = top_sorting_node->next;
// read data from one file to the other
fsetpos(from, &node->position);
char* buffer = NULL;
ssize_t length;
size_t buff_size = 0;
length = getline(&buffer, &buff_size, from);
if (length <= 0) {
perror("Line Copy Error - Couldn't read data");
return 0;
}
fwrite(buffer, 1, length, to);
free(buffer); // getline allocates memory that we're incharge of freeing.
return 1;
}
// the main program
int main(int argc, char const* argv[]) {
// open file
FILE *fp_read, *fp_write;
fp_read = fopen("workersinfo.txt", "r");
fp_write = fopen("sorted_workersinfo.txt", "w+");
if (!fp_read) {
perror("File Error");
goto cleanup;
}
if (!fp_write) {
perror("File Error");
goto cleanup;
}
printf("\nSorting");
while (sort_line(fp_read))
printf(".");
// write all sorted data to a new file
printf("\n\nWriting sorted data");
while (write_line(fp_write, fp_read))
printf(".");
// clean up - close files and make sure the sorting tree is cleared
cleanup:
printf("\n");
fclose(fp_read);
fclose(fp_write);
clear_sort_heap();
return 0;
}