3

I'm trying to solve some simple problem. There is an input file equal.in. It consists of two lines: first one contains the number N of numbers in next line. Second line contains N numbers separated by single space. N is not greater than 3 * 10^5, each number is not greater than 10^9.

I'm trying to solve this program using C language. I've already made it in python like in 1 minute or so, however i struggle in C. I've made a function read_file, which should return a pointer to the array of long numbers and also change the value of size variable. The program runs smoothly while the number N is less than 10^4, when it's above that, the array is filled with zeroes except the first element. What am I doing wrong?

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

#define MAX_NUMBERS 300000
#define MAX_VALUE 1000000000

long* read_file(char*, long*);
int number_width(int);

int main() {
    long i, size;
    long* numbers = read_file("equal.in", &size);
    printf("Size: %d\n", size);
    printf("Array: ");
    for (i = 0; i < size; i++) {
        printf("%d ", numbers[i]);
    }
    printf("\n");
    free(numbers);
    return 0;
}

long* read_file(char* filename, long* size) {
    FILE* in_file = fopen(filename, "r");
    long l1_width = number_width(MAX_NUMBERS);
    char* line1 = malloc(sizeof(char) * l1_width);
    fgets(line1, l1_width, in_file);
    char *ptr;
    *size = strtol(line1, &ptr, 10);
    free(line1);
    long* numbers = malloc(sizeof(long) * *size);
    long l2_width = (number_width(MAX_VALUE) + 1) * *size;
    char* line2 = malloc(sizeof(char) * l2_width);
    fgets(line2, l2_width, in_file);
    char* token = strtok(line2, " ");
    int i = 0;
    while (token != NULL) {
        *(numbers+i) = strtol(token, &ptr, 10);
        token = strtok(NULL, " ");
        i++;
    }
    free(line2);
    fclose(in_file);
    return numbers;
}

int number_width(int n) {
    if (n < 0) n = (n == INT_MIN) ? INT_MAX : -n;
    if (n < 10) return 1;
    if (n < 100) return 2;
    if (n < 1000) return 3;
    if (n < 10000) return 4;
    if (n < 100000) return 5;
    if (n < 1000000) return 6;
    if (n < 10000000) return 7;
    if (n < 100000000) return 8;
    if (n < 1000000000) return 9;
    return 10;
}

I've deleted the (in my opinion) unnecessary code. Everything else seems to work fine. The problem only happens with the large number on the first line. If related, i made a simple script in python to make the file equal.in according to the rules, so the contents of the file are ok.

  • 1
    Having so manny Malloc (and different too) calls inside one Function doesn’t look good. You should split your Program in more functions. – Michi Nov 08 '17 at 17:02
  • Concerning "huge" lines: "An implementation shall support text files with lines containing at least 254 characters, .... The value of the macro `BUFSIZ` shall be at least 256.". If your line is longer than ` BUFSIZ`, code, using a file in text mode, may not behave as expected. – chux - Reinstate Monica Nov 08 '17 at 17:11
  • I think, you should take a good, long look at the `scanf()` manpage (`man scanf`). All you need to do, is to read the first number with `scanf()`, and then loop over another call to `scanf()` to read the array. – cmaster - reinstate monica Nov 08 '17 at 17:16
  • That’s an awful lot of code unless you have to validate that the data is on two lines as stated. Even then, it’s still an awful lot of code. If you can assume (rather than being required to check) that the data is correctly formatted, then you can reduce the code to less than a dozen lines. – Jonathan Leffler Nov 08 '17 at 17:20
  • @jonathan-leffler, i wasn't trying to validate the data. How can I do it in a simpler way? – ilyas.sadykov Nov 08 '17 at 17:41
  • From a `worst` cast POV, `long l2_width = (number_width(MAX_VALUE) + 1) * *size;` is at least 1 too small. Little reason to be so tight. Recommend 2x. Not only does this handle space for the final \n, \0, it allows spaced for values like 00000000000000000001 as values are limited to 10^n - 1, not necessarily their textual representation. – chux - Reinstate Monica Nov 08 '17 at 18:49

2 Answers2

4

There is an input file equal.in. It consists of two lines: first one contains the number N of numbers in next line. Second line contains N numbers separated by single space. N is not greater than 3 * 10^5, each number is not greater than 10^9.

This makes life easy. The values fit inside 32-bit int, and 300K numbers means that you will only need 1.2M of data. You can allocate that on the stack, with a fixed size array or a variable length array. If you insist, you can allocate it with malloc() and free it later.

Unless you need to validate that the data meets the input specification, you can do the job quite simply:

#include <stdio.h>

int main(int argc, char **argv)
{
    const char *filename = "equal.in";
    if (argc == 2)
        filename = argv[1];

    FILE *fp = fopen(filename, "r");
    if (fp == 0)
    {
        fprintf(stderr, "%s: failed to open file %s for reading\n", argv[0], filename);
        return 1;
    }

    int num;
    if (fscanf(fp, "%d", &num) != 1)
    {
        fprintf(stderr, "%s: first line of file %s does not start with a number\n", argv[0], filename);
        return 1;
    }

    int data[num];
    for (int i = 0; i < num; i++)
    {
        if (fscanf(fp, "%d", &data[i]) != 1)
        {
            fprintf(stderr, "%s: failed to read entry number %d from file %s\n", argv[0], i+1, filename);
            return 1;
        }
    }

    int length = 0;
    const char *pad = "";
    for (int i = 0; i < num; i++)
    {
        length += printf("%s%d", pad, data[i]);
        pad = " ";
        if (length > 70)
        {
            putchar('\n');
            pad = "";
            length = 0;
        }
    }
    if (length > 0)
        putchar('\n');
    return 0;
}

This uses a VLA. If your compiler doesn't have support for VLAs, then you can either go with the pessimistic approach:

enum { MAX_ENTRIES = 300000 };
int data[MAX_ENTRIES];

or the dynamic approach:

int *data = malloc(num * sizeof(*data));
if (data == 0)
{
    fprintf(stderr, "%s: failed to allocate %zu bytes of memory\n",
            argv[0], num * sizeof(*data));
    return 1;
}

…

free(data);

I created a test file with 279295 entries using a home-grown random number generator:

$ random 10000 300000 | tee equal.in
27295
$ random -n $(<equal.in) 0 999999999 | tr '\n' ' ' >> equal.in
$ echo >> equal.in
$ wc equal.in
       2  279296 2761868 equal.in
$

The first line generates a number in the range 10000 to 300000 and writes it to both equal.in and the terminal. The second line generates that many numbers (-n $(<equal.in) — there's a Bash-ism there) in the range 0 to 999,999,999, writing one number per line. The tr command maps the newlines to blanks; the final echo adds a newline to the end of the file. The wc reports that there are two lines and 279296 'words', meaning numbers, in the file.

I then ran the program, and got the output:

670206318 31176149 386272687 414856040 825173318 954016935 485458470 922293242
795866483 253363938 844512159 323292038 103572404 373917916 142021104 264196634
957800900 482861146 26824834 849885087 789023653 432837903 583262643 117607701
397156307 281517645 721527177 397482085 226290913 94898730 493928208 935264986
408834056 561990394 846038059 431925002 487972136 227567249 578463338 840243525
…
974659784 53079688 549147388 154574314 804309064 164345737 378554521 729437495
504219874 234692365 141938083 85093023 95609608 860865295 742893260 69909938
48374552 461946331 407898852 575861228 335672877 983186286 679276932 946629117
247591685 299343487 335924507 161837591 435945210 340851167 747313445 454000003
837746407 249404999 860823559 923922564 150303869 762266074 739320218

The longest line in the output was 79 characters (excluding the newline). With the output redirected to /dev/null, it took about 0.124 seconds to read and print that data. With the output going to screen, it took about 0.214s.

Note the error reporting, including the program name in the error reports, which are written to standard error. I avoided using exit() but would normally need to do so. In my own code, I would replace those 4-line blocks of error reporting code with single line calls to the functions from stderr.c and stderr.h available from GitHub. I'd probably also validate the argument list, objecting to more than one argument rather than simply ignore all the arguments. Like this:

#include <stdio.h>
#include "stderr.h"

int main(int argc, char **argv)
{
    err_setarg0(argv[0]);

    const char *filename = "equal.in";
    if (argc == 2)
        filename = argv[1];
    else if (argc > 2)
        err_usage("[file]");

    FILE *fp = fopen(filename, "r");
    if (fp == 0)
        err_error("failed to open file %s for reading\n", filename);

    int num;
    if (fscanf(fp, "%d", &num) != 1)
        err_error("first line of file %s does not start with a number\n", filename);

    int data[num];
    for (int i = 0; i < num; i++)
    {
        if (fscanf(fp, "%d", &data[i]) != 1)
            err_error("failed to read entry number %d from file %s\n", i+1, filename);
    }

    int length = 0;
    const char *pad = "";
    for (int i = 0; i < num; i++)
    {
        length += printf("%s%d", pad, data[i]);
        pad = " ";
        if (length > 70)
        {
            putchar('\n');
            pad = "";
            length = 0;
        }
    }
    if (length > 0)
        putchar('\n');
    return 0;
}

Note that reporting the file name can help the user identify which file to look at. That means you should almost never use a string literal for the file name passed to fopen() because you would have to repeat it in the error reporting, which isn't good.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • 1
    Thank you a lot sir! I haven't even estimated this thorough explanation. You did it like a master. Thank you very much. – ilyas.sadykov Nov 08 '17 at 18:50
  • A short-coming to the `fscanf(fp, "%d"...` approach is the loss of `'\n'` detection and other validation as said here with "Unless you need to validate ....". OP's original `fgets()` could lend itself to better error detection and localization. Still this answer does work well with with formed file data. – chux - Reinstate Monica Nov 08 '17 at 18:54
  • @chux: Agreed, but the whole answer is predicated on "unless you need to validate that the data meets the format specification". That said, it would not be hard to upgrade it to use POSIX [`getline()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html) to read the lines and then [use `sscanf()` in a loop](https://stackoverflow.com/questions/3975236/how-to-use-sscanf-in-loops) to process the data, remembering to check that there is no more data after the first number on the first line, that there is no more data after the last number on the second line, and no more lines. – Jonathan Leffler Nov 08 '17 at 21:46
4

The problem that you are observing is:

The program runs smoothly while the number N is less than 10^4, when it's above that, the array is filled with zeroes except the first element.

It is occurring because of this:

long l1_width = number_width(MAX_NUMBERS);
char* line1 = malloc(sizeof(char) * l1_width);
fgets(line1, l1_width, in_file);

Here the l1_width will always be 6 because the MAX_NUMBERS is 300000.

fgets:

Syntax : char * fgets ( char * str, int num, FILE * stream );

Description : Reads characters from stream and stores them as a C string into str until (num-1) characters have been read or either a newline or the end-of-file is reached, whichever happens first.

Now, consider the scenario when the first line of your file equal.in contains number less than 10^4 i.e. 1000 or less, the program works fine.

So, assume the first line of file equal.in contains 1000.

Total number of characters in the first line of file equal.in - 5 characters (1000 + newline).

For this program works fine because the memory allocated to line1 is of 6 characters and fgets reads until num - 1 (i.e. 5) characters.

So, while reading 1000 as the first line, fgets hits newline character.

But when the number is greater than or equal to 10^4, the fgets hits num - 1 before hitting newline character as the l1_width is 6.

Now the point to note here is that the rest of characters + newline character of the first line of file equal.in is yet to be read.

In next call to fgets():

fgets(line2, l2_width, in_file);

It reads the rest of characters of the first line of file equal.in and as the l2_width is a big number, fgets() hits newline of the first line of equal.in file and line2 contain remains of the first line only.

And it never happens to read the second line of file equal.in when the number is greater than or equal to 10^4 in the first line of equal.in file.

To fix this problem first, you need to return the width + 1 from number_width() function.

It should be:

int number_width(int n) {
    if (n < 0) n = (n == INT_MIN) ? INT_MAX : -n;
    if (n < 10) return 2;
    if (n < 100) return 3;
    if (n < 1000) return 4;
    if (n < 10000) return 5;
    if (n < 100000) return 6;
    if (n < 1000000) return 7;
    if (n < 10000000) return 8;
    if (n < 100000000) return 9;
    if (n < 1000000000) return 10;
    return 11;
}

Second, you need to allocate the buffer to line1 of l1_width+1:

fgets(line1, l1_width+1, in_file);

As I can see that Jonathan has already given a better way to process equal.in file. This answer is just to let you know where the problem is occurring in your code so that in future you take care of such things.

Community
  • 1
  • 1
H.S.
  • 11,654
  • 2
  • 15
  • 32
  • Another superb comment. I really appreciate your answer, it helped a lot. Now i know where i was wrong. Thank you a lot. – ilyas.sadykov Nov 08 '17 at 20:07