3

I am writing a program which reads a file with almost 2 million lines. The file is in the format integer ID tab with an artist name string.

6821361 Selinsgrove High School Chorus
10151460    greek-Antique
10236365    jnr walker & the all-stars
6878792 Grieg - Kraggerud, Kjekshus
6880556 Mr. Oiseau
6906305 stars on 54 (maxi single)
10584525    Jonie Mitchel
10299729    エリス レジーナ/アントニオ カルロス ジョビン

Above is an example with some items from the file (not some lines do not follow the specific format). My program work file until it gets to the last line from the example then it endlessly prints エリス レジーナ/アントニオ カルロス ジョビ\343\203.

struct artist *read_artists(char *fname)
{
    FILE *file;
    struct artist *temp = (struct artist*)malloc(sizeof(struct artist));
    struct artist *head = (struct artist*)malloc(sizeof(struct artist));
    file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
    if(file == 0)
    {
        perror("fopen");
        exit(1);
    }
    int artist_ID;
    char artist_name[650];
    while(!feof(file))
    {
        fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);
        temp = create_play(artist_ID, artist_name, 0, -1);
        head = add_play(head, temp);
        printf("%s\n", artist_name);
    }
    fclose(file);
    //print_plays(head);
    return head;
}

Above is my code for reading from the file. Can you please help explain what is wrong?

Root0x
  • 472
  • 1
  • 9
  • 28
  • 3
    [`while (!feof(file))` is nearly always wrong.](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) Bonus comment: [do not cast the result of malloc in C.](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc) – Paul R Nov 25 '15 at 14:43
  • `while (!feof(file))` is the wrong way to read a file in C. Also, check the return value from `fscanf`! And check if there really is a newline at the end of the file. – Thomas Padron-McCarthy Nov 25 '15 at 14:44
  • 1
    \343\203\263 (that's a sequence of 3 characters expressed as an octal escape sequence) would be the UTF-8 encoding of ン, so it looks like it's missed the last byte of the string for some reason. – Ian Abbott Nov 25 '15 at 15:03
  • I'd expect `fscanf(file, "%d\t%649[^\t\n]\n", &artist_ID, artist_name);` 649 vs 65 – chux - Reinstate Monica Nov 25 '15 at 15:04
  • My professor told us to use that format string so that we only use 65 characters from that string – Root0x Nov 25 '15 at 15:06
  • 1
    @Root0x but the string it's failing on is 66 bytes long.... – Ian Abbott Nov 25 '15 at 15:10
  • 1
    @ryyker 649 is 1 less than the size of `artist_name[650]` – chux - Reinstate Monica Nov 25 '15 at 15:11

2 Answers2

3

As the comments indicate, one problem is with while(!feof(file)) The linked content will explain in detail why this is not a good idea, but in summary, quoting from one of the answers in the link:

(!feof(file))...

...is wrong because it tests for something that is irrelevant and fails to test for something that you need to know. The result is that you are erroneously executing code that assumes that it is accessing data that was read successfully, when in fact this never happened. - Kerrek SB

In your case, this usage does not cause your problem, but as Kerrek explains might happen, masks it.

You can replace that with fgets(...):

char lineBuf[1000];//make length longer or shorter for your purpose
file = fopen("/Users/Daniel/Library/Developer/Xcode/DerivedData/project_Audioscrobbler_Artists-hgwyqpinuoxayzbmvarcjxryqnrz/Build/Products/Debug/artist_data.txt", "r");
if(!file) return -1;
while(fgets (lineBuf, sizeof(lineBuf), file))
{
    //process each line here
    //But processing Japanese characters
    //will require special considerations.
    //Refer to the link below for UNICODE tips
}

Unicode in C and C++...

In particular, you will need to use variable types that are sufficient for containing the different size characters you will be processing. The link discusses this in great detail.

Here is an excerpt:

"char" no longer means character
I hereby recommend referring to character codes in C programs using a 32-bit unsigned integer type. Many platforms provide a

"wchar_t" (wide character) type, but unfortunately it is to be avoided since some compilers allot it only 16 bits—not enough to represent Unicode. Wherever you need to pass around an individual character, change "char" to "unsigned int" or similar. The only remaining use for the "char" type is to mean "byte".

Edit:
In the comments above, you state but the string it's failing on is 66 bytes long. Because you are reading into a 'char' array, the bytes necessary to complete the character were truncated one byte before including the last necessary byte. ASCII characters can be contained in a single char space. Japanese characters cannot. If you were using an array of unsigned int instead of array of char, the last byte would have been included.

Community
  • 1
  • 1
ryyker
  • 22,849
  • 3
  • 43
  • 87
3

OP's code failed because the result of fscanf() was not checked.

fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name);

The fscanf() read in 65 char of "エリス レジーナ/アントニオ カルロス ジョビン". Yet this string, encoded in UTF8, has a length of 66. The last 'ン' is codes 227, 131, 179 (octal 343 203 263) and only the last 2 were read. When artist_name is printed the following appears.

エリス レジーナ/アントニオ カルロス ジョビ\343\203

Now begins the problem. The last char 179 remains in in file. On the next fscanf(), it fails as char 179 does not convert into a int ("%d"). So fscanf() returns 0. Since code did not check the result of fscanf(), it does not realize artist_ID and artist_name are left over from before and so prints the same text.

As feof() is never true for the char 179 is not consumed, we have infinite loop.

The while(!feof(file)) hid this problem, but did not cause it.

The fgets() proposed by @ryyker is a good approach. Another is:

while (fscanf(file, "%d\t%65[^\t\n]\n", &artist_ID, artist_name) == 2) {
    temp = create_play(artist_ID, artist_name, 0, -1);
    head = add_play(head, temp);
    printf("%s\n", artist_name);
    }

IOWs, validate the results of *scanf().

Community
  • 1
  • 1
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256