
I have a CSV file that contains Korean characters, but I am not sure how to get the Korean text to print correctly with the code I have.

The csv file looks like this:

name,hp,damage
대학오리,20,5
대학냥이,30,10
시계탑기린,100,20

My code:

#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char name[1000];
    int hp;
    int damage;
} Monster;

typedef struct {
    char header1[sizeof "name"];
    char header2[sizeof "hp"];
    char header3[sizeof "damage"];
} Header;
int main()
{
    FILE* fp = fopen("entityData.csv", "r");
    if (!fp) {
        printf("Error opening file\n");
        return 1;
    }

    Monster monsters[100];
    int num_records = 0;

    char line[100];
    Header header;
    fgets(line, sizeof line, fp);
    strncpy(header.header1, strtok(line, ","), sizeof header.header1);
    strncpy(header.header2, strtok(NULL, ","), sizeof header.header2);
    strncpy(header.header3, strtok(NULL, "\n"), sizeof header.header3);

    while (fgets(line, sizeof(line), fp))
    {
        char* token = strtok(line, ","); // split on ',' and store the first piece in token
        strncpy(monsters[num_records].name, token, 20);

        token = strtok(NULL, ",");
        monsters[num_records].hp = atoi(token);

        token = strtok(NULL, ",");
        monsters[num_records].damage = atoi(token);

        num_records++;
    }

    for (int i = 0; i < num_records; i++)
    {
        printf("%s:%s %s:%d %s:%d\n",
            header.header1, monsters[i].name,
            header.header2, monsters[i].hp,
            header.header3, monsters[i].damage);
    }
        

    fclose(fp);
    return 0;
}

The program I wrote reads the csv file above and should print it like this:

name:대학오리 hp:20 damage:5
name:대학냥이 hp:30 damage:10
name:시계탑기린 hp:100 damage:20

Instead, the name part comes out garbled.

After some searching, I realized that Korean letters take up 2 bytes each, which does not fit in a single char. I tried using wchar, but that led to errors, and now I am stuck.

I know that asking such a question on an English website isn't ideal, but I'm hoping someone here can help.

Mr Cake
  • https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ – n. m. could be an AI May 14 '23 at 16:10
  • You can also try [searching this site](https://stackoverflow.com/search?q=%5Bc%5D+%5Bwindows%5D+unicode+read+from+file). – n. m. could be an AI May 14 '23 at 16:20
  • Do also take a look at your previous question. The code would benefit from Allan's suggestions. – Harith May 14 '23 at 16:40
  • Just splitting on comma isn't the most robust way to read CSV, because CSV allows for columns that contain a comma (even newlines are possible, depending on your definition of "CSV", so even reading line by line can be an issue). Works ok if you have control over the inputs and disallow commas in column data. – Paul Dempsey May 15 '23 at 00:55
  • Well, CSV _means_ comma-separated. You can have commas and new lines in fields only if they are double-quoted. There is an [RFC for this](https://datatracker.ietf.org/doc/html/rfc4180). Yes, reading "lines" doesn't work robustly, without paying attention to double quotes. – Mark Adler May 15 '23 at 06:14

1 Answer


There's nothing wrong with your code. It's Windows that's messed up. (It works perfectly fine on Linux and Macs.) Do this to remedy the problem with Windows:

Enable the new UTF-8 option in Windows settings. Go to the language settings, click Administrative language settings, then Change system locale… and tick the Beta: Use Unicode UTF-8 for worldwide language support option. Restart your computer.

Then languages in UTF-8 will display correctly in terminals.

Yes, the number of bytes can exceed the number of characters. The names are most likely stored as UTF-8, which encodes each character in one to four bytes. Each of your Korean characters takes three bytes, not two. However, a comma is still a single byte (0x2C), and in UTF-8 every byte of a multi-byte character has its high bit set, so a comma byte can never appear inside another character's encoding. Your code therefore finds the end of the name string correctly.

See this answer for more (much more) on character encodings in Windows.

Mark Adler
  • You don't really know whether the file is encoded as UTF-8. On Windows this is not a given. – n. m. could be an AI May 14 '23 at 17:47
  • You can't deterministically know an arbitrary file's encoding on any system, period. Not just whether it's UTF-8, and files can come from anywhere because internet. However, checking if a file is _valid_ UTF-8 (decodes to Unicode) is a pretty good heuristic. If it fails, and the file is not Unicode with a file signature, there's a decent chance it's in the same codepage as the current system codepage. – Paul Dempsey May 15 '23 at 00:43
  • @PaulDempsey Fortunately OP doesn't need to. There is one single file to work with, and it is known what *abstract characters* it contains, so OP can determine the encoding. We cannot, because pasting the content of the file to an SO question doesn't preserve its original encoding. – n. m. could be an AI May 15 '23 at 09:02