dirent not working with unicode

Question

I am trying to count the files in a folder, but the readdir function skips files whose names contain Unicode characters. I am using dirent in C.

int filecount(char* path)
{
    int file_Count=0;
    DIR* dirp;
    struct dirent * entry;
    dirp = opendir(path);
    while((entry=readdir(dirp)) !=NULL)
    {
        if(entry->d_type==DT_REG)
        {
            ++file_Count;
        }
    }
    closedir(dirp);
    return file_Count;
}

Welcome to Stack Overflow. Please read the [About] page soon. One of the first questions that springs to mind is 'which platform are you working on?' The next question is 'how have you determined that it is skipping Unicode file names'? Your code doesn't show the names being printed. You've not given much indication of example file names that you created with non-ASCII names. A related question is: are you using UTF-8 or UTF-16 names — which is partially related to the platform question; Linux and Unix use UTF-8; Windows may use UTF-16 instead. — Jonathan Leffler, Jan 07 '14 at 19:31

score 4 · Answer 1 · answered Jan 07 '14 at 20:31

Testing on Mac OS X 10.9.1 Mavericks, I adapted your code into the following complete program:

#include <dirent.h>
#include <stdio.h>

static
int filecount(char *path)
{
    int file_Count = 0;
    DIR *dirp;
    struct dirent *entry;
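    dirp = opendir(path);
    if (dirp == NULL)
    {
        /* opendir failed (missing directory, no permission, ...) */
        perror(path);
        return 0;
    }
    while ((entry = readdir(dirp)) != NULL)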
    {
        printf("Found (%llu)(%d): %s\n", entry->d_ino, entry->d_type, entry->d_name);
        if (entry->d_type == DT_REG)
        {
            ++file_Count;
        }
    }
    closedir(dirp);
    return file_Count;
}

static void proc_dir(char *dir)
{
    printf("Processing %s:\n", dir);
    printf("File count = %d\n", filecount(dir));
}

int main(int argc, char **argv)
{
    if (argc > 1)
    {
        for (int i = 1; i < argc; i++)
            proc_dir(argv[i]);
    }
    else
        proc_dir(".");
    return 0;
}

Notably, it lists each entry as it is returned — inode, type and name. On Mac OS X, I got told that the inode type was __uint64_t aka unsigned long long, hence the use of %llu for the format; YMMV on that.
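Since the width of ino_t differs between platforms, one way to sidestep the format question is to cast the inode to uintmax_t and print it with %ju. The following is a minimal sketch of that idea; format_inode is a hypothetical helper name and is not part of the program above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical helper: format an inode number portably.
 * Casting to uintmax_t and using %ju works whatever the
 * underlying width of ino_t is on the platform.
 * Returns the value returned by snprintf. */
int format_inode(char *out, size_t outsz, ino_t ino)
{
    assert(out != NULL && outsz > 0);
    return snprintf(out, outsz, "%ju", (uintmax_t)ino);
}
```

In the listing program, the same cast could be applied directly in the printf call, for example printf("Found (%ju)", (uintmax_t)entry->d_ino).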

I also created a folder utf8 and in the folder created files:

total 32
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:14 ÿ-y-umlaut
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:15 £
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:14 €
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:15 ™

Each file contained Hello plus a newline. When I run the command (I called it fc), it gives:

$ ./fc utf8
Processing utf8:
Found (8138036)(4): .
Found (377579)(4): ..
Found (8138046)(8): ÿ-y-umlaut
Found (8138067)(8): £
Found (8138054)(8): €
Found (8138078)(8): ™
File count = 4
$

The Euro symbol € is U+20AC EURO SIGN, which is way outside the range of ordinary single-byte code sets. The pound symbol £ is U+00A3 POUND SIGN, so that's in the range of the Latin 1 alphabet (ISO 8859-1, 8859-15). The trademark symbol ™ is U+2122 TRADE MARK SIGN, also outside the range of ordinary single-byte code sets.
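To make the point about code sets concrete, here is a small sketch of how a code point maps to UTF-8 bytes. utf8_encode is a hypothetical helper written for illustration only; it follows the standard UTF-8 bit layout and is not something readdir() itself uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * out must have room for 4 bytes. Returns the number of bytes
 * written, or 0 if cp is a surrogate or above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: e.g. U+00A3 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: e.g. U+20AC */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this encoding, U+00A3 becomes the two bytes C2 A3, while U+20AC and U+2122 become the three-byte sequences E2 82 AC and E2 84 A2, so none of these names is a single byte in UTF-8.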

This shows that on at least some platforms, readdir() works perfectly well with UTF-8 encoded file names using Unicode characters that are not in the Latin1 character set. It also demonstrates how I'd go about debugging the problem — and/or illustrates what I'd like you to run (the program above) and the sort of directory you should run it on to make your case that readdir() on your platform does not like Unicode file names.
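If a name is displayed oddly on your terminal, it may help to look at the raw bytes that readdir() returned rather than at the rendered text. The sketch below is one possible way to do that; hex_name is a hypothetical helper, not part of the program above, and the buffer size is up to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of name into out as
 * space-separated lowercase hex, e.g. "c2 a3".
 * Stops early if out is too small.
 * Returns the number of bytes of name that were formatted. */
size_t hex_name(const char *name, char *out, size_t outsz)
{
    size_t n = 0;     /* bytes of name consumed */
    size_t used = 0;  /* characters written to out */

    assert(name != NULL && out != NULL && outsz > 0);
    out[0] = '\0';
    for (; name[n] != '\0'; n++) {
        /* Worst case per byte: separator + 2 hex digits + NUL. */
        if (used + 4 > outsz)
            break;
        used += (size_t)snprintf(out + used, outsz - used, "%s%02x",
                                 n ? " " : "",
                                 (unsigned)(unsigned char)name[n]);
    }
    return n;
}
```

Calling hex_name(entry->d_name, buf, sizeof buf) inside the readdir() loop and printing buf next to the name would show whether the stored bytes are valid UTF-8 even when the terminal cannot render them.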

herohuyongtao · Answer 2 · 2014-01-07T19:38:31.067

2

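Try changing

if(entry->d_type==DT_REG)

to

if((entry->d_type==DT_REG || entry->d_type==DT_UNKNOWN)
    && strcmp(entry->d_name,".")!=0 && strcmp(entry->d_name,"..")!=0)

which also counts entries whose type is reported as DT_UNKNOWN, so files that the file system does not classify are no longer skipped. This needs #include <string.h> for strcmp.

Note that strcmp(entry->d_name,".")!=0 and strcmp(entry->d_name,"..")!=0 are used to exclude the special entries . and .. (the directory itself and its parent). They do not exclude ordinary subdirectories, so a subdirectory reported as DT_UNKNOWN would still be counted.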

edited Jan 07 '14 at 19:38

answered Jan 07 '14 at 19:20


Now it sometimes returns a number that is higher than the actual number of files – user3170449 Jan 07 '14 at 19:26
Updated to exclude `.` and `..`. – herohuyongtao Jan 07 '14 at 19:39
Thanks for the help, but the name of the Unicode files in the dirent* entry is "?", and the same goes for the files that the function treats as actual files. I also searched for hidden files in the directory, but there aren't any. – user3170449 Jan 07 '14 at 20:05
smartass remark: neither . nor .. are subdirectories, and the code does not exclude any subdirectories. – Remember Monica Oct 09 '15 at 16:42
