
I am trying to load multiple .txt files into a corpus. I've set up the working directory and then have the following to load the files:

filenames <- list.files(getwd(),pattern="*.txt", full.names=FALSE)

The problem is, some of the text file names have special characters (they are people's names), and I can't find a way to change the encoding to UTF-8 with list.files(), and I'm not sure how to load in many .txt files without list.files(). I also can't remove the special characters in this case.

Any suggestions? Thanks in advance!

Edit: Working in Windows

1 Answer


The pattern argument won't match filenames whose encoding is wrong. Use list.files() without pattern=... and you will at least get character strings for the mis-encoded filenames, which you can then work with and possibly fix in R.

This is a minimal demonstrating example (it needs the convmv system command to set up the test case):

    dir.create( wd <- tempfile() )
    setwd(wd)

    convmv <- Sys.which("convmv")
    if( convmv == "" )
        stop("Need the convmv available to continue")

    f1 <- "æøå.txt"
    cat( "foo\n", file=f1 )
    system2( convmv, args=c("-f", "utf8", "-t", "latin1", "--notest", f1) )

    f2 <- "ÆØÅ.txt"
    cat( "bar\n", file=f2 )

    plain.list.files <- list.files()
    stopifnot( length( plain.list.files ) == 2 )

    with.pattern.list.files <- list.files( pattern="\\.txt" )
    stopifnot( length( with.pattern.list.files ) == 1 )
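If you still want to filter by extension, one workaround (a minimal base-R sketch, separate from the convmv demo above; directory and file names here are made up) is to list everything and then match the bytes yourself: useBytes = TRUE makes grepl() compare raw bytes, so names with a broken or unknown encoding don't derail the regex engine.

```r
## Sketch: list everything, then match ".txt" byte-wise in R.
dir.create(demo.dir <- tempfile())
old.wd <- setwd(demo.dir)

writeLines("foo", "plain.txt")

all.files <- list.files()                             # no pattern= here
txt.files <- all.files[grepl("\\.txt$", all.files, useBytes = TRUE)]

setwd(old.wd)
```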

Fixing the character set can be done, but I'm not sure if you're asking about that at this point.
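As an aside, one way to attempt such a fix (a sketch, assuming the bad names really are latin1) is base R's iconv(), which actually converts the bytes rather than just relabelling them:

```r
## Hypothetical mis-encoded filename: the raw latin1 bytes for "æøå.txt".
bad  <- "\xe6\xf8\xe5.txt"

## Convert the bytes from latin1 to UTF-8.
good <- iconv(bad, from = "latin1", to = "UTF-8")
```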

EDIT: Actually working with or fixing these filenames:

Now that you can list the files, however mangled their names may be, the following might help if you know the encoding, for example latin1. Ironically, detect_str_enc() doesn't get it right here (and I found no good alternative), but if you know that any filename that isn't ASCII or UTF-8 will be latin1, then this might be a working fix for you:

    library(uchardet)

    hard.coded.encoding <- "latin1"

    nice.filenames <- sapply( plain.list.files, function(fname) {
        if( !detect_str_enc(fname) %in% c("ASCII","UTF-8") ) {
            Encoding(fname) <- hard.coded.encoding
        }
        return( fname )
    })

    ## Now it's presumably safe to look for our pattern:
    i.txt <- grepl( "\\.txt$", nice.filenames )

    ## And we can now work with the files and present them nicely:

    file.data <- lapply( plain.list.files[i.txt], function(fname) {
        ## Do what you want to do with the file here:
        readLines( fname )
    })
    names(file.data) <- nice.filenames[i.txt]
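Since the goal is a corpus, the cleaned names and text can then go into one. A sketch, assuming the tm package (the question doesn't name a package, so that is a guess); file.data is the named list built above, stubbed out here so the snippet runs on its own:

```r
library(tm)                  # assumption: OP is building a tm corpus

## Stand-in for the file.data built above (name and text are made up):
file.data <- list("Ana.txt" = c("foo", "bar"))

## Collapse each file's lines to one string and wrap them in a VCorpus.
docs   <- vapply(file.data, paste, character(1), collapse = "\n")
corpus <- VCorpus(VectorSource(docs))
```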
Sirius
  • I don’t think this solution will help: on systems that support convmv, R already supports UTF-8 filenames just fine. The issue is either that OP is using Windows where filenames are internally UTF-16 (with some caveats which I can’t remember), or that the filenames are not valid UTF-8 to begin with (which is possible on some filesystems). – Konrad Rudolph Mar 05 '21 at 09:06
  • I think so too, and I am not saying convmv will help him; I am using it to demonstrate how `list.files()` will list even the wrongly encoded filenames, whereas `list.files( pattern="txt" )` with a pattern argument will not. This offers OP a way to list the files in the first place, which he did not have. Again, convmv is just setting up the demo test case, not providing the solution. – Sirius Mar 05 '21 at 09:55
  • Oh that’s clever, didn’t notice. – Konrad Rudolph Mar 05 '21 at 10:12
  • Thank you so much for your comments, they were helpful to continue troubleshooting. Removing `pattern` did allow me to load them into the corpus, but then I got the "cannot open file: No such file or directory" error so I think the ultimate problem must be something else. The files appear in the environment but then I can't do anything with them further. – Katherine Drotos Mar 05 '21 at 16:01
  • Is this on a linux system? If so would it be an option to just fix all files using the convmv command? – Sirius Mar 05 '21 at 22:22