
I have a folder with more than 2,000 RTF documents. I want to import them into R, preferably into a data frame that can be used with the tidytext package. I also need an additional column holding the filename, so that I can link the content of each RTF document to its file (later, I will also have to extract information from the filename and save it into separate columns of my data set).

I came across a solution by Jens Leerssen that I tried to adapt to my requirements:

require(textreadr)

read_plus <- function(flnm) {
    read_rtf(flnm) %>%
        mutate(filename = flnm)
}

tbl_with_sources <-
    list.files(path = "./data", pattern = "*.rtf",
               full.names = TRUE) %>%
    map_df(~read_plus(.))

However, I get the following error message:

Error in UseMethod("mutate_") : no applicable method for 'mutate_' applied to an object of class "character"

Can anyone tell me why this error occurs or propose another solution to my problem?

feder80

1 Answer


I finally solved the problem with a workaround.

1) I converted the *.rtf files to *.txt files using the textutil command in the macOS Terminal:

find . -name \*.rtf -print0 | xargs -0 textutil -convert txt

This also strips the formatting.

2) I then used Jens Leerssen's read_plus function, but with read.delim instead of read_rtf, and set two options (stringsAsFactors and quote) to get rid of warnings and errors:

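library(dplyr)  # %>%, mutate(), rename()
library(purrr)  # map_df()

# Read one text file (one row per line) and record which file it came from.
read_plus <- function(flnm) {
    read.delim(flnm, header = FALSE, stringsAsFactors = FALSE, quote = "") %>%
        mutate(filename = flnm)
}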

3) Finally, I read in all the *.txt files and renamed the column V1 at the end:

# pattern is a regular expression, not a glob, so match the extension explicitly
df <- list.files(path = "./data", pattern = "\\.txt$",
                 full.names = TRUE) %>%
    map_df(~read_plus(.)) %>%
    rename(paragraph = V1)
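
As far as I can tell, the original error occurs because read_rtf returns a plain character vector (the message says class "character"), and mutate only works on data frames. Wrapping the text in a data frame first avoids it. Below is a minimal base-R sketch of the same idea for the converted .txt files, without dplyr or purrr. The helper names read_lines_plus and read_folder are made up for illustration, and it assumes one paragraph per line:

```r
# Hypothetical helper: read one text file into a data frame,
# one row per non-empty line, with the source file name attached.
read_lines_plus <- function(flnm) {
  lines <- readLines(flnm, warn = FALSE, encoding = "UTF-8")
  lines <- lines[nzchar(trimws(lines))]  # drop blank lines
  data.frame(
    paragraph = lines,
    filename  = rep(basename(flnm), length(lines)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: apply read_lines_plus() to every .txt file in a folder
# and stack the results into one data frame.
read_folder <- function(path) {
  files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
  do.call(rbind, lapply(files, read_lines_plus))
}
```

Calling read_folder("./data") should then give one data frame with the columns paragraph and filename. Note that this sketch stores only the base name of each file rather than the full path, and that it returns NULL if the folder contains no .txt files.
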
feder80