I'm using text mining packages to read a group of PDF documents into plaintext, and I want to export this plaintext to a data frame, CSV, or text files (to facilitate further analysis with RTextTools).

First, I pulled the PDF documents into a VCorpus using the tm package. The VCorpus object stores a list of "PlainTextDocument"/"TextDocument" objects, each holding metadata and plaintext, i.e. "Metadata: DocumentName1"... and the content, "The terms of X are...".

    library(tm)

    docs <- VCorpus(DirSource(getwd()), readerControl = list(reader = readPDF))
    # Creates large VCorpus containing ~700 PlainTextDocuments
    # (which contain strings/character vectors)

It was unclear to me how to process this into a data frame, so I hunted down a package with a utility function that converts it into a list of strings.

    library(textreg)
    strings <- convert.tm.to.character(docs)
    # Converts VCorpus to large list of strings with document content

From either the VCorpus or this list of strings, I'd like to create a data frame with just one row, where each column contains one document's text and the column names correspond to the original filenames.

First I looked at this page, Export a list into a CSV or TXT file in R, and tried using sapply:

    df <- data.frame(text = sapply(docs, as.character), stringsAsFactors = FALSE)
    # Error during wrapup: arguments imply differing number of rows: 1, 5, 3, 3889, 3366

I've also found related threads (R tm package vcorpus: Error in converting corpus to data frame), but found them difficult to apply since they tend to use simpler Corpus objects.

Is there a simpler way I can transform my list of strings or VCorpus to a dataframe, say using dplyr/tidyr/purrr?

Any suggestions on improving my hacked-together solution would be much appreciated.

Edit: Sample of data

Each element of my list contains a string(/chr vector) with a full document in text. For example,

    strings[3]

yields this output:

[16] "Table of Contents"
[17] "Page"
[18] ""
[19] "Contracting Parties"
[20] ""
[21] "5"
. . .

[379] "“Affiliate” means:"
[380] "(a)"
[381] ""
[382] "a company or any other entity in which any of the Parties holds, either directly or indirectly, the absolute"
[383] "majority of the votes in the shareholders’ meeting or is the holder of more than fifty percent (50%) of the rights"
[384] "and interests which confer the power of management on that company or entity, or has the power of"
[385] "management and control over such company or entity;"

dad

1 Answer

This should do the trick:

# Dummy data: file names and a list of strings (standing in for your corpus)
files <- paste("file", 1:6)

strings <- list("a", "b", "c", "d", "e", "f")
names(strings) <- files
t(as.data.frame(unlist(strings)))

#             file 1 file 2 file 3 file 4 file 5 file 6
# unlist(strings) "a"    "b"    "c"    "d"    "e"    "f"  
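
Note that this relies on every list element being a single string. As a small sketch with made-up document names: when an element is a character vector of several lines, unlist() flattens it and appends an index to the element's name, so one document turns into several columns.

```r
# Hypothetical data: the first document is a vector of two lines, not one string
strings <- list("doc1.pdf" = c("The", "reason"), "doc2.pdf" = "that")

flat <- unlist(strings)

# Elements longer than 1 get an index appended to their name
names(flat)
# "doc1.pdf1" "doc1.pdf2" "doc2.pdf"

# Three entries instead of two documents
length(flat)
```

Collapsing each element to a single string before calling unlist() avoids the extra columns.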

Edit based on data structure edit

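files <- paste("file", 1:6)

# Each document is now a character vector with several lines
strings <- list(c("a", "b"), c("c", "d"), c("e", "f"),
                c("g", "h"), c("i", "j"), c("k", "l"))

names(strings) <- files
# Collapse each document to one string, then transpose to a single row
t(data.frame(Doc = sapply(strings, paste0, collapse = " ")))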

#     file 1 file 2 file 3 file 4 file 5 file 6
# Doc "a b"  "c d"  "e f"  "g h"  "i j"  "k l"  
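
Since t() returns a character matrix rather than a data frame, here is a sketch of one way to keep the result as a one-row data frame and export it. The sample list, the temporary paths, and the variable names are made up for illustration; check.names = FALSE is used so that names containing spaces or dots are left as they are.

```r
# Made-up stand-in for the real list: one character vector of lines per document
strings <- list(
  "file 1" = c("Table of Contents", "Page"),
  "file 2" = c("Contracting Parties", "", "5")
)

# Collapse each document's lines into a single string
collapsed <- vapply(strings, paste, character(1), collapse = " ")

# One row, one column per document; check.names = FALSE keeps the names as is
df <- data.frame(as.list(collapsed), check.names = FALSE,
                 stringsAsFactors = FALSE)

# Write the one-row table to a CSV file (temporary path for illustration)
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)

# Alternatively, write one text file per document
out_dir <- tempfile("txt_")
dir.create(out_dir)
for (nm in names(collapsed)) {
  writeLines(collapsed[[nm]], file.path(out_dir, paste0(nm, ".txt")))
}
```

When reading the CSV back, read.csv(csv_path, check.names = FALSE) should preserve the original column names.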
emilliman5
  • This should work but for some reason I'm getting documentname.pdf1 | "The", documentname.pdf2 | "reason", documentname.pdf3 | "that". Not sure why it feels the need to split the text that way. I seem to be getting individual string tokens separated by spaces or line breaks instead of the full document text that was in the corpus for every document. What do you think is going wrong? (If this didn't occur this solution does work though.) – dad Sep 22 '17 at 16:15
  • Please edit your post to include a sample of your data. – emilliman5 Sep 22 '17 at 16:31
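
The splitting described in the first comment is what one would expect if each document's content is a character vector with one element per line. A possible way around it, sketched here directly on a corpus, is to collapse the lines of each document before building the data frame. This assumes the tm package is installed; the two-document corpus and its ids are made up, and the assumption that the document id matches the file name for PDFs read through DirSource should be checked against the real corpus.

```r
library(tm)

# Made-up two-document corpus standing in for the PDFs; each document's
# content is a character vector with one element per line
docs <- as.VCorpus(list(
  PlainTextDocument(c("The reason", "that"), id = "a.pdf"),
  PlainTextDocument(c("Table of Contents", "Page"), id = "b.pdf")
))

# Collapse each document's lines into one string
text <- vapply(docs, function(d) paste(as.character(d), collapse = " "),
               character(1))

# Name the strings by document id (assumed to be the file name)
names(text) <- vapply(docs, function(d) meta(d, "id"), character(1))

# One row, one column per document
df <- data.frame(as.list(text), check.names = FALSE,
                 stringsAsFactors = FALSE)
```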