Text Mining PDFs - Convert List of Character Vectors (Strings) to Dataframe

Question

I'm using text mining packages to read a group of PDF documents into plaintext, and I want to export this plaintext to a dataframe/CSV/text files (to facilitate further analysis with RTextTools).

First, I pulled the PDF documents into a VCorpus using the tm package. The tm package's VCorpus object stores a list of documents, each a "PlainTextDocument"/"TextDocument" object holding metadata and plaintext. For example: "Metadata: DocumentName1"... and the content, "The terms of X are...".

    library(tm)

    docs <- VCorpus(DirSource(getwd()), readerControl = list(reader = readPDF))
    # Creates large VCorpus containing ~700 PlainTextDocuments
    # (which contain strings/character vectors)

It was unclear to me how to process this into a dataframe, so I hunted down a package with a utility function that converts the corpus into a list of strings.

    library(textreg)
    strings <- convert.tm.to.character(docs)
    # Converts VCorpus to large list of strings with document content

From either the VCorpus or this list of strings, I'd like to create a data frame with just one row, in which each column contains one document's text and the column names correspond to the original filenames.

First I looked at this page, Export a list into a CSV or TXT file in R, and tried using sapply:

    df <- data.frame(text = sapply(docs, as.character), stringsAsFactors = FALSE)
    # Error during wrapup: arguments imply differing number of rows: 1, 5, 3, 3889, 3366
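
The error above can probably be reproduced without tm. The sketch below uses a small made-up list (`docs_demo` and its file names are hypothetical) in place of the corpus, on the assumption that each document yields a character vector with a different number of lines. In that case `sapply()` cannot simplify the result and returns a list, and `data.frame()` then tries to turn each document into its own column.

```r
# Hypothetical stand-in for the corpus: documents with different line counts
docs_demo <- list(a.pdf = c("line 1", "line 2"),
                  b.pdf = c("line 1", "line 2", "line 3"))

# sapply() cannot simplify results of unequal length, so it returns a list
res <- sapply(docs_demo, as.character)
is.list(res)   # TRUE
lengths(res)   # number of lines per document

# data.frame() treats each list element as a column and fails because
# the columns would have different numbers of rows
err <- tryCatch(
  data.frame(text = res, stringsAsFactors = FALSE),
  error = function(e) conditionMessage(e)
)
err
```

Collapsing each document to a single string before building the data frame avoids the mismatch.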

I've also found related threads (R tm package vcorpus: Error in converting corpus to data frame), but found them difficult since they tend to use simpler Corpus objects.

Is there a simpler way I can transform my list of strings or VCorpus to a dataframe, say using dplyr/tidyr/purrr?

Any suggestions on improving my hacked-together solution much appreciated.

Edit: Sample of data

Each element of my list is a character vector holding the full text of one document, one string per line. For example,

    strings[3]

yields this output:

[16] "Table of Contents"
[17] "Page"
[18] ""
[19] "Contracting Parties"
[20] ""
[21] "5"
. . .

[379] "“Affiliate” means:"
[380] "(a)"
[381] ""
[382] "a company or any other entity in which any of the Parties holds, either directly or indirectly, the absolute"
[383] "majority of the votes in the shareholders’ meeting or is the holder of more than fifty percent (50%) of the rights"
[384] "and interests which confer the power of management on that company or entity, or has the power of"
[385] "management and control over such company or entity;"

emilliman5 · Accepted Answer · 2017-09-22T17:26:31.377


This should do the trick:

# dummy data generation: file names and a list of strings (your corpus)
files <- paste("file", 1:6)

strings <- list("a", "b", "c", "d", "e", "f")
names(strings) <- files
t(as.data.frame(unlist(strings)))

#             file 1 file 2 file 3 file 4 file 5 file 6
# unlist(strings) "a"    "b"    "c"    "d"    "e"    "f"

Edit based on data structure edit

files <- paste("file", 1:6)

strings <- list(c("a","b"),c("c", "d"),c("e","f"),
                c("g","h"), c("i","j"), c("k", "l"))

names(strings) <- files
t(data.frame(Doc = sapply(strings, paste0, collapse = " ")))

#     file 1 file 2 file 3 file 4 file 5 file 6
# Doc "a b"  "c d"  "e f"  "g h"  "i j"  "k l"
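
Note that `t()` returns a character matrix rather than a data frame. If an actual data frame or a CSV file is needed, something along these lines may help. It is only a sketch on dummy data: the `.pdf` file names, the `wide` and `long` objects, and the temporary output file are assumptions for illustration, not part of the original corpus.

```r
# Hypothetical stand-in for the list of per-document character vectors
strings <- list(c("a", "b"), c("c", "d"), c("e", "f"))
names(strings) <- c("file 1.pdf", "file 2.pdf", "file 3.pdf")  # assumed names

# Collapse each document's lines into a single string
collapsed <- vapply(strings, paste, character(1), collapse = " ")

# Wide form: one row, one column per document. check.names = FALSE keeps
# column names such as "file 1.pdf" from being rewritten by make.names()
wide <- data.frame(as.list(collapsed), check.names = FALSE,
                   stringsAsFactors = FALSE)

# Long form: one row per document, often easier to analyse further
long <- data.frame(doc = names(collapsed), text = unname(collapsed),
                   stringsAsFactors = FALSE)

# Export the long form to a temporary CSV file
out <- tempfile(fileext = ".csv")
write.csv(long, out, row.names = FALSE)
```

With ~700 documents the long form is likely the more practical of the two, since a single-row data frame with 700 columns is awkward to inspect and filter.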

edited Sep 22 '17 at 17:26

answered Sep 22 '17 at 15:55


This should work but for some reason I'm getting documentname.pdf1 | "The", documentname.pdf2 | "reason", documentname.pdf3 | "that". Not sure why it feels the need to split the text that way. I seem to be getting individual string tokens separated by spaces or line breaks instead of the full document text that was in the corpus for every document. What do you think is going wrong? (If this didn't occur this solution does work though.) – dad Sep 22 '17 at 16:15
Please edit your post to include a sample of your data. – emilliman5 Sep 22 '17 at 16:31
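
One plausible explanation for the `documentname.pdf1`, `documentname.pdf2`, ... names in the comment above: when a list element holds more than one string, `unlist()` flattens all of them and appends a running number to the element name. A minimal sketch with a made-up one-document list (the name and contents are hypothetical):

```r
# Hypothetical one-document list whose element holds several strings
strings <- list(documentname.pdf = c("The", "reason", "that"))

# unlist() flattens every element and numbers the names to keep them distinct
flat <- unlist(strings)
names(flat)   # "documentname.pdf1" "documentname.pdf2" "documentname.pdf3"

# Collapsing first yields exactly one string per document
one_per_doc <- sapply(strings, paste, collapse = " ")
one_per_doc   # "The reason that", named documentname.pdf
```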
