How can I convert a corpus into a data frame in R that also contains the metadata? I already tried the suggestion from "convert corpus into data.frame in R", but the resulting data frame only contains the text lines from all documents in the corpus. I also need the document ID, and ideally the line number of each text line, as two additional columns. How can I extend this command to get that data?

    dataframe <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=FALSE)

I already tried

    dataframe <-
        data.frame(id=sapply(corpus, meta(corpus, "id")),
                   text=unlist(sapply(corpus, `[`, "content")),
                   stringsAsFactors=F)

but it didn't help; I only got the error message "Error in match.fun(FUN) : 'meta(corpus, "id")' ist nicht Funktion, Zeichen oder Symbol" (in English: "is not a function, character or symbol").

The corpus is extracted from plain text files; here is an example:

    > str(corpus)
    [...]
    $ 1178531510 :List of 2
      ..$ content: chr [1:67] " uberrasch sagt [...] gemacht echt schad verursacht" ...
      ..$ meta   :List of 7
      .. ..$ author       : chr(0) 
      .. ..$ datetimestamp: POSIXlt[1:1], format: "2015-08-16 14:44:11"
      .. ..$ description  : chr(0) 
      .. ..$ heading      : chr(0) 
      .. ..$ id           : chr "1178531510" # <--- This is the ID i want in the data.frame
      .. ..$ language     : chr "de"
      .. ..$ origin       : chr(0) 
      .. ..- attr(*, "class")= chr "TextDocumentMeta"
      ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
    [...]

Many thanks in advance :)

Azrael
  • `sapply(corpus, meta(corpus, "id"))` should be `sapply(corpus, meta, "id")` – scoa Aug 16 '15 at 15:13
  • Thanks, this seems to work a little better, but now I get this error: `Error in data.frame(id = sapply(corpus, meta, "id"), text = unlist(sapply(corpus, : Argumente implizieren unterschiedliche Anzahl Zeilen: 323, 10012` (in English: "arguments imply differing number of rows: 323, 10012") – Azrael Aug 16 '15 at 15:34
  • Well, your code with my correction works for the example dataset `data(acq) ; corpus <- acq`, so the problem is probably in your data: you get 323 ids but 10012 text entries (I would guess, but my German is rusty)... What is the length of your corpus? Could you post a sample of the corpus that reproduces the problem? – scoa Aug 16 '15 at 16:52
  • The corpus contains 323 documents, but 10012 lines of text. In the dataframe, every line of text is a row. – Azrael Aug 16 '15 at 21:10
  • You could try this: `unlist(lapply(sapply(corpus, `[`, "content"),paste,collapse="\n"))` – scoa Aug 16 '15 at 21:53
  • I tried `unlist(lapply(sapply(corpus, `[`, "content"),paste,collapse="\n"))` but when I created the data frame with the code above I got the error `Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"` – Azrael Aug 17 '15 at 06:48
  • You need to share some data. The error suggests that `corpus` is not a corpus object. – scoa Aug 17 '15 at 11:00
  • Now I tried `dfcorpus <- data.frame(id=sapply(corpus, meta, "id"), text=unlist(lapply(sapply(corpus, `[`, "content"),paste,collapse="\n")), stringsAsFactors=F)` and it worked! Thank you! :) – Azrael Aug 18 '15 at 08:00

1 Answer

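There are two problems. First, you should not repeat the `corpus` argument inside `sapply`: pass the function `meta` itself, followed by its extra argument `"id"`. Second, multi-paragraph texts are stored as character vectors of length > 1, which you should paste together before unlisting so that each document yields exactly one row.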

    dataframe <-
        data.frame(
            # one id per document
            id=sapply(corpus, meta, "id"),
            # collapse each document's lines into a single string
            text=unlist(lapply(sapply(corpus, '[', "content"), paste, collapse="\n")),
            stringsAsFactors=FALSE)
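
If you would rather keep one row per text line, with the document ID and the line number as columns (as the question originally asked), the IDs can be repeated once per line instead of collapsing the text. The following is only a minimal sketch in base R: `docs` is a hypothetical plain list that mimics the `content` / `meta$id` layout shown in the `str()` output, not a real tm corpus, and `long_df` is an illustrative name. With a real corpus you would presumably fetch the pieces with tm's `content(doc)` and `meta(doc, "id")` instead of `$`.

```r
# Hypothetical stand-in for the corpus (assumption): a named list of documents,
# each with `content` (one string per text line) and `meta$id`.
docs <- list(
  "1178531510" = list(content = c("first line", "second line", "third line"),
                      meta = list(id = "1178531510", language = "de")),
  "1178531511" = list(content = c("only line"),
                      meta = list(id = "1178531511", language = "de"))
)

# One id per document, and the number of text lines in each document
ids     <- unname(vapply(docs, function(d) d$meta$id, character(1)))
n_lines <- unname(lengths(lapply(docs, function(d) d$content)))

# Repeat each id once per line so all three columns have the same length
long_df <- data.frame(
  id   = rep(ids, times = n_lines),   # document id, repeated per line
  line = sequence(n_lines),           # line number, restarting at 1 per document
  text = unlist(lapply(docs, function(d) d$content), use.names = FALSE),
  stringsAsFactors = FALSE
)
```

This avoids the "differing number of rows" error from the comments because `id`, `line` and `text` all have one entry per text line rather than one per document.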
scoa