
I'm completely new to R and the tm package, so please excuse my basic question ;-) How can I show the text of a plain-text corpus in R's tm package?

I've loaded 323 plain text files into a corpus:

library(tm)
src <- DirSource("Korpora/technologie")
corpus <- Corpus(src)

But when I call the corpus with:

corpus[[1]]

I always get some output like this instead of the corpus text itself:

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 144
Content:  chars: 141
Content:  chars: 224
Content:  chars: 75
Content:  chars: 105

How can I show the text of the corpus?

Thanks!

UPDATE Reproducible sample: I've tried it with the built-in sample text:

> data("crude")
> crude
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
> crude[1]
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 1
> crude[[1]]
<<PlainTextDocument>>
Metadata:  15
Content:  chars: 527

How can I print the text of the documents?

UPDATE 2: Session Info:

> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C                   
[5] LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-1  NLP_0.1-7

loaded via a namespace (and not attached):
[1] parallel_3.1.3 slam_0.1-32    tools_3.1.3   
Azrael
  • Welcome to SO. Please provide a minimal reproducible example: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – lukeA May 25 '15 at 09:37

8 Answers

This works for me with the latest version of tm to print the content text:

corpus[[1]]$content

Note: This is more or less what Ricky suggested in a comment on another answer. Sorry, I wanted to write this as a comment, but my rep is only 25 (a minimum of 50 rep is needed to comment).
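As a minimal sketch of the same idea, using the built-in `crude` sample data instead of the asker's corpus: the `content()` accessor and `as.character()` should return the same text as `$content`. The variable names (`doc`, `txt_field`, and so on) are only illustrative.

```r
library(tm)

data("crude")        # sample corpus of 20 Reuters articles shipped with tm
doc <- crude[[1]]    # double brackets extract a single document

# Three ways to reach the raw text of the document
txt_field    <- doc$content        # direct access to the list element
txt_accessor <- content(doc)       # accessor function
txt_char     <- as.character(doc)  # coercion to a character vector

# Print the text with real line breaks instead of "\n" escapes
cat(txt_accessor, sep = "\n")
```

The accessor forms have the advantage of not depending on how the document object is laid out internally.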

silo
  • This works. Does anyone know why this needs to be added? The brackets used to work alone without adding $content – Ryan Chase Sep 28 '15 at 17:00

You can try converting your corpus text into a data frame and accessing the required text from the data frame itself. I have used the built-in sample data "crude" (from the tm package) as an example.

library(tm)
data("crude")
dataframe <- data.frame(text = unlist(sapply(crude, `[`, "content")), stringsAsFactors = FALSE)

dataframe[1,]
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n    The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n    \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n    Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
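A possible variation, sketched here under the assumption that you also want to know which document each row came from: keep the document ID next to the text. The column names `id` and `text` and the helper variables are made up for this example.

```r
library(tm)

data("crude")

# One string per document; paste() guards against multi-line content vectors
texts <- vapply(crude,
                function(d) paste(as.character(d), collapse = "\n"),
                character(1))

# Document IDs taken from each document's metadata
ids <- vapply(crude,
              function(d) as.character(meta(d, "id")),
              character(1))

df <- data.frame(id = ids, text = texts, stringsAsFactors = FALSE)

# Show the first document with real line breaks
cat(df$text[1], "\n")
```

With the ID column in place, a single document can be looked up by ID rather than by row position.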
Analytical Monk

Here is a simple and direct way to display the text of a corpus:

strwrap(corpus[[1]])

For the crude data this will output

[1] "Diamond Shamrock Corp said that effective today it had cut its contract"      
[2] "prices for crude oil by 1.50 dlrs a barrel.  The reduction brings its posted" 
[3] "price for West Texas Intermediate to 16.00 dlrs a barrel, the copany said."   
[4] "\"The price reduction today was made in the light of falling oil product"     
[5] "prices and a weak crude oil market,\" a company spokeswoman said.  Diamond is"
[6] "the latest in a line of U.S. oil companies that have cut its contract, or"    
[7] "posted, prices over the last two days citing weak oil markets.  Reuter"
S. Elzwawi

I can confirm that as of tm 0.6-1, inspect no longer pretty-prints the text. You can pair tm with the qdap package, which I maintain, to convert a corpus easily to a data.frame as follows:

library(qdap)
as.data.frame(crude)

To make it more like the old inspect behavior, you can use:

as.data.frame(crude) %>%
    with(., invisible(sapply(text, function(x) {strWrap(x); cat("\n\n")})))

This looks like:

Diamond Shamrock Corp said that effective today it had cut its
contract prices for crude oil by 1.50 dlrs a barrel. The reduction
brings its posted price for West Texas Intermediate to 16.00 dlrs a
barrel, the copany said. "The price reduction today was made in the
light of falling oil product prices and a weak crude oil market," a
company spokeswoman said. Diamond is the latest in a line of U.S. oil
companies that have cut its contract, or posted, prices over the last
two days citing weak oil markets. Reuter


OPEC may be forced to meet before a scheduled June session to
readdress its production cutting agreement if the organization wants
to halt the current slide in oil prices, oil industry analysts said.
"The movement to higher oil prices was never to be as easy as OPEC
thought. They may need an emergency meeting to sort out the
problems," said Daniel Yergin, director of Cambridge Energy Research
Associates, CERA. Analysts and oil industry sources said the problem
OPEC faces is excess oil supply in world oil markets. "OPEC's problem
is not a price problem but a production issue and must be addressed
in that way," said Paul Mlotok, oil analyst with Salomon Brothers
Inc. He said the market's earlier optimism about OPE
.
.
.
Tyler Rinker

From the tm Vignette, this works:

writeLines(as.character(doc.corpus[[8]]))

Where '8' is whatever element number you wish
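If you want to look at several documents at once, a loop around the same call is one option. This is only a sketch: it uses the built-in `crude` data rather than `doc.corpus`, and the header line built from the document ID is my own addition.

```r
library(tm)

data("crude")

# Print the first three documents, each preceded by a header with its ID
for (i in 1:3) {
  cat("----", meta(crude[[i]], "id"), "----\n")
  writeLines(as.character(crude[[i]]))
  cat("\n")
}

# Capture the printed lines of one document, e.g. for further processing
first_doc_lines <- capture.output(writeLines(as.character(crude[[1]])))
```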

jonmrich
Barry DeCicco

We can get the content of every item in the corpus.

data("crude")
out <- sapply(crude, function(x){x$content})
out 

# optionally export: writes one .txt file per document; the "outputdir" directory must already exist
writeCorpus(crude, path = "outputdir")
Selva
> inspect(crude[1])
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

$`reut-00001.xml`
<<PlainTextDocument (metadata: 15)>>
Diamond Shamrock Corp said that
effective today it had cut its contract prices for crude oil by
1.50 dlrs a barrel.
    The reduction brings its posted price for West Texas
Intermediate to 16.00 dlrs a barrel, the copany said.
    "The price reduction today was made in the light of falling
oil product prices and a weak crude oil market," a company
spokeswoman said.
    Diamond is the latest in a line of U.S. oil companies that
have cut its contract, or posted, prices over the last two days
citing weak oil markets.
 Reuter
Ricky
  • Sorry, did not work:
    > inspect(crude[1]) <<VCorpus>> Metadata: corpus specific: 0, document level (indexed): 0 Content: documents: 1 $`reut-00001.xml` <<PlainTextDocument>> Metadata: 15 Content: chars: 527 >
    – Azrael May 25 '15 at 09:55
  • That is interesting, it works fine on mine. Can you try `crude[1]$content`? – Ricky May 25 '15 at 09:58
  • The same. I use RStudio, maybe that's the problem or did I miss some setting in RStudio?
    UPDATE: Same in R console
    – Azrael May 25 '15 at 10:00
  • I use RStudio so I don't think that's the case. Can you do `sessionInfo()` and paste the output to the questions for others to see also? I suspect conflict in packages. – Ricky May 25 '15 at 10:05
  • @Ricky I think this print behavior changed from 0.6 to 0.6-1 of tm though it isn't documented in the NEWS file. – Tyler Rinker May 25 '15 at 14:47
  • @Ricky do you know why adding $content solves the issue? – Ryan Chase Sep 28 '15 at 16:59
  • @RyanChase type `str(crude[[1]])` and you'll see that the underlying structure is a list of lists: each document is a list with two elements, `content` and a sub-list `meta` that holds the document's other attributes. `crude[[1]]$content` simply accesses the content values directly without going through the convenience functions. – Ricky Oct 05 '15 at 07:20
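The structure described in that last comment can be checked with a short sketch like the following. It assumes the built-in `crude` data; the variable names are illustrative only.

```r
library(tm)

data("crude")
doc <- crude[[1]]

# A document is a classed list; these are its top-level elements
print(names(doc))

# Compact view of the underlying structure, limited to two levels
str(unclass(doc), max.level = 2)

# The text and the metadata can be reached separately
text_only <- doc$content
doc_meta  <- meta(doc)   # all metadata tags of this document
```

Single brackets (`crude[1]`) return a corpus containing one document, while double brackets (`crude[[1]]`) return the document itself, so `$content` gives the text only in the second case.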

I had the same issue, and corpus[[1]]$content worked for me.

Barsha