How do you print a small sample, or the first line, of a corpus in R using the tm package? I have a very large corpus (> 1 GB) and am doing some text cleaning. I would like to test as I apply each cleaning procedure, so printing just the first line, or the first few lines, of the corpus would be ideal.

# Load Libraries
library(tm)

# Read in Corpus
corp <- SimpleCorpus(DirSource("C:/TextDocument"))

# Remove punctuation (tm_map applies the transformation to the corpus)
corp <- tm_map(corp, removePunctuation,
               preserve_intra_word_contractions = TRUE,
               preserve_intra_word_dashes = TRUE)

I have tried accessing the corpus several ways:

# Print first line of first element of corpus
corp[[1]][[1]] 

# Print first line using 'content' element of corpus
corp[[1]]$content[[1]]

Both of these result in very long run times without the desired output.

The crude corpus in the tm package can be used for example purposes.

data("crude")
JHall651
    Why don't you first take a subset of your corpus, run all the text-cleaning tests on it, and then apply them to the full corpus? Or switch to quanteda, which works in parallel. Also, the fastest way of getting info out of the corpus is corp[[1]]$content[[1]]. You can run some tests with microbenchmark to check. – phiver Apr 21 '18 at 10:05
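
A minimal sketch of the subset-first approach suggested in the comment above, using the crude example corpus rather than the original 1 GB data. The names `small`, `small_clean`, `before` and `after` are illustrative only, and the 200-character preview length is an arbitrary choice. Whether this is fast enough on a very large SimpleCorpus has not been measured here.

```r
library(tm)

# 'crude' is the example corpus shipped with tm.
data("crude")

# Single-bracket indexing returns a smaller corpus, not a single document.
small <- crude[1:2]

# Apply the cleaning step to the subset only.
small_clean <- tm_map(small, removePunctuation,
                      preserve_intra_word_contractions = TRUE,
                      preserve_intra_word_dashes = TRUE)

# Compare the first 200 characters of the first document before and after.
before <- substr(paste(content(small[[1]]), collapse = " "), 1, 200)
after  <- substr(paste(content(small_clean[[1]]), collapse = " "), 1, 200)
print(before)
print(after)
```

Once the cleaning steps look right on the subset, the same tm_map calls can be run on the full corpus.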

1 Answer

strwrap does this job nicely since it prints your paragraphs formatted by breaking lines at word boundaries. (See ?strwrap.) Then you can use the head function to see the first 6 lines.

 head(strwrap(corp))
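
As a hedged variation on the call above, strwrap can be applied to the text of a single document instead of the whole corpus, so only one element is wrapped. This sketch uses the crude example corpus; `first_doc` and `wrapped` are illustrative names, and the width of 80 is an arbitrary choice. It has not been timed against a large corpus.

```r
library(tm)

# 'crude' is the example corpus shipped with tm.
data("crude")

# Extract the text of the first document only, as a character vector.
first_doc <- content(crude[[1]])

# Wrap at word boundaries and show the first few wrapped lines.
wrapped <- strwrap(first_doc, width = 80)
print(head(wrapped, 3))
```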
hpesoj626
    `strwrap` works fine for the crude data, but with my corpus it takes many minutes on a fast machine. I had luck getting a very small sample of each element by trying `str(corp)`, but there is a lot of undesired additional output. Is there a faster way? – JHall651 Apr 21 '18 at 02:41
    @JHall651, did you ever find an answer to this query or find a way that takes less time? Having the same issue here. Thank you. – Shawn Nov 23 '20 at 21:48