I have a large Corpus object built from 3 large files (>1 GB total).

After cleaning the text, I want to print a random sample of the data, say 1000 lines, to my console to check that it looks OK.

I am unable to find any source on how to sample data from a Corpus object in a reasonable time (1 minute).

Some of the code I ran:

writeLines(as.character(docs), con="testing.txt")

head(strwrap(corp))

There are a lot of solutions here to visualize the entire data, but again, they take too long.

The worst part is that the only way to stop the processes started by the code above is to shut down the console. I also looked at corpus_sample. The closest I got to what I want came from str(), which printed the first line of the first document, and nothing more, in record time.

This answer seemed promising, but it turns out the corpus object doesn't have documents$texts in it (corp$documents$texts).

  1. Why does nobody seem to need this feature?
  2. Is there a way to quickly sample a few random lines?
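One direction for question 2, as a minimal sketch rather than a tested solution: draw random document indices first, then convert only those documents to text, instead of calling as.character() on the whole corpus. The helper name `sample_docs` and the `toy` data are hypothetical, and the sketch assumes the corpus object supports `length()` and `[[i]]` indexing, which may not hold for every corpus class.

```r
# Hypothetical helper: sample n documents without converting the whole corpus.
# Assumption: `corp` supports length() and [[i]], and as.character() on a
# single element returns that document's text.
sample_docs <- function(corp, n = 10) {
  k <- min(n, length(corp))
  # Pick k random document positions without replacement.
  idx <- sample.int(length(corp), k)
  # Convert only the sampled documents, joining multi-line ones.
  vapply(
    idx,
    function(i) paste(as.character(corp[[i]]), collapse = "\n"),
    character(1)
  )
}

# Stand-in data: a plain list of character vectors, not a real corpus.
toy <- list(
  "first document",
  c("second document", "with two lines"),
  "third document"
)

set.seed(42)
writeLines(sample_docs(toy, 2))
```

Because only the sampled documents are converted, the cost should scale with the sample size rather than with the size of the corpus, provided single-document indexing is cheap for the corpus class in use.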

P.S.

A very similar question was asked here.

agent18
  • Is your data in a Corpus from tm or in a corpus from quanteda? corpus_sample comes from the quanteda package, but you are listing tm in the tag section. – phiver Jul 06 '19 at 08:54
  • I was looking at the tm package, but then moved on to quanteda, which was much better in speed, usability, and RAM usage. I will write a follow-up soon. I was able to sample from the corpus in quanteda using `kwic` and `texts(corpus_object)`. I will check out `corpus_sample` and let you know; that sounds like what I want. – agent18 Jul 15 '19 at 14:46

0 Answers