I have a large Corpus object built from 3 large files (>1 GB total).
After cleaning the text, I want to look at a random sample of the data, say 1000 lines, on my console to check that it looks OK.
I am unable to find any source on how to sample data from a Corpus class in a reasonable time (within about 1 minute).
Some of the code I ran:
writeLines(as.character(docs), con="testing.txt")
head(strwrap(corp))
There are a lot of solutions here for viewing the entire data, but again they take too long.
The worst part is that the only way to stop the processes started by the code above is to shut down the console! I also looked at corpus_sample. The closest to what I want came from str(), which printed the first line of the first document, and nothing more, in record time.
This answer seemed promising, but it turns out the corpus object doesn't have documents$texts in it (corp$documents$texts).
- Why does nobody seem to need this feature?
- Is there a way to quickly sample a few random lines?
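To make the second question concrete, here is a minimal sketch of the behaviour I am after. It runs on a plain named list of character vectors standing in for the corpus, not on a real Corpus object; `docs_list` and `sample_lines` are hypothetical names of my own, and I don't know whether a Corpus can be indexed this cheaply. The idea is to pick the line numbers first and only then touch the text, instead of converting everything with as.character().

```r
# Hypothetical stand-in for the corpus: one character vector of lines per
# document. This is NOT a real Corpus object.
docs_list <- list(
  doc1 = c("first line of doc1", "second line of doc1", "third line of doc1"),
  doc2 = c("first line of doc2", "second line of doc2"),
  doc3 = c("only line of doc3")
)

# Draw up to n random lines without pasting all documents into one big string.
sample_lines <- function(doc_list, n) {
  counts <- lengths(doc_list)                       # number of lines per document
  total  <- sum(counts)
  picks  <- sort(sample.int(total, min(n, total)))  # global line numbers, in order
  doc_id <- rep(seq_along(counts), counts)[picks]   # document each pick falls in
  offset <- sequence(counts)[picks]                 # line position inside that document
  # Only the selected lines are ever extracted.
  vapply(seq_along(picks),
         function(i) doc_list[[doc_id[i]]][offset[i]],
         character(1))
}

set.seed(1)  # only to make the sample reproducible
writeLines(sample_lines(docs_list, 3))
```

On the toy list this returns immediately; what I can't work out is how to get the same "index first, extract later" behaviour from the Corpus class itself.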
P.S.
A very similar question was asked here.