I have a corpus of 622 lengthy txt files (ca. 20,000-30,000 words per file) that I'm trying to explore in R. I have done some basic text mining with the tm package and would now like to move on to topic modeling. However, being very new to this, I'm already struggling with some basics of data preparation. A sample of the files I'm currently working with is available here: http://s000.tinyupload.com/?file_id=46554569218218543610
I'm assuming that just feeding these lengthy documents into a topic modeling tool is pointless. So I would like to break them up into paragraphs (or, alternatively, chunks of perhaps 300-500 words, since my data contains a lot of redundant paragraph breaks and OCR errors). Would you do this within the VCorpus, or should I split the source files themselves beforehand (e.g. with a shell script)? Any suggestions or experiences?
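To make clearer what I have in mind, here is a rough sketch of doing the chunking in R itself rather than in the shell. The directory name and the 500-word chunk size are just placeholders, not something I'm committed to:

    # Sketch: split each source file into ~500-word chunks before building a corpus.
    library(tm)

    chunk_words <- 500
    files <- list.files("corpus_txt", pattern = "\\.txt$", full.names = TRUE)

    chunks <- lapply(files, function(f) {
      # read the file as whitespace-separated tokens
      words <- scan(f, what = character(), quote = "", quiet = TRUE)
      if (length(words) == 0) return(character(0))
      starts <- seq(1, length(words), by = chunk_words)
      # paste each run of chunk_words tokens back into one string
      vapply(starts, function(i) {
        paste(words[i:min(i + chunk_words - 1, length(words))], collapse = " ")
      }, character(1))
    })
    names(chunks) <- basename(files)

Is something along these lines a reasonable way to go, or is pre-splitting the files the more common approach?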
The text comes from OCR'ed magazine articles, so if I split my docs up, I'm thinking I should add a metadata tag to each chunk that tells me which issue it originally came from (basically just the original file name), correct? Is there an easy way to do this?
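Continuing from the chunking sketch above, this is roughly how I imagine attaching the original file name as document-level metadata once the chunks are wrapped in a VCorpus (the "origin" tag name is just my own guess, not anything prescribed):

    # Sketch: wrap the chunks in a VCorpus and record the source file per chunk.
    library(tm)

    docs    <- unlist(chunks, use.names = FALSE)            # one element per chunk
    origins <- rep(names(chunks), times = lengths(chunks))  # source file per chunk

    corpus <- VCorpus(VectorSource(docs))
    for (i in seq_along(corpus)) {
      # store which issue / file this chunk came from
      meta(corpus[[i]], "origin") <- origins[i]
    }

Is that the idiomatic way to handle per-document metadata in tm, or is there a cleaner mechanism I'm missing?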
Generally speaking, can anyone recommend a good hands-on introduction to topic modeling in R? A tutorial that takes me by the hand like a third-grader would be great, actually. I am working from the documentation of 'topicmodels' and 'lda', but the learning curve is rather steep for a novice. Edit: Just to be clear, I have already read many of the popular introductions to topic modeling (e.g. Scott Weingart's posts and the MALLET tutorials for historians). I was thinking of something specific to the workflow in R.
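For context, the bare-bones pipeline I've pieced together so far from the 'topicmodels' documentation looks something like the following; the cleaning steps and the number of topics k are placeholders I'm not at all sure about:

    # Sketch of the pipeline as I currently understand it (corpus from the steps above).
    library(tm)
    library(topicmodels)

    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    dtm <- DocumentTermMatrix(corpus)
    dtm <- dtm[slam::row_sums(dtm) > 0, ]   # drop chunks that ended up empty

    lda_fit <- LDA(dtm, k = 20, control = list(seed = 1234))
    terms(lda_fit, 10)                      # top 10 terms per topic

A tutorial that walks through these steps (and explains the choices, like k) in more detail is really what I'm after.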
Hope that these questions aren't entirely redundant. Thanks for taking the time to read!