
I have a corpus of 622 lengthy txt files (ca. 20,000-30,000 words per file) that I'm trying to explore in R. I have done some basic text mining using the tm package and would now like to delve into topic modeling. However, being very new to this, I'm already struggling with some basics of data preparation. A sample of the files I'm currently working with is available here: http://s000.tinyupload.com/?file_id=46554569218218543610

  1. I'm assuming that just feeding these lengthy documents into a topic modeling tool is pointless, so I would like to break them up into paragraphs (or, alternatively, sets of perhaps 300-500 words, seeing as there are a lot of redundant paragraph breaks and OCR errors in my data). Would you do this within the VCorpus, or should I actually divide my source files (e.g. with a shell script)? Any suggestions or experiences?

  2. The text comes from OCR'ed magazine articles, so if I split my docs up, I'm thinking I should add a metadata tag to these paragraphs that tells me which issue each one originally came from (basically just the original file name), correct? Is there a way to do this easily?

  3. Generally speaking, can anyone recommend a good hands-on introduction to topic modeling in R? A tutorial that takes me by the hand like a third-grader would be great, actually. I am using the documentation of 'topicmodels' and 'lda', but the learning curve is rather steep for a novice. Edit: Just to be clear, I have already read a lot of the popular introductions to topic modeling (e.g. Scott Weingart and the MALLET tutorials for historians); I was thinking of something specific to the processes in R.

Hope that these questions aren't entirely redundant. Thanks for taking the time to read!

Markus D
  • This may be a better fit for [Cross Validated](http://stats.stackexchange.com/) (ie, stats.SE)--it's hard to say. *Please don't cross-post, though*. If you don't get a satisfactory answer here, you can flag your Q & ask the moderators to migrate it. – gung - Reinstate Monica Oct 29 '13 at 14:23
  • Did you look at the topicmodels vignette? http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf – Tyler Rinker Oct 29 '13 at 15:04
  • Thanks for your comments. 1. I will check Cross Validated in the future, much appreciated. 2. The topicmodels vignette is very helpful regarding the theoretical background and gives me a vague idea of the functions and commands needed for the actual topic modeling process. But the data in the example is prepared already, so it does not really help me with how to best pre-process it. Thank you though! – Markus D Oct 29 '13 at 15:59

2 Answers


I recently had a similar project; usually, at least some of these steps are done:

  • Stop-word removal: you can do this easily via removeWords(yourCorpus, stopwords("english")) from the tm package. Furthermore, you can construct your own stop-word list and remove it via the same function. (The first few steps in this list are sketched in code after it.)
  • Usually, numbers and punctuation are also removed (see the tm package).
  • Also very common are stemming (see Wikipedia for an explanation) and removing sparse terms; this helps to reduce the dimensions of your term-document matrix with only little loss of information (both available in the tm and RWeka packages).
  • Some people also like to work only with nouns/proper nouns or noun phrases. See here for an overview; some word lists and part-of-speech dictionaries can be found on Kevin's Word List Page.
  • Regarding splitting into paragraphs: this should be possible with the NGramTokenizer from the RWeka package; see the tm package FAQ.
  • A nice article about pre-processing in general can be found here, or, for a more scientific treatment, here.
  • Regarding metadata management, see the tm package vignette.
  • One more example of R + topic models can be found in Ponweiser 2012.
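
A minimal sketch of the first few steps above, assuming corpus is an existing tm (v0.6+) corpus; the custom stop-word list and the 0.99 sparsity threshold are placeholders to adapt, not recommendations:

    library(tm)        # removeWords, stemDocument, removeSparseTerms, ...
    library(SnowballC) # stemming backend used by stemDocument

    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    corpus <- tm_map(corpus, removeWords, c("fig", "ibid"))  # your own stop-word list
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stemDocument)
    corpus <- tm_map(corpus, stripWhitespace)

    # build the term-document matrix and drop very sparse terms
    tdm <- TermDocumentMatrix(corpus)
    tdm <- removeSparseTerms(tdm, 0.99)  # drop terms absent from > 99% of docs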

I learned that text mining is a bit different: things which improve results in one case do not work in another. It takes a lot of testing to find out which parameters and which pre-processing steps improve your results... so have fun!

holzben
  • That's terrific advice and I'll put it to use. Much appreciated! Appears I don't have the reputation to upvote you guys yet, but oh well, thank you! – Markus D Oct 30 '13 at 00:13

There's no code in your question, so it's not really suitable for this site. That said, here are some comments that might be useful. If you supply code you'll get more specific and useful answers.

  1. Yes. Breaking the text into chunks is common and advisable; exact sizes are a matter of taste. It is often done within R, though I've also done it before making the corpus. You might also subset only nouns, as @holzben suggests. Here is some code for cutting a corpus into chunks (a hypothetical call is sketched after this list):

    corpus_chunk <- function(x, corpus, n) {
      # convert corpus to a list of character vectors
      message("converting corpus to list of vectors...")
      listofwords <- vector("list", length(corpus))
      for (i in seq_along(corpus)) {
        listofwords[[i]] <- corpus[[i]]
      }
      message("done")
      # divide each vector into chunks of n words
      # from http://stackoverflow.com/q/16232467/1036500
      f <- function(x) {
        y <- unlist(strsplit(x, " "))
        ly <- length(y)
        split(y, gl(ly %/% n + 1, n, ly))
      }
      message("splitting documents into chunks...")
      listofnwords1 <- sapply(listofwords, f)
      listofnwords2 <- unlist(listofnwords1, recursive = FALSE)
      message("done")
      # append IDs to list items so we can get bibliographic data for each chunk
      lengths <- sapply(seq_along(listofwords), function(i) length(listofnwords1[[i]]))
      names(listofnwords2) <- unlist(lapply(seq_along(lengths), function(i) rep(x$bibliodata$x[i], lengths[i])))
      names(listofnwords2) <- paste0(names(listofnwords2), "_", unlist(lapply(lengths, function(x) seq_len(x))))
      return(listofnwords2)
    }
    
  2. Yes, tagging each chunk with its source file name is sensible (one way to do it is sketched after this list). You might make a start with some code and then come back with a more specific question. That's how you'll get the most out of this site.

  3. For a basic introduction to text mining and topic modelling, see Matthew Jockers' book Text Analysis with R for Students of Literature.
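
Tying points 1 and 2 together: below is a minimal, hypothetical sketch. It assumes x$bibliodata$x (as in the function above) holds your original file names, rebuilds a chunk-level corpus whose "origin" metadata records the source issue, and then fits a model with topicmodels. The tag name "origin", the 500-word chunk size, and k = 20 are illustrative choices, not recommendations:

    library(tm)
    library(topicmodels)

    # cut the corpus into ~500-word chunks; chunk names carry the source file IDs
    chunks <- corpus_chunk(x, corpus, 500)

    # rebuild a corpus with one document per chunk
    chunk_corpus <- VCorpus(VectorSource(sapply(chunks, paste, collapse = " ")))

    # record the original file/issue each chunk came from as document metadata
    # (strip the trailing "_<chunk number>" that corpus_chunk appends)
    for (i in seq_along(chunk_corpus)) {
      meta(chunk_corpus[[i]], "origin") <- sub("_\\d+$", "", names(chunks)[i])
    }

    # the usual topicmodels route from here
    dtm <- DocumentTermMatrix(chunk_corpus)
    lda_model <- LDA(dtm, k = 20, control = list(seed = 1234))
    terms(lda_model, 10)   # top ten terms per topic
    topics(lda_model, 1)   # most likely topic per chunk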

If you're already a little familiar with MALLET, then try rmallet for topic modelling. There are lots of code snippets on the web that use it; here's one of mine.
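
A minimal sketch of that workflow with the mallet package; the stop-word file name and all parameter values are placeholders, and docs is assumed to be a data frame with one chunk per row:

    library(mallet)

    # docs: a data frame with columns "id" and "text" (one chunk per row)
    instances <- mallet.import(docs$id, docs$text, "en_stopwords.txt",
                               token.regexp = "[\\p{L}]+")

    topic_model <- MalletLDA(num.topics = 20)
    topic_model$loadDocuments(instances)
    topic_model$train(200)  # number of sampling iterations

    topic_words <- mallet.topic.words(topic_model, smoothed = TRUE, normalized = TRUE)
    mallet.top.words(topic_model, topic_words[1, ], 10)  # top words of topic 1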

Ben
  • Thank you for your suggestions, Ben. Apologies for going off-topic with this. I should have realized that. Won't happen again. – Markus D Oct 30 '13 at 00:11
  • 1
    No worries, your first two questions would be perfect for this forum if you just add a bit of code to show what you've already tried. Why not ask them separately when you've got some code ready (and you can code something that others here can reproduce)? – Ben Oct 30 '13 at 02:50
  • That snippet of code looks really useful. I'll try to work on my code and will post it in a separate question when I come across the next hurdle (surely there is one to come). :) Thank you! – Markus D Oct 31 '13 at 02:23