  1. I have a number of PDF documents, which I have read into a corpus with library tm. How can one break the corpus into sentences?

  2. It can be done by reading the files with readLines, followed by sentSplit from package qdap [*]. That function requires a dataframe, and it would also require abandoning the corpus and reading all the files individually.

  3. How can I pass function sentSplit {qdap} over a corpus in tm? Or is there a better way?

Note: there was a function sentDetect in library openNLP, which is now Maxent_Sent_Token_Annotator; the same question applies: how can this be combined with a corpus [tm]?

– Henk

7 Answers


I don't know how to reshape a corpus but that would be a fantastic functionality to have.

I guess my approach would be something like this:

Using these packages

# Load Packages
require(tm)
require(NLP)
require(openNLP)

I would set up my text-to-sentences function as follows:

convert_text_to_sentences <- function(text, lang = "en") {
  # Function to compute sentence annotations using the Apache OpenNLP Maxent sentence detector employing the default model for language 'en'. 
  sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = lang)

  # Convert text to class String from package NLP
  text <- as.String(text)

  # Sentence boundaries in text
  sentence.boundaries <- annotate(text, sentence_token_annotator)

  # Extract sentences
  sentences <- text[sentence.boundaries]

  # return sentences
  return(sentences)
}

And here is my hack of a reshape-corpus function (NB: you will lose the meta attributes here unless you modify this function and copy them over appropriately; see the sketch after the example below):

reshape_corpus <- function(current.corpus, FUN, ...) {
  # Extract the text from each document in the corpus and put it into a list
  # (Content is the accessor in tm 0.5.x; newer tm versions use content)
  text <- lapply(current.corpus, Content)

  # Basically convert the text
  docs <- lapply(text, FUN, ...)
  docs <- as.vector(unlist(docs))

  # Create a new corpus structure and return it
  new.corpus <- Corpus(VectorSource(docs))
  return(new.corpus)
}

Which works as follows:

## create a corpus
dat <- data.frame(doc1 = "Doctor Who is a British science fiction television programme produced by the BBC. The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor. He explores the universe in his TARDIS (acronym: Time and Relative Dimension in Space), a sentient time-travelling space ship. Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired. Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.",
                  doc2 = "The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive (2005–10) awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.[3][4] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor. In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody \"for evolving with technology and the times like nothing else in the known television universe.\"[5]",
                  doc3 = "The programme is listed in Guinness World Records as the longest-running science fiction television show in the world[6] and as the \"most successful\" science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.[7] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music (originally produced by the BBC Radiophonic Workshop).",
                  stringsAsFactors = FALSE)

current.corpus <- Corpus(VectorSource(dat))
# A corpus with 3 text documents

## reshape the corpus into sentences (modify this function if you want to keep meta data)
reshape_corpus(current.corpus, convert_text_to_sentences)
# A corpus with 10 text documents
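
If you need to keep the metadata, here is a minimal sketch of a meta-preserving variant. The answer above was written against tm 0.5.x, so this assumes the newer tm (>= 0.6) accessors content() and meta(), and simply has each sentence inherit its parent document's id and language; it is untested and the names are illustrative.

reshape_corpus_keep_meta <- function(current.corpus, FUN, ...) {
  # Split each document into sentences, wrapping each sentence in a
  # PlainTextDocument that carries the parent document's id and language
  sentence.docs <- unlist(lapply(current.corpus, function(doc) {
    sentences <- FUN(content(doc), ...)
    lapply(sentences, function(s) {
      PlainTextDocument(s,
                        id = meta(doc, "id"),
                        language = meta(doc, "language"))
    })
  }), recursive = FALSE)

  # Coerce the flat list of sentence documents back into a corpus
  as.VCorpus(sentence.docs)
}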

My sessionInfo() output:

> sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
  [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
  [1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
  [1] NLP_0.1-0     openNLP_0.2-1 tm_0.5-9.1   

loaded via a namespace (and not attached):
  [1] openNLPdata_1.5.3-1 parallel_3.0.1      rJava_0.9-4         slam_0.1-29         tools_3.0.1  
– Tony Breyal
  • I adapted your first code block into a separate function. However, I get an error: Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "c("Simple_Sent_Token_Annotator", "Annotator")" to a data.frame. See my gist here: https://gist.github.com/simkimsia/9ace6002cc758d5a303a – Kim Stacks Nov 15 '14 at 10:38
  • @KimStacks I got the exact same problem. It disappeared after I relaunched RStudio, but resurfaced later. Did you figure out what's going on here? – Logan Yang Nov 23 '14 at 01:57
  • @LoganYang in the end I got what I needed using library("qdap") and its own native sent_detect. See this: http://stackoverflow.com/a/26961090/80353 – Kim Stacks Nov 23 '14 at 05:44
  • @KimStacks I found the problem. It was because ggplot2 and openNLP both have their own annotate function, and I loaded ggplot2 after openNLP, so the annotate object was masked by ggplot2. Try loading openNLP after ggplot2 and it'll be fine. – Logan Yang Nov 25 '14 at 08:41
  • Hi! What is "Content" in the reshape_corpus function? Nice solution :) – woodstock Nov 30 '15 at 13:32
  • @woodstock Thanks, I'd forgotten about this function. "Content" was a function from the "tm" package which basically extracted text from a document within a corpus. I think in the newest version of the package it is called "content_transformer" and you can find an example of it in the tm package by doing ?tm_map and ?content_transformer – Tony Breyal Dec 09 '15 at 12:53
  • @TonyBreyal thank you very much! I could need this function in the future! :) – woodstock Dec 10 '15 at 11:23

openNLP had some major changes. The bad news is it looks very different than it used to. The good news is that it's more flexible and the functionality you enjoyed before is still there, you just have to find it.

This will give you what you're after:

?Maxent_Sent_Token_Annotator

Just work through the example and you'll see the functionality you're looking for.
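
For reference, a minimal sketch along the lines of that help page's example (the sample text here is illustrative, and it assumes current NLP and openNLP are attached):

library(NLP)
library(openNLP)

# Build a sentence annotator and apply it to a String object
s <- as.String("This is sentence one. This is sentence two.")
sent_token_annotator <- Maxent_Sent_Token_Annotator()
a <- annotate(s, sent_token_annotator)

# Index the String by the annotations to extract the sentences
# (returns a character vector, one element per sentence)
s[a]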

– Tyler Rinker
  • Hi Tyler, have done that, and get: > sent_token_annotator <- Maxent_Sent_Token_Annotator() Error: could not find function "Maxent_Sent_Token_Annotator". Libraries openNLP and NLP loaded. Also, how can this be applied on a corpus? For a dataframe we have the supersimple sentDetect {qdap}. – Henk Sep 11 '13 at 06:18
  • I think you might have old versions of `openNLP` and/or `NLP`. Use `packageDescription("openNLP")["Version"]` and if it's not `"0.2-1"` then use `install.packages("openNLP")`. – Tyler Rinker Sep 11 '13 at 12:03
  • The reason I'm pushing you this way is that `qdap` has very specific exceptions of how your data is cleaned (all abbreviations removed). Additionally, `sentSplit` is designed as a data manipulation to reshape the data in a way that `qdap` expects for other functions. You're more interested in changing a corpus. – Tyler Rinker Sep 11 '13 at 12:23
  • Tx...updated openNLP to "0.2-1" and NLP is at "0.1-0". I copied example text straight from documentation but still get the error message "> sent_token_annotator <- Maxent_Sent_Token_Annotator() Error: could not find function "Maxent_Sent_Token_Annotator"" – Henk Sep 11 '13 at 13:19
  • Can you provide `sessionInfo()`? – Tyler Rinker Sep 11 '13 at 13:33
  • The message would be too long; where can I send this info, Tyler? – Henk Sep 11 '13 at 13:42
  • Post it in your question. If it's really too long, then consider placing it in a Dropbox public folder as a .txt file and provide the link here. – Tyler Rinker Sep 11 '13 at 14:02
  • [cut and paste session info]: R version 3.0.1 (2013-05-16) Platform: x86_64-pc-linux-gnu (64-bit) other attached packages: [1] NLP_0.1-0 openNLP_0.2-1 ggplot2_0.9.3.1 gdata_2.13.2 plyr_1.8 reshape2_1.2.2 maxent_1.3.3 tm_0.5-9.1 SparseM_1.03 Rcpp_0.10.4 [10] loaded via a namespace (and not attached):openNLPdata_1.5.3-1 – Henk Sep 11 '13 at 14:28
  • Restarted the computer this morning and everything works now. The remaining question is the original one: how can i apply a sentence splitter on a corpus? – Henk Sep 12 '13 at 06:50
  • You can create your own function and apply that just like you did with `sentDetect` before. I have done this with `tagPOS` [here](https://github.com/trinker/qdap/blob/master/R/pos.R) (see the second function in the file). I basically took the example and reworked it into the function. – Tyler Rinker Sep 12 '13 at 06:57
  • @TylerRinker I provided a comment on the qdap method in your answer below on why openNLP is probably a better way to go – christopherlovell Jan 09 '15 at 16:56
  • @polyphant "better way to go" is highly subjective, as it depends on your data, assumptions made, and the intended analysis. In this specific case the OP wanted to pass `sentSplit` to a `Corpus`. Also, note that the dev version of qdap (v. 2.2.1) @ GitHub contains `sent_detect_nlp` to allow flexibility, as it uses the method described above. This allows `tm_map(current.corpus, sent_detect_nlp)`. See commit: https://github.com/trinker/qdap/commit/d54fbd95f0caee68a3845353a06fa324d7fdf42e – Tyler Rinker Jan 09 '15 at 17:18
  • @TylerRinker Granted I worded that a bit wooly. And thanks for the heads up on dev qdap, which looks like it fixes this issue. I still think it's hard to recommend the release version as a sentence parser compared to openNLP. And you may be a bit subjective since you appear to be the maintainer for qdap ;) it's a great package that I use a lot, so thanks! – christopherlovell Jan 09 '15 at 18:19

Just convert your corpus into a dataframe and use regular expressions to detect the sentences.

Here is a function that uses regular expressions to detect sentences in a paragraph and returns each individual sentence.

chunk_into_sentences <- function(text) {
  # Candidate sentence boundaries: an alphanumeric character or space
  # followed by ., ! or ?
  break_points <- c(1, as.numeric(gregexpr('[[:alnum:] ][.!?]', text)[[1]]) + 1)
  sentences <- NULL
  for (i in 1:length(break_points)) {
    res <- substr(text, break_points[i], break_points[i + 1])
    if (i > 1) { sentences[i] <- sub('. ', '', res) } else { sentences[i] <- res }
  }
  # Drop the trailing NA produced by the final out-of-range break point
  sentences <- sentences[!is.na(sentences)]
  return(sentences)
}

Here it is used on one paragraph inside a corpus from the tm package, converted to a dataframe:

text <- paste('Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.')
mycorpus <- VCorpus(VectorSource(text))
corpus_frame <- data.frame(text=unlist(sapply(mycorpus, `[`, "content")), stringsAsFactors=F)

Use as follows:

chunk_into_sentences(corpus_frame$text)

Which gives us:

[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."                                                                                                                                     
[2] "Lorem Ipsum has been the industry standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                       
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."                                                                                       
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

Now with a larger corpus:

text1 <- "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."
text2 <- "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like)."
text3 <- "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."
text_list <- list(text1, text2, text3)
my_big_corpus <- VCorpus(VectorSource(text_list))

Use as follows:

lapply(my_big_corpus, chunk_into_sentences)

Which gives us:

$`1`
[1] "Lorem Ipsum is simply dummy text of the printing and typesetting industry."                                                                                                                                     
[2] "Lorem Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book."                                      
[3] "It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged."                                                                                       
[4] "It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum."

$`2`
[1] "It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout."                                                             
[2] "The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English."     
[3] "Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy."

$`3`
[1] "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don't look even slightly believable."
[2] "If you are going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the middle of text."                                                                     
[3] "All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, making this the first true generator on the Internet."                                                       
[4] "It uses a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable."                                                       
[5] "The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc." 
– Cybernetic

This is a function built off this Python solution that allows some flexibility in that the lists of prefixes, suffixes, etc. can be modified to your specific text. It's definitely not perfect, but could be useful with the right text.

# Regex building blocks: patterns that flag periods which usually do
# NOT end a sentence
caps = "([A-Z])"
prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|Mt)\\."
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
starters = "(Mr|Mrs|Ms|Dr|He\\s|She\\s|It\\s|They\\s|Their\\s|Our\\s|We\\s|But\\s|However\\s|That\\s|This\\s|Wherever)"
websites = "\\.(com|edu|gov|io|me|net|org)"
digits = "([0-9])"

split_into_sentences <- function(text){
  # Protect periods that do not end a sentence (abbreviations, acronyms,
  # websites, decimals) by replacing them with a <prd> placeholder
  text = gsub("\n|\r\n"," ", text)
  text = gsub(prefixes, "\\1<prd>", text)
  text = gsub(websites, "<prd>\\1", text)
  text = gsub('www\\.', "www<prd>", text)
  text = gsub("Ph.D.","Ph<prd>D<prd>", text)
  text = gsub(paste0("\\s", caps, "\\. "), " \\1<prd> ", text)
  text = gsub(paste0(acronyms, " ", starters), "\\1<stop> \\2", text)
  text = gsub(paste0(caps, "\\.", caps, "\\.", caps, "\\."), "\\1<prd>\\2<prd>\\3<prd>", text)
  text = gsub(paste0(caps, "\\.", caps, "\\."), "\\1<prd>\\2<prd>", text)
  text = gsub(paste0(" ", suffixes, "\\. ", starters), " \\1<stop> \\2", text)
  text = gsub(paste0(" ", suffixes, "\\."), " \\1<prd>", text)
  text = gsub(paste0(" ", caps, "\\."), " \\1<prd>",text)
  text = gsub(paste0(digits, "\\.", digits), "\\1<prd>\\2", text)
  text = gsub("...", "<prd><prd><prd>", text, fixed = TRUE)
  # Move terminal punctuation outside closing quotation marks
  text = gsub('\\.”', '”.', text)
  text = gsub('\\."', '\".', text)
  text = gsub('\\!"', '"!', text)
  text = gsub('\\?"', '"?', text)
  # Mark true sentence terminators, restore the protected periods,
  # and split on the markers
  text = gsub('\\.', '.<stop>', text)
  text = gsub('\\?', '?<stop>', text)
  text = gsub('\\!', '!<stop>', text)
  text = gsub('<prd>', '.', text)
  sentence = strsplit(text, "<stop>\\s*")
  return(sentence)
}

test_text <- 'Dr. John Johnson, Ph.D. worked for X.Y.Z. Inc. for 4.5 years. He earned $2.5 million when it sold! Now he works at www.website.com.'
sentences <- split_into_sentences(test_text)
names(sentences) <- 'sentence'
df_sentences <- dplyr::bind_rows(sentences) 

df_sentences
# A tibble: 3 x 1
sentence                                                     
<chr>                                                        
1 Dr. John Johnson, Ph.D. worked for X.Y.Z. Inc. for 4.5 years.
2 He earned $2.5 million when it sold!                         
3 Now he works at www.website.com.  
– sbha

With qdap version 1.1.0 you can accomplish this with the following (I used @Tony Breyal's current.corpus dataset):

library(qdap)
with(sentSplit(tm_corpus2df(current.corpus), "text"), df2tm_corpus(tot, text))
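
Reading that pipeline from the inside out: tm_corpus2df converts the corpus to a dataframe, sentSplit splits the text variable into sentences (adding a turn-of-talk index, tot), and df2tm_corpus rebuilds a tm corpus from those two columns.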

You could also do:

tm_map(current.corpus, sent_detect)


## inspect(tm_map(current.corpus, sent_detect))

## A corpus with 3 text documents
## 
## The metadata consists of 2 tag-value pairs and a data frame
## Available tags are:
##   create_date creator 
## Available variables in the data frame are:
##   MetaID 
## 
## $doc1
## [1] Doctor Who is a British science fiction television programme produced by the BBC.                                                                     
## [2] The programme depicts the adventures of a Time Lord—a time travelling, humanoid alien known as the Doctor.                                            
## [3] He explores the universe in his TARDIS, a sentient time-travelling space ship.                                                                        
## [4] Its exterior appears as a blue British police box, a common sight in Britain in 1963, when the series first aired.                                    
## [5] Along with a succession of companions, the Doctor faces a variety of foes while working to save civilisations, help ordinary people, and right wrongs.
## 
## $doc2
## [1] The show has received recognition from critics and the public as one of the finest British television programmes, winning the 2006 British Academy Television Award for Best Drama Series and five consecutive awards at the National Television Awards during Russell T Davies's tenure as Executive Producer.
## [2] In 2011, Matt Smith became the first Doctor to be nominated for a BAFTA Television Award for Best Actor.                                                                                                                                                                                                       
## [3] In 2013, the Peabody Awards honoured Doctor Who with an Institutional Peabody for evolving with technology and the times like nothing else in the known television universe.                                                                                                                                   
## 
## $doc3
## [1] The programme is listed in Guinness World Records as the longest-running science fiction television show in the world and as the most successful science fiction series of all time—based on its over-all broadcast ratings, DVD and book sales, and iTunes traffic.
## [2] During its original run, it was recognised for its imaginative stories, creative low-budget special effects, and pioneering use of electronic music.
– Tyler Rinker
  • Unfortunately the `sent_detect` method picks up periods between numbers, whereas the openNLP `Maxent_Sent_Token_Annotator` identifies these and treats them as commas before running the sentence identifier, leading to more robust sentence identification – christopherlovell Jan 09 '15 at 16:54
  • The dev version of qdap (v. 2.2.1) @ GitHub contains `sent_detect_nlp` to allow flexibility, as it uses the method from the **NLP** package. This allows `tm_map(current.corpus, sent_detect_nlp)`. See commit: https://github.com/trinker/qdap/commit/d54fbd95f0caee68a3845353a06fa324d7fdf42e – Tyler Rinker Jan 09 '15 at 17:19

I implemented the following code to solve the same problem using the tokenizers package.

# Iterate a list or vector of strings and split into sentences where there are
# periods or question marks
# (textList is assumed to hold the document texts, e.g.
#  textList <- lapply(current.corpus, as.character))
sentences = purrr::map(.x = textList, function(x) {
  return(tokenizers::tokenize_sentences(x))
})

# The code above will return a list of character vectors so unlist
# to give you a character vector of all the sentences
sentences = unlist(sentences)

# Create a corpus from the sentences
corpus = VCorpus(VectorSource(sentences))
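
For instance, a self-contained sketch (the two input strings are placeholders; any list or character vector of document texts will do):

library(tm)

# Placeholder input texts
textList <- list(
  "First document. It has two sentences.",
  "Second document? Yes. Three sentences here."
)

sentences <- unlist(purrr::map(textList, tokenizers::tokenize_sentences))
corpus <- VCorpus(VectorSource(sentences))
# inspect(corpus) should now show five sentence-level documents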
– Justin Phillips

The error is likely connected with the ggplot2 package: both ggplot2 and NLP provide an annotate function, and loading ggplot2 after openNLP masks the one the sentence annotator needs. Detach the ggplot2 package and then try again; hopefully it will work.
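
Alternatively, if you want to keep ggplot2 loaded, calling the NLP version explicitly should avoid the masking (a minimal sketch, reusing the names from the accepted answer above):

# Qualify the call so that ggplot2::annotate cannot mask it
sentence.boundaries <- NLP::annotate(text, sentence_token_annotator)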