Extracting sentences using scan() in R

Question

I've been told that I shouldn't use R to scan text (but I have been doing so, anyway, pending the acquisition of other skills) and encountered a problem that has confused me sufficiently to retreat to these fora. Thanks for any help, in advance.

I'm trying to store a large amount of text (e.g., a short story) as a vector of strings, each of which is a separate sentence. I've been doing this using the scan() function, but I am encountering two basic problems: (1) scan() only seems to allow a single separating character, whereas sentences can obviously end in multiple ways. I know how to mark the end of a sentence using regex (e.g. [!?\.]), but I don't know of a function in R that uses regular expressions to split text. (2) scan() seems to automatically regard a new line as a new field, whereas I want it to ignore new lines unless they coincide with the end of a sentence.

download.file("http://www.textfiles.com/stories/3lpigs.txt","threelittlepigs.txt")
threelittlepigs_s<-scan("threelittlepigs.txt",character(0),
                    sep=".",quote=NULL)

If I don't include the 'quote=NULL' option, scan() throws the warning that an EOF (end of field, I'm guessing) falls within a quoted string. This produces a handful of multi-line elements/fields, but pretty erratically. I can't seem to discern a pattern.
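A minimal reproduction of point (2), using a made-up two-line string instead of the story file (the `text =` argument and the sample string are assumptions added purely for illustration):

```r
# Toy input: the first sentence ends at the period, and the second
# sentence wraps across a line break.
toy <- "One pig. Two\npigs here"

# Same arguments as in the call above, but reading from a string via `text =`.
fields <- scan(text = toy, what = character(0), sep = ".", quote = "")

# scan() ends a field at every newline as well as at every sep,
# so the wrapped sentence comes back as two separate elements.
print(fields)
length(fields)
```

Here the wrapped sentence "Two pigs here" is returned in two pieces even though there is no period at the line break.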

Sorry if this has been asked before. I'm sure there's an easy solution. I would prefer one that helps me make sense of why scan() isn't working the way I would expect, but if there are better tools to read text in R, please do let me know.

This is a near duplicate of http://stackoverflow.com/q/18712878/602276. See also http://stackoverflow.com/questions/12602652/how-to-count-the-number-of-sentences-in-a-text-in-r — Andrie, Feb 17 '15 at 08:43

score 3 · Accepted Answer · answered Feb 17 '15 at 08:56

R has strong text-processing capabilities, with many useful packages such as tm, rvest, stringi and others.

But here is a simple example of doing this almost completely in base R. I only use the %>% pipe from magrittr because I think this makes the code a bit more readable.

The specific answer to your question is that you can use regular expressions to split on multiple punctuation marks. In the example below I use "[\\.?!] ", meaning a period, question mark or exclamation mark, followed by a space. You may have to experiment.
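To see what that pattern does in isolation, here is a small sketch on a made-up string (the sample text is an assumption for illustration, not taken from the story):

```r
# Made-up sample containing all three sentence terminators.
sample_text <- "The wolf huffed. Did the house fall? No! It stood"

# Split wherever a period, question mark or exclamation mark
# is followed by a single space. The matched characters are dropped.
parts <- strsplit(sample_text, split = "[\\.?!] ")[[1]]
print(parts)
```

Note that the terminator is consumed by the split, so the resulting pieces lose their closing punctuation.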

Try this:

library("magrittr")
url <- "http://www.textfiles.com/stories/3lpigs.txt"

# Read the file line by line, then join the lines into one long string
corpus <- url %>% 
  readLines() %>% 
  paste(collapse = " ")

# Peek at the start of the text (corpus is a single string)
substr(corpus, 1, 200)

z <- corpus %>% 
  gsub(" +", " ", .) %>% 
  strsplit(split = "[\\.?!] ")

z[[1]]

The results:

 z[[1]]
 [1] " THE THREE LITTLE PIGS Once upon a time "                                                                                                                                                                                                       
 [2] ""                                                                                                                                                                                                                                               
 [3] ""                                                                                                                                                                                                                                               
 [4] "there were three little pigs, who left their mummy and daddy to see the world"                                                                                                                                                                  
 [5] "All summer long, they roamed through the woods and over the plains,playing games and having fun"                                                                                                                                                
 [6] "None were happier than the three little pigs, and they easily made friends with everyone"                                                                                                                                                       
 [7] "Wherever they went, they were given a warm welcome, but as summer drew to a close, they realized that folk were drifting back to their usual jobs, and preparing for winter"                                                                    
 [8] "Autumn came and it began to rain"                                                                                                                                                                                                               
 [9] "The three little pigs started to feel they needed a real home"                                                                                                                                                                                  
[10] "Sadly they knew that the fun was over now and they must set to work like the others, or they'd be left in the cold and rain, with no roof over their heads"                  

...etc
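As the output shows, the split drops the punctuation and leaves empty strings where the text has a spaced-out ellipsis (". . ."). One possible refinement, sketched here on a made-up string, is a lookbehind (which needs perl = TRUE) so that only the whitespace is consumed, followed by a filter for fragments that hold no letters or digits. The variable names are hypothetical:

```r
# Made-up sample, including a spaced-out ellipsis like the one in the story.
sample_text <- "Once upon a time . . . there were three pigs. Were they happy? Yes!"

# Split on whitespace that follows a terminator; the lookbehind keeps
# the terminator attached to its sentence.
pieces <- strsplit(sample_text, split = "(?<=[.?!])\\s+", perl = TRUE)[[1]]

# Drop fragments that contain no letters or digits (e.g. a lone ".").
sentences <- pieces[grepl("[[:alnum:]]", pieces)]
print(sentences)
```

This still cuts the opening sentence at the ellipsis, and abbreviations such as "Mr." would also be split, so again you may have to experiment.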
